intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
@ 2020-03-10 21:41 Francisco Jerez
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit Francisco Jerez
                   ` (14 more replies)
  0 siblings, 15 replies; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:41 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

This is my second take on improving the energy efficiency of the
intel_pstate driver under IO-bound conditions.  The problem and the
approach to solving it are, at a high level, roughly the same as in my
previous series [1]:

In IO-bound scenarios, by definition, the throughput of the system
doesn't improve with increasing CPU frequency beyond the threshold at
which the IO device becomes the bottleneck.  However, with the current
governors (whether HWP is in use or not) the CPU frequency tends to
oscillate with the load, often with an amplitude far into the turbo
range, leading to severely reduced energy efficiency.  This is
particularly problematic when a limited TDP budget is shared among a
number of cores running a multithreaded workload, or between a CPU
core and an integrated GPU.

Improving the energy efficiency of the CPU improves the throughput of
the system in such TDP-limited conditions.  See [4] for some
preliminary benchmark results from a Razer Blade Stealth 13 Late
2019/LY320 laptop with an Intel ICL processor and integrated graphics,
including throughput improvements of up to ~15% and
performance-per-watt improvements of up to ~43% (estimated via RAPL).
The throughput results in particular may vary substantially from one
platform to another depending on the TDP budget and the balance of
load between the CPU and GPU.

One of the main differences relative to my previous version is that
the trade-off between energy efficiency and frequency ramp-up latency
is now exposed to device drivers through a new PM QoS class [it would
make sense to expose it to userspace too eventually, but that's beyond
the scope of this series].  The new PM QoS class provides a latency
target to CPUFREQ governors, giving them permission to filter out CPU
frequency oscillations with a period significantly shorter than the
specified target whenever doing so leads to improved energy
efficiency.
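
To illustrate, here is a rough sketch (hypothetical driver, not part
of this series) of how a driver would use the new PM QoS class
introduced in PATCH 01.  The request value is a response frequency in
Hz, i.e. the reciprocal of the ramp-up latency the driver can
tolerate, and PM_QOS_DEFAULT_VALUE restores the best-effort default.
The i915 changes in PATCH 02 follow essentially this pattern:

/* Hypothetical example, for illustration only. */
static struct pm_qos_request foo_rf_qos;

static void foo_driver_init(void)
{
	/* Start out without a constraint (best effort). */
	cpu_response_frequency_qos_add_request(&foo_rf_qos,
					       PM_QOS_DEFAULT_VALUE);
}

static void foo_device_becomes_bottleneck(void)
{
	/*
	 * The device can tolerate a CPU ramp-up latency of ~500 ms,
	 * so allow the CPUFREQ governor to filter out CPU frequency
	 * oscillations faster than 2 Hz.
	 */
	cpu_response_frequency_qos_update_request(&foo_rf_qos, 2);
}

static void foo_device_goes_idle(void)
{
	cpu_response_frequency_qos_update_request(&foo_rf_qos,
						  PM_QOS_DEFAULT_VALUE);
}

static void foo_driver_fini(void)
{
	cpu_response_frequency_qos_remove_request(&foo_rf_qos);
}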

This series takes advantage of the new PM QoS class from the i915
driver whenever the driver determines that the GPU has become a
bottleneck for an extended period of time.  At that point it places a
PM QoS ramp-up latency target, which causes CPUFREQ to limit the CPU
to a reasonably energy-efficient frequency able to achieve at least
the required amount of work in a time window approximately equal to
the ramp-up latency target (any longer-term energy efficiency
optimization would potentially violate the latency target).  This
seems more effective than clamping the CPU frequency to a fixed value
directly from various subsystems: because the CPU is a shared
resource, the frequency bound needs to consider the load and latency
requirements of all independent workloads running on the same CPU core
in order to avoid performance degradation in a multitasking, possibly
virtualized environment.

The main limitation of this PM QoS approach is that whenever multiple
clients request different ramp-up latency targets, only the strictest
(lowest-latency) one will apply system-wide, potentially leading to
suboptimal energy efficiency for the less latency-sensitive clients
(though it won't artificially limit the CPU throughput of the most
latency-sensitive clients as a result of the PM QoS requests placed by
less latency-sensitive ones).  In order to address this limitation I'm
working on a more elaborate solution which integrates with the task
scheduler in order to provide response latency control with process
granularity (pretty much in the spirit of PELT).  One of the
alternatives Rafael and I were discussing was to expose that through a
third cgroup clamp on top of the MIN and MAX utilization clamps, but
I'm open to other possibilities regarding what the interface should
look like.  Either way, the current (scheduling-unaware) PM QoS-based
interface should provide most of the benefit except in heavily
multitasking environments.

A branch with this series in testable form can be found here [2],
based on linux-next from a few days ago.  Another important difference
with respect to my previous revision is that the present one targets
HWP systems (though for the moment it's only enabled by default on
ICL; that can be overridden through the kernel command line).  I have
WIP code that uses the same governor in order to provide a similar
benefit on non-HWP systems (like my previous revision did), which can
be found in this branch for reference [3] -- I'm planning to finish
that up and send it as a follow-up to this series, assuming people are
happy with the overall approach.

Thanks in advance for any review feedback and test reports.

[PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
[PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
[PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
[PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
[PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation.
[PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation.
[PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
[PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
[PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
[PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.

[1] https://marc.info/?l=linux-pm&m=152221943320908&w=2
[2] https://github.com/curro/linux/commits/intel_pstate-vlp-v2-hwp-only
[3] https://github.com/curro/linux/commits/intel_pstate-vlp-v2
[4] http://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL.log


* [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
@ 2020-03-10 21:41 ` Francisco Jerez
  2020-03-11 12:42   ` Peter Zijlstra
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load Francisco Jerez
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:41 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

The purpose of this PM QoS limit is to give device drivers additional
control over the latency/energy efficiency trade-off made by the PM
subsystem (particularly the CPUFREQ governor).  It allows device
drivers to set a lower bound on the response latency of PM, defined as
the time it takes from wake-up until the CPU reaches a certain
steady-state level of performance (e.g. the nominal frequency) in
response to a step-function load.  It reports to PM the minimum
ramp-up latency considered useful to the application, and explicitly
requests PM to filter out oscillations faster than the specified
frequency.  It is somewhat complementary to the current
CPU_DMA_LATENCY PM QoS class, which can be understood as specifying an
upper bound on the CPU wake-up latency, rather than a lower bound on
the CPU frequency ramp-up time.

Note that even though this provides a latency constraint, it's
represented as its reciprocal in Hz units for computational efficiency
(it would take a 64-bit division to compute the number of cycles
elapsed from a time increment in nanoseconds and a time bound, while a
frequency can simply be multiplied by the time increment).
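
For illustration, a hypothetical helper (not part of this patch)
showing the arithmetic this representation enables; the division by
NSEC_PER_SEC is approximated by a power-of-two shift, as the
intel_pstate changes later in the series do:

/*
 * Number of response periods elapsed during a time increment.  With
 * the constraint expressed as a latency in ns this would take a
 * 64-bit division per sample (delta_ns / latency_ns); with it
 * expressed as a frequency in Hz it reduces to a multiply and a
 * shift, at the cost of approximating NSEC_PER_SEC by the next power
 * of two (order_base_2() is from <linux/log2.h>).
 */
static inline u64 response_periods_elapsed(u32 response_hz, u64 delta_ns)
{
	const unsigned int ns_per_s_shift = order_base_2(NSEC_PER_SEC);

	return ((u64)response_hz * delta_ns) >> ns_per_s_shift;
}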

This implements a MAX constraint so that the strictest (highest
response frequency) request is honored.  This means that PM won't
provide any guarantee that frequencies greater than the specified
bound will be filtered out, since that might be incompatible with the
constraints specified by another, more latency-sensitive application
(a more fine-grained result could be achieved with a scheduling-based
interface).  The default value needs to be zero (best effort) for it
to behave as the identity of the MAX operation.
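
As a quick illustration of the aggregation semantics (hypothetical
requesters, not part of this patch):

/*
 * With both requests below active, cpu_response_frequency_qos_limit()
 * returns 30 Hz, the strictest (highest response frequency) of the
 * two, so the 2 Hz requester cannot relax the constraint placed by
 * the more latency-sensitive one.  Once both requests are removed,
 * the limit returns to the default of 0 (best effort), the identity
 * of the MAX operation.
 */
static struct pm_qos_request req_a, req_b;

static void example(void)
{
	cpu_response_frequency_qos_add_request(&req_a, 30);
	cpu_response_frequency_qos_add_request(&req_b, 2);

	WARN_ON(cpu_response_frequency_qos_limit() != 30);

	cpu_response_frequency_qos_remove_request(&req_b);
	cpu_response_frequency_qos_remove_request(&req_a);
}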

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 include/linux/pm_qos.h       |   9 +++
 include/trace/events/power.h |  33 ++++----
 kernel/power/qos.c           | 141 ++++++++++++++++++++++++++++++++++-
 3 files changed, 165 insertions(+), 18 deletions(-)

diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 4a69d4af3ff8..b522e2194c05 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -28,6 +28,7 @@ enum pm_qos_flags_status {
 #define PM_QOS_LATENCY_ANY_NS	((s64)PM_QOS_LATENCY_ANY * NSEC_PER_USEC)
 
 #define PM_QOS_CPU_LATENCY_DEFAULT_VALUE	(2000 * USEC_PER_SEC)
+#define PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE 0
 #define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS	PM_QOS_LATENCY_ANY_NS
@@ -162,6 +163,14 @@ static inline void cpu_latency_qos_update_request(struct pm_qos_request *req,
 static inline void cpu_latency_qos_remove_request(struct pm_qos_request *req) {}
 #endif
 
+s32 cpu_response_frequency_qos_limit(void);
+bool cpu_response_frequency_qos_request_active(struct pm_qos_request *req);
+void cpu_response_frequency_qos_add_request(struct pm_qos_request *req,
+					    s32 value);
+void cpu_response_frequency_qos_update_request(struct pm_qos_request *req,
+					       s32 new_value);
+void cpu_response_frequency_qos_remove_request(struct pm_qos_request *req);
+
 #ifdef CONFIG_PM
 enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask);
 enum pm_qos_flags_status dev_pm_qos_flags(struct device *dev, s32 mask);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index af5018aa9517..7e4b52e8ca3a 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -359,45 +359,48 @@ DEFINE_EVENT(power_domain, power_domain_target,
 );
 
 /*
- * CPU latency QoS events used for global CPU latency QoS list updates
+ * CPU latency/response frequency QoS events used for global CPU PM
+ * QoS list updates.
  */
-DECLARE_EVENT_CLASS(cpu_latency_qos_request,
+DECLARE_EVENT_CLASS(pm_qos_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value),
+	TP_ARGS(name, value),
 
 	TP_STRUCT__entry(
+		__string(name,			 name		)
 		__field( s32,                    value          )
 	),
 
 	TP_fast_assign(
+		__assign_str(name, name);
 		__entry->value = value;
 	),
 
-	TP_printk("CPU_DMA_LATENCY value=%d",
-		  __entry->value)
+	TP_printk("pm_qos_class=%s value=%d",
+		  __get_str(name), __entry->value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_add_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_add_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_update_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_update_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_remove_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_remove_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
 /*
diff --git a/kernel/power/qos.c b/kernel/power/qos.c
index 32927682bcc4..018491fecaac 100644
--- a/kernel/power/qos.c
+++ b/kernel/power/qos.c
@@ -271,7 +271,7 @@ void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value)
 		return;
 	}
 
-	trace_pm_qos_add_request(value);
+	trace_pm_qos_add_request("CPU_DMA_LATENCY", value);
 
 	req->qos = &cpu_latency_constraints;
 	cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value);
@@ -297,7 +297,7 @@ void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value)
 		return;
 	}
 
-	trace_pm_qos_update_request(new_value);
+	trace_pm_qos_update_request("CPU_DMA_LATENCY", new_value);
 
 	if (new_value == req->node.prio)
 		return;
@@ -323,7 +323,7 @@ void cpu_latency_qos_remove_request(struct pm_qos_request *req)
 		return;
 	}
 
-	trace_pm_qos_remove_request(PM_QOS_DEFAULT_VALUE);
+	trace_pm_qos_remove_request("CPU_DMA_LATENCY", PM_QOS_DEFAULT_VALUE);
 
 	cpu_latency_qos_apply(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
 	memset(req, 0, sizeof(*req));
@@ -424,6 +424,141 @@ static int __init cpu_latency_qos_init(void)
 late_initcall(cpu_latency_qos_init);
 #endif /* CONFIG_CPU_IDLE */
 
+/* Definitions related to the CPU response frequency QoS. */
+
+static struct pm_qos_constraints cpu_response_frequency_constraints = {
+	.list = PLIST_HEAD_INIT(cpu_response_frequency_constraints.list),
+	.target_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
+	.default_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
+	.no_constraint_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
+	.type = PM_QOS_MAX,
+};
+
+/**
+ * cpu_response_frequency_qos_limit - Return current system-wide CPU
+ *				      response frequency QoS limit.
+ */
+s32 cpu_response_frequency_qos_limit(void)
+{
+	return pm_qos_read_value(&cpu_response_frequency_constraints);
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_limit);
+
+/**
+ * cpu_response_frequency_qos_request_active - Check the given PM QoS request.
+ * @req: PM QoS request to check.
+ *
+ * Return: 'true' if @req has been added to the CPU response frequency
+ * QoS list, 'false' otherwise.
+ */
+bool cpu_response_frequency_qos_request_active(struct pm_qos_request *req)
+{
+	return req->qos == &cpu_response_frequency_constraints;
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_request_active);
+
+static void cpu_response_frequency_qos_apply(struct pm_qos_request *req,
+					     enum pm_qos_req_action action,
+					     s32 value)
+{
+	int ret = pm_qos_update_target(req->qos, &req->node, action, value);
+
+	if (ret > 0)
+		wake_up_all_idle_cpus();
+}
+
+/**
+ * cpu_response_frequency_qos_add_request - Add new CPU response
+ *					    frequency QoS request.
+ * @req: Pointer to a preallocated handle.
+ * @value: Requested constraint value.
+ *
+ * Use @value to initialize the request handle pointed to by @req,
+ * insert it as a new entry to the CPU response frequency QoS list and
+ * recompute the effective QoS constraint for that list.
+ *
+ * Callers need to save the handle for later use in updates and removal of the
+ * QoS request represented by it.
+ */
+void cpu_response_frequency_qos_add_request(struct pm_qos_request *req,
+					    s32 value)
+{
+	if (!req)
+		return;
+
+	if (cpu_response_frequency_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for already added request\n",
+		     __func__);
+		return;
+	}
+
+	trace_pm_qos_add_request("CPU_RESPONSE_FREQUENCY", value);
+
+	req->qos = &cpu_response_frequency_constraints;
+	cpu_response_frequency_qos_apply(req, PM_QOS_ADD_REQ, value);
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_add_request);
+
+/**
+ * cpu_response_frequency_qos_update_request - Modify existing CPU
+ *					       response frequency QoS
+ *					       request.
+ * @req : QoS request to update.
+ * @new_value: New requested constraint value.
+ *
+ * Use @new_value to update the QoS request represented by @req in the
+ * CPU response frequency QoS list along with updating the effective
+ * constraint value for that list.
+ */
+void cpu_response_frequency_qos_update_request(struct pm_qos_request *req,
+					       s32 new_value)
+{
+	if (!req)
+		return;
+
+	if (!cpu_response_frequency_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_update_request("CPU_RESPONSE_FREQUENCY", new_value);
+
+	if (new_value == req->node.prio)
+		return;
+
+	cpu_response_frequency_qos_apply(req, PM_QOS_UPDATE_REQ, new_value);
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_update_request);
+
+/**
+ * cpu_response_frequency_qos_remove_request - Remove existing CPU
+ *					       response frequency QoS
+ *					       request.
+ * @req: QoS request to remove.
+ *
+ * Remove the CPU response frequency QoS request represented by @req
+ * from the CPU response frequency QoS list along with updating the
+ * effective constraint value for that list.
+ */
+void cpu_response_frequency_qos_remove_request(struct pm_qos_request *req)
+{
+	if (!req)
+		return;
+
+	if (!cpu_response_frequency_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_remove_request("CPU_RESPONSE_FREQUENCY",
+				    PM_QOS_DEFAULT_VALUE);
+
+	cpu_response_frequency_qos_apply(req, PM_QOS_REMOVE_REQ,
+					 PM_QOS_DEFAULT_VALUE);
+	memset(req, 0, sizeof(*req));
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_remove_request);
+
 /* Definitions related to the frequency QoS below. */
 
 /**
-- 
2.22.1


* [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit Francisco Jerez
@ 2020-03-10 21:41 ` Francisco Jerez
  2020-03-10 22:26   ` Chris Wilson
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs Francisco Jerez
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:41 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

This allows CPUFREQ governors to realize when the system becomes
non-CPU-bound due to GPU rendering activity, causing them to respond
more conservatively to the workload by limiting their response
frequency: CPU energy usage will be reduced when there isn't a good
chance for system performance to scale with CPU frequency due to the
GPU bottleneck.  This leaves additional TDP budget available for the
GPU to reach higher frequencies, which translates into an improvement
in graphics performance to the extent that the workload remains
TDP-limited (most non-trivial graphics benchmarks improve
significantly on the TDP-constrained platforms where this is currently
enabled; see the cover letter for some numbers).  If the workload
isn't (or is no longer) TDP-limited, performance should stay roughly
constant, but energy usage will be divided by a similar factor.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c    |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h |   7 ++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c        | 107 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_pm.h        |   3 +
 drivers/gpu/drm/i915/gt/intel_gt_types.h     |  12 +++
 drivers/gpu/drm/i915/gt/intel_lrc.c          |  14 +++
 6 files changed, 144 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 53ac3f00909a..16ebdfa1dfc9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -504,6 +504,7 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine)
 
 	execlists->queue_priority_hint = INT_MIN;
 	execlists->queue = RB_ROOT_CACHED;
+	atomic_set(&execlists->overload, 0);
 }
 
 static void cleanup_status_page(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 80cdde712842..1b17b2f0c7a3 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -266,6 +266,13 @@ struct intel_engine_execlists {
 	 */
 	u8 csb_head;
 
+	/**
+	 * @overload: whether at least two execlist ports are
+	 * currently submitted to the hardware, indicating that CPU
+	 * latency isn't critical in order to maintain the GPU busy.
+	 */
+	atomic_t overload;
+
 	I915_SELFTEST_DECLARE(struct st_preempt_hang preempt_hang;)
 };
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 8b653c0f5e5f..f1f859e89a8f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -107,6 +107,102 @@ void intel_gt_pm_init_early(struct intel_gt *gt)
 	intel_wakeref_init(&gt->wakeref, gt->uncore->rpm, &wf_ops);
 }
 
+/**
+ * Time increment until the most immediate PM QoS response frequency
+ * update.
+ *
+ * May be in the future (return value > 0) if the GPU is currently
+ * active but we haven't updated the PM QoS request to reflect a
+ * bottleneck yet.  May be in the past (return value < 0) if the GPU
+ * isn't fully utilized and we've already reset the PM QoS request to
+ * the default value.  May be zero if a PM QoS request update is due.
+ *
+ * The time increment returned by this function decreases linearly
+ * with time until it reaches either zero or a configurable limit.
+ */
+static int32_t time_to_rf_qos_update_ns(struct intel_gt *gt)
+{
+	const uint64_t t1 = ktime_get_ns();
+	const uint64_t dt1 = gt->rf_qos.delay_max_ns;
+
+	if (atomic_read_acquire(&gt->rf_qos.active_count)) {
+		const uint64_t t0 = atomic64_read(&gt->rf_qos.time_set_ns);
+
+		return min(dt1, t0 <= t1 ? 0 : t0 - t1);
+	} else {
+		const uint64_t t0 = atomic64_read(&gt->rf_qos.time_clear_ns);
+		const unsigned int shift = gt->rf_qos.delay_slope_shift;
+
+		return -(int32_t)(t1 <= t0 ? 1 :
+				  min(dt1, (t1 - t0) << shift));
+	}
+}
+
+/**
+ * Perform a delayed PM QoS response frequency update.
+ */
+static void intel_gt_rf_qos_update(struct intel_gt *gt)
+{
+	const uint32_t dt = max(0, time_to_rf_qos_update_ns(gt));
+
+	timer_reduce(&gt->rf_qos.timer, jiffies + nsecs_to_jiffies(dt));
+}
+
+/**
+ * Timer that fires once the delay used to switch the PM QoS response
+ * frequency request has elapsed.
+ */
+static void intel_gt_rf_qos_timeout(struct timer_list *timer)
+{
+	struct intel_gt *gt = container_of(timer, struct intel_gt,
+					   rf_qos.timer);
+	const int32_t dt = time_to_rf_qos_update_ns(gt);
+
+	if (dt == 0)
+		cpu_response_frequency_qos_update_request(
+			&gt->rf_qos.req, gt->rf_qos.target_hz);
+	else
+		cpu_response_frequency_qos_update_request(
+			&gt->rf_qos.req, PM_QOS_DEFAULT_VALUE);
+
+	if (dt > 0)
+		intel_gt_rf_qos_update(gt);
+}
+
+/**
+ * Report the beginning of a period of GPU utilization to PM.
+ *
+ * May trigger a more energy-efficient response mode in CPU PM, but
+ * only after a certain delay has elapsed so we don't have a negative
+ * impact on the CPU ramp-up latency except after the GPU has been
+ * continuously utilized for a long enough period of time.
+ */
+void intel_gt_pm_active_begin(struct intel_gt *gt)
+{
+	const uint32_t dt = abs(time_to_rf_qos_update_ns(gt));
+
+	atomic64_set(&gt->rf_qos.time_set_ns, ktime_get_ns() + dt);
+
+	if (!atomic_fetch_inc_release(&gt->rf_qos.active_count))
+		intel_gt_rf_qos_update(gt);
+}
+
+/**
+ * Report the end of a period of GPU utilization to PM.
+ *
+ * Must be called once after each call to intel_gt_pm_active_begin().
+ */
+void intel_gt_pm_active_end(struct intel_gt *gt)
+{
+	const uint32_t dt = abs(time_to_rf_qos_update_ns(gt));
+	const unsigned int shift = gt->rf_qos.delay_slope_shift;
+
+	atomic64_set(&gt->rf_qos.time_clear_ns, ktime_get_ns() - (dt >> shift));
+
+	if (!atomic_dec_return_release(&gt->rf_qos.active_count))
+		intel_gt_rf_qos_update(gt);
+}
+
 void intel_gt_pm_init(struct intel_gt *gt)
 {
 	/*
@@ -116,6 +212,14 @@ void intel_gt_pm_init(struct intel_gt *gt)
 	 */
 	intel_rc6_init(&gt->rc6);
 	intel_rps_init(&gt->rps);
+
+	cpu_response_frequency_qos_add_request(&gt->rf_qos.req,
+					       PM_QOS_DEFAULT_VALUE);
+
+	gt->rf_qos.delay_max_ns = 250000;
+	gt->rf_qos.delay_slope_shift = 0;
+	gt->rf_qos.target_hz = 2;
+	timer_setup(&gt->rf_qos.timer, intel_gt_rf_qos_timeout, 0);
 }
 
 static bool reset_engines(struct intel_gt *gt)
@@ -170,6 +274,9 @@ static void gt_sanitize(struct intel_gt *gt, bool force)
 
 void intel_gt_pm_fini(struct intel_gt *gt)
 {
+	del_timer_sync(&gt->rf_qos.timer);
+	cpu_response_frequency_qos_remove_request(&gt->rf_qos.req);
+
 	intel_rc6_fini(&gt->rc6);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 60f0e2fbe55c..43f1d45fb0db 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -58,6 +58,9 @@ int intel_gt_resume(struct intel_gt *gt);
 void intel_gt_runtime_suspend(struct intel_gt *gt);
 int intel_gt_runtime_resume(struct intel_gt *gt);
 
+void intel_gt_pm_active_begin(struct intel_gt *gt);
+void intel_gt_pm_active_end(struct intel_gt *gt);
+
 static inline bool is_mock_gt(const struct intel_gt *gt)
 {
 	return I915_SELFTEST_ONLY(gt->awake == -ENODEV);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 96890dd12b5f..4bc80c55e6f0 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -10,6 +10,7 @@
 #include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
+#include <linux/pm_qos.h>
 #include <linux/spinlock.h>
 #include <linux/types.h>
 
@@ -97,6 +98,17 @@ struct intel_gt {
 	 * Reserved for exclusive use by the kernel.
 	 */
 	struct i915_address_space *vm;
+
+	struct {
+		struct pm_qos_request req;
+		struct timer_list timer;
+		uint32_t target_hz;
+		uint32_t delay_max_ns;
+		uint32_t delay_slope_shift;
+		atomic64_t time_set_ns;
+		atomic64_t time_clear_ns;
+		atomic_t active_count;
+	} rf_qos;
 };
 
 enum intel_gt_scratch_field {
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index b9b3f78f1324..a5d7a80b826d 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
 	/* we need to manually load the submit queue */
 	if (execlists->ctrl_reg)
 		writel(EL_CTRL_LOAD, execlists->ctrl_reg);
+
+	if (execlists_num_ports(execlists) > 1 &&
+	    execlists->pending[1] &&
+	    !atomic_xchg(&execlists->overload, 1))
+		intel_gt_pm_active_begin(&engine->i915->gt);
 }
 
 static bool ctx_single_port_submission(const struct intel_context *ce)
@@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
 	clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
 
 	WRITE_ONCE(execlists->active, execlists->inflight);
+
+	if (atomic_xchg(&execlists->overload, 0)) {
+		struct intel_engine_cs *engine =
+			container_of(execlists, typeof(*engine), execlists);
+		intel_gt_pm_active_end(&engine->i915->gt);
+	}
 }
 
 static inline void
@@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
 			/* port0 completed, advanced to port1 */
 			trace_ports(execlists, "completed", execlists->active);
 
+			if (atomic_xchg(&execlists->overload, 0))
+				intel_gt_pm_active_end(&engine->i915->gt);
+
 			/*
 			 * We rely on the hardware being strongly
 			 * ordered, that the breadcrumb write is
-- 
2.22.1


* [Intel-gfx] [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit Francisco Jerez
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load Francisco Jerez
@ 2020-03-10 21:41 ` Francisco Jerez
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:41 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 69 +++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 8f2525e4ce0f..e5c27b9302d9 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1745,6 +1745,72 @@ static const struct file_operations i915_guc_log_relay_fops = {
 	.release = i915_guc_log_relay_release,
 };
 
+static int
+i915_rf_qos_delay_max_ns_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.rf_qos.delay_max_ns, val);
+	return 0;
+}
+
+static int
+i915_rf_qos_delay_max_ns_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.rf_qos.delay_max_ns);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_rf_qos_delay_max_ns_fops,
+			i915_rf_qos_delay_max_ns_get,
+			i915_rf_qos_delay_max_ns_set, "%llu\n");
+
+static int
+i915_rf_qos_delay_slope_shift_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.rf_qos.delay_slope_shift, val);
+	return 0;
+}
+
+static int
+i915_rf_qos_delay_slope_shift_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.rf_qos.delay_slope_shift);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_rf_qos_delay_slope_shift_fops,
+			i915_rf_qos_delay_slope_shift_get,
+			i915_rf_qos_delay_slope_shift_set, "%llu\n");
+
+static int
+i915_rf_qos_target_hz_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.rf_qos.target_hz, val);
+	return 0;
+}
+
+static int
+i915_rf_qos_target_hz_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.rf_qos.target_hz);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_rf_qos_target_hz_fops,
+			i915_rf_qos_target_hz_get,
+			i915_rf_qos_target_hz_set, "%llu\n");
+
 static int i915_runtime_pm_status(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -2390,6 +2456,9 @@ static const struct i915_debugfs_files {
 #endif
 	{"i915_guc_log_level", &i915_guc_log_level_fops},
 	{"i915_guc_log_relay", &i915_guc_log_relay_fops},
+	{"i915_rf_qos_delay_max_ns", &i915_rf_qos_delay_max_ns_fops},
+	{"i915_rf_qos_delay_slope_shift", &i915_rf_qos_delay_slope_shift_fops},
+	{"i915_rf_qos_target_hz", &i915_rf_qos_target_hz_fops}
 };
 
 int i915_debugfs_register(struct drm_i915_private *dev_priv)
-- 
2.22.1


* [Intel-gfx] [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (2 preceding siblings ...)
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs Francisco Jerez
@ 2020-03-10 21:41 ` Francisco Jerez
  2020-03-19 10:45   ` Rafael J. Wysocki
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation Francisco Jerez
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:41 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

This reverts commit c4f3f70cacba2fa19545389a12d09b606d2ad1cf.  A
future commit will introduce a new update_util implementation, so the
pstate_funcs table entry is going to be useful.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 7fa869004cf0..8cb5bf419b40 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -277,6 +277,7 @@ static struct cpudata **all_cpu_data;
  * @get_scaling:	Callback to get frequency scaling factor
  * @get_val:		Callback to convert P state to actual MSR write value
  * @get_vid:		Callback to get VID data for Atom platforms
+ * @update_util:	Active mode utilization update callback.
  *
  * Core and Atom CPU models have different way to get P State limits. This
  * structure is used to store those callbacks.
@@ -290,6 +291,8 @@ struct pstate_funcs {
 	int (*get_aperf_mperf_shift)(void);
 	u64 (*get_val)(struct cpudata*, int pstate);
 	void (*get_vid)(struct cpudata *);
+	void (*update_util)(struct update_util_data *data, u64 time,
+			    unsigned int flags);
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
@@ -1877,6 +1880,7 @@ static struct pstate_funcs core_funcs = {
 	.get_turbo = core_get_turbo_pstate,
 	.get_scaling = core_get_scaling,
 	.get_val = core_get_val,
+	.update_util = intel_pstate_update_util,
 };
 
 static const struct pstate_funcs silvermont_funcs = {
@@ -1887,6 +1891,7 @@ static const struct pstate_funcs silvermont_funcs = {
 	.get_val = atom_get_val,
 	.get_scaling = silvermont_get_scaling,
 	.get_vid = atom_get_vid,
+	.update_util = intel_pstate_update_util,
 };
 
 static const struct pstate_funcs airmont_funcs = {
@@ -1897,6 +1902,7 @@ static const struct pstate_funcs airmont_funcs = {
 	.get_val = atom_get_val,
 	.get_scaling = airmont_get_scaling,
 	.get_vid = atom_get_vid,
+	.update_util = intel_pstate_update_util,
 };
 
 static const struct pstate_funcs knl_funcs = {
@@ -1907,6 +1913,7 @@ static const struct pstate_funcs knl_funcs = {
 	.get_aperf_mperf_shift = knl_get_aperf_mperf_shift,
 	.get_scaling = core_get_scaling,
 	.get_val = core_get_val,
+	.update_util = intel_pstate_update_util,
 };
 
 #define ICPU(model, policy) \
@@ -2013,9 +2020,7 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 	/* Prevent intel_pstate_update_util() from using stale data. */
 	cpu->sample.time = 0;
 	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
-				     (hwp_active ?
-				      intel_pstate_update_util_hwp :
-				      intel_pstate_update_util));
+				     pstate_funcs.update_util);
 	cpu->update_util_set = true;
 }
 
@@ -2584,6 +2589,7 @@ static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
 	pstate_funcs.get_scaling = funcs->get_scaling;
 	pstate_funcs.get_val   = funcs->get_val;
 	pstate_funcs.get_vid   = funcs->get_vid;
+	pstate_funcs.update_util = funcs->update_util;
 	pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
 }
 
@@ -2750,8 +2756,11 @@ static int __init intel_pstate_init(void)
 	id = x86_match_cpu(hwp_support_ids);
 	if (id) {
 		copy_cpu_funcs(&core_funcs);
-		if (!no_hwp) {
+		if (no_hwp) {
+			pstate_funcs.update_util = intel_pstate_update_util;
+		} else {
 			hwp_active++;
+			pstate_funcs.update_util = intel_pstate_update_util_hwp;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
 			goto hwp_cpu_matched;
-- 
2.22.1


* [Intel-gfx] [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (3 preceding siblings ...)
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
@ 2020-03-10 21:41 ` Francisco Jerez
  2020-03-19 11:06   ` Rafael J. Wysocki
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation Francisco Jerez
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:41 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

The goal of the helper code introduced here is to compute two
informational data structures: struct vlp_input_stats, which
aggregates various scheduling and PM statistics gathered on every call
of the update_util() hook, and struct vlp_status_sample, which
contains status information derived from the former indicating whether
the system is likely to have an IO or CPU bottleneck.  This will be
used as the main heuristic input by the new variably low-pass
filtering controller (AKA VLP) that will assist the HWP in finding a
reasonably energy-efficient P-state, given the additional information
available to the kernel about I/O utilization and scheduling behavior.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 230 +++++++++++++++++++++++++++++++++
 1 file changed, 230 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 8cb5bf419b40..12ee350db2a9 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/debugfs.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -33,6 +34,8 @@
 #include <asm/cpufeature.h>
 #include <asm/intel-family.h>
 
+#include "../../kernel/sched/sched.h"
+
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
@@ -59,6 +62,11 @@ static inline int32_t mul_fp(int32_t x, int32_t y)
 	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
 }
 
+static inline int rnd_fp(int32_t x)
+{
+	return (x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
+}
+
 static inline int32_t div_fp(s64 x, s64 y)
 {
 	return div64_s64((int64_t)x << FRAC_BITS, y);
@@ -169,6 +177,49 @@ struct vid_data {
 	int32_t ratio;
 };
 
+/**
+ * Scheduling and PM statistics gathered by update_vlp_sample() at
+ * every call of the VLP update_state() hook, used as heuristic
+ * inputs.
+ */
+struct vlp_input_stats {
+	int32_t realtime_count;
+	int32_t io_wait_count;
+	uint32_t max_response_frequency_hz;
+	uint32_t last_response_frequency_hz;
+};
+
+enum vlp_status {
+	VLP_BOTTLENECK_IO = 1 << 0,
+	/*
+	 * XXX - Add other status bits here indicating a CPU or TDP
+	 * bottleneck.
+	 */
+};
+
+/**
+ * Heuristic status information calculated by get_vlp_status_sample()
+ * from struct vlp_input_stats above, indicating whether the system
+ * has a potential IO or latency bottleneck.
+ */
+struct vlp_status_sample {
+	enum vlp_status value;
+	int32_t realtime_avg;
+};
+
+/**
+ * struct vlp_data - VLP controller parameters and state.
+ * @sample_interval_ns:	 Update interval in ns.
+ * @sample_frequency_hz: Reciprocal of the update interval in Hz.
+ */
+struct vlp_data {
+	s64 sample_interval_ns;
+	int32_t sample_frequency_hz;
+
+	struct vlp_input_stats stats;
+	struct vlp_status_sample status;
+};
+
 /**
  * struct global_params - Global parameters, mostly tunable via sysfs.
  * @no_turbo:		Whether or not to use turbo P-states.
@@ -239,6 +290,7 @@ struct cpudata {
 
 	struct pstate_data pstate;
 	struct vid_data vid;
+	struct vlp_data vlp;
 
 	u64	last_update;
 	u64	last_sample_time;
@@ -268,6 +320,18 @@ struct cpudata {
 
 static struct cpudata **all_cpu_data;
 
+/**
+ * struct vlp_params - VLP controller static configuration
+ * @sample_interval_ms:	     Update interval in ms.
+ * @avg*_hz:		     Exponential averaging frequencies of the various
+ *			     low-pass filters as an integer in Hz.
+ */
+struct vlp_params {
+	int sample_interval_ms;
+	int avg_hz;
+	int debug;
+};
+
 /**
  * struct pstate_funcs - Per CPU model specific callbacks
  * @get_max:		Callback to get maximum non turbo effective P state
@@ -296,6 +360,11 @@ struct pstate_funcs {
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
+static struct vlp_params vlp_params __read_mostly = {
+	.sample_interval_ms = 10,
+	.avg_hz = 2,
+	.debug = 0,
+};
 
 static int hwp_active __read_mostly;
 static int hwp_mode_bdw __read_mostly;
@@ -1793,6 +1862,167 @@ static inline int32_t get_target_pstate(struct cpudata *cpu)
 	return target;
 }
 
+/**
+ * Initialize the struct vlp_data of the specified CPU to the defaults
+ * calculated from @vlp_params.
+ */
+static void intel_pstate_reset_vlp(struct cpudata *cpu)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+
+	vlp->sample_interval_ns = vlp_params.sample_interval_ms * NSEC_PER_MSEC;
+	vlp->sample_frequency_hz = max(1u, (uint32_t)MSEC_PER_SEC /
+					   vlp_params.sample_interval_ms);
+	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
+}
+
+/**
+ * Fixed point representation with twice the usual number of
+ * fractional bits.
+ */
+#define DFRAC_BITS 16
+#define DFRAC_ONE (1 << DFRAC_BITS)
+#define DFRAC_MAX_INT (0u - (uint32_t)DFRAC_ONE)
+
+/**
+ * Fast but rather inaccurate piecewise-linear approximation of a
+ * fixed-point inverse exponential:
+ *
+ *  exp2n(p) = int_tofp(1) * 2 ^ (-p / DFRAC_ONE) + O(1)
+ *
+ * The error term should be lower in magnitude than 0.044.
+ */
+static int32_t exp2n(uint32_t p)
+{
+	if (p < 32 * DFRAC_ONE) {
+		/* Interpolate between 2^-floor(p) and 2^-ceil(p). */
+		const uint32_t floor_p = p >> DFRAC_BITS;
+		const uint32_t ceil_p = (p + DFRAC_ONE - 1) >> DFRAC_BITS;
+		const uint64_t frac_p = p - (floor_p << DFRAC_BITS);
+
+		return ((int_tofp(1) >> floor_p) * (DFRAC_ONE - frac_p) +
+			(ceil_p >= 32 ? 0 : int_tofp(1) >> ceil_p) * frac_p) >>
+			DFRAC_BITS;
+	}
+
+	/* Short-circuit to avoid overflow. */
+	return 0;
+}
+
+/**
+ * Calculate the exponential averaging weight for a new sample based
+ * on the requested averaging frequency @hz and the delay since the
+ * last update.
+ */
+static int32_t get_last_sample_avg_weight(struct cpudata *cpu, unsigned int hz)
+{
+	/*
+	 * Approximate, but saves several 64-bit integer divisions
+	 * below and should be fully evaluated at compile-time.
+	 * Causes the exponential averaging to have an effective base
+	 * of 1.90702343749, which has little functional implications
+	 * as long as the hz parameter is scaled accordingly.
+	 */
+	const uint32_t ns_per_s_shift = order_base_2(NSEC_PER_SEC);
+	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
+
+	return exp2n(min((uint64_t)DFRAC_MAX_INT,
+			 (hz * delta_ns) >> (ns_per_s_shift - DFRAC_BITS)));
+}
+
+/**
+ * Calculate some status information heuristically based on the struct
+ * vlp_input_stats statistics gathered by the update_state() hook.
+ */
+static const struct vlp_status_sample *get_vlp_status_sample(
+	struct cpudata *cpu, const int32_t po)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+	struct vlp_input_stats *stats = &vlp->stats;
+	struct vlp_status_sample *last_status = &vlp->status;
+
+	/*
+	 * Calculate the VLP_BOTTLENECK_IO state bit, which indicates
+	 * whether some IO device driver has requested a PM response
+	 * frequency bound, typically due to the device being under
+	 * close to full utilization, which should cause the
+	 * controller to make a more conservative trade-off between
+	 * latency and energy usage, since performance isn't
+	 * guaranteed to scale further with increasing CPU frequency
+	 * whenever the system is close to IO-bound.
+	 *
+	 * Note that the maximum achievable response frequency is
+	 * limited by the sampling frequency of the controller,
+	 * response frequency requests greater than that will be
+	 * promoted to infinity (i.e. no low-pass filtering) in order
+	 * to avoid violating the response frequency constraint
+	 * provided via PM QoS.
+	 */
+	const bool bottleneck_io = stats->max_response_frequency_hz <
+				   vlp->sample_frequency_hz;
+
+	/*
+	 * Calculate the realtime statistic that tracks the
+	 * exponentially-averaged rate of occurrence of
+	 * latency-sensitive events (like wake-ups from IO wait).
+	 */
+	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
+	const int32_t realtime_sample =
+		div_fp((uint64_t)(stats->realtime_count +
+				  (bottleneck_io ? 0 : stats->io_wait_count)) *
+		       NSEC_PER_SEC,
+		       100 * delta_ns);
+	const int32_t alpha = get_last_sample_avg_weight(cpu,
+							 vlp_params.avg_hz);
+	const int32_t realtime_avg = realtime_sample +
+		mul_fp(alpha, last_status->realtime_avg - realtime_sample);
+
+	/* Consume the input statistics. */
+	stats->io_wait_count = 0;
+	stats->realtime_count = 0;
+	if (bottleneck_io)
+		stats->last_response_frequency_hz =
+			stats->max_response_frequency_hz;
+	stats->max_response_frequency_hz = 0;
+
+	/* Update the state of the controller. */
+	last_status->realtime_avg = realtime_avg;
+	last_status->value = (bottleneck_io ? VLP_BOTTLENECK_IO : 0);
+
+	/* Update state used for tracing. */
+	cpu->sample.busy_scaled = int_tofp(stats->max_response_frequency_hz);
+	cpu->iowait_boost = realtime_avg;
+
+	return last_status;
+}
+
+/**
+ * Collect some scheduling and PM statistics in response to an
+ * update_state() call.
+ */
+static bool update_vlp_sample(struct cpudata *cpu, u64 time, unsigned int flags)
+{
+	struct vlp_input_stats *stats = &cpu->vlp.stats;
+
+	/* Update PM QoS request. */
+	const uint32_t resp_hz = cpu_response_frequency_qos_limit();
+
+	stats->max_response_frequency_hz = !resp_hz ? UINT_MAX :
+		max(stats->max_response_frequency_hz, resp_hz);
+
+	/* Update scheduling statistics. */
+	if ((flags & SCHED_CPUFREQ_IOWAIT))
+		stats->io_wait_count++;
+
+	if (cpu_rq(cpu->cpu)->rt.rt_nr_running)
+		stats->realtime_count++;
+
+	/* Return whether a P-state update is due. */
+	return smp_processor_id() == cpu->cpu &&
+		time - cpu->sample.time >= cpu->vlp.sample_interval_ns &&
+		intel_pstate_sample(cpu, time);
+}
+
 static int intel_pstate_prepare_request(struct cpudata *cpu, int pstate)
 {
 	int min_pstate = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
-- 
2.22.1


* [Intel-gfx] [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (4 preceding siblings ...)
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation Francisco Jerez
@ 2020-03-10 21:41 ` Francisco Jerez
  2020-03-19 11:12   ` Rafael J. Wysocki
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts Francisco Jerez
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:41 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

The function introduced here calculates a P-state range derived from
the statistics computed in the previous patch, which will be used to
drive the HWP P-state range or (if HWP is not available) serve as the
basis for an additional kernel-side frequency selection mechanism that
chooses a single P-state from the range.  This is meant to provide a
variably low-pass filtering effect that damps oscillations below a
frequency threshold that can be specified by device drivers via PM
QoS, in order to achieve energy-efficient behavior in cases where the
system has an IO bottleneck.
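
At its core the filtering is a plain exponentially-weighted moving
average of the target P-state.  A minimal sketch of one update step,
using the fixed-point helpers from intel_pstate.c, where the
per-sample weight alpha is derived from the PM QoS response frequency
via get_last_sample_avg_weight() introduced in the previous patch:

/*
 * One low-pass filter update step of the filtered P-state estimate
 * (sketch only; get_vlp_target_range() below additionally clamps the
 * target to the policy limits before filtering).  alpha approaches
 * int_tofp(1) when the requested response frequency is much lower
 * than the sampling frequency, so short-lived oscillations of
 * p_target barely move the filtered estimate, while sustained changes
 * eventually pass through.
 */
static int32_t vlp_filter_step(int32_t p_base, int32_t p_target, int32_t alpha)
{
	return p_target + mul_fp(alpha, p_base - p_target);
}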

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 157 +++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 12ee350db2a9..cecadfec8bc1 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -207,17 +207,34 @@ struct vlp_status_sample {
 	int32_t realtime_avg;
 };
 
+/**
+ * VLP controller state used for the estimation of the target P-state
+ * range, computed by get_vlp_target_range() from the heuristic status
+ * information defined above in struct vlp_status_sample.
+ */
+struct vlp_target_range {
+	unsigned int value[2];
+	int32_t p_base;
+};
+
 /**
  * struct vlp_data - VLP controller parameters and state.
  * @sample_interval_ns:	 Update interval in ns.
  * @sample_frequency_hz: Reciprocal of the update interval in Hz.
+ * @gain*:		 Response factor of the controller relative to each
+ *			 one of its linear input variables as fixed-point
+ *			 fraction.
  */
 struct vlp_data {
 	s64 sample_interval_ns;
 	int32_t sample_frequency_hz;
+	int32_t gain_aggr;
+	int32_t gain_rt;
+	int32_t gain;
 
 	struct vlp_input_stats stats;
 	struct vlp_status_sample status;
+	struct vlp_target_range target;
 };
 
 /**
@@ -323,12 +340,18 @@ static struct cpudata **all_cpu_data;
 /**
  * struct vlp_params - VLP controller static configuration
  * @sample_interval_ms:	     Update interval in ms.
+ * @setpoint_*_pml:	     Target CPU utilization at which the controller is
+ *			     expected to leave the current P-state untouched,
+ *			     as an integer per mille.
  * @avg*_hz:		     Exponential averaging frequencies of the various
  *			     low-pass filters as an integer in Hz.
  */
 struct vlp_params {
 	int sample_interval_ms;
+	int setpoint_0_pml;
+	int setpoint_aggr_pml;
 	int avg_hz;
+	int realtime_gain_pml;
 	int debug;
 };
 
@@ -362,7 +385,10 @@ struct pstate_funcs {
 static struct pstate_funcs pstate_funcs __read_mostly;
 static struct vlp_params vlp_params __read_mostly = {
 	.sample_interval_ms = 10,
+	.setpoint_0_pml = 900,
+	.setpoint_aggr_pml = 1500,
 	.avg_hz = 2,
+	.realtime_gain_pml = 12000,
 	.debug = 0,
 };
 
@@ -1873,6 +1899,11 @@ static void intel_pstate_reset_vlp(struct cpudata *cpu)
 	vlp->sample_interval_ns = vlp_params.sample_interval_ms * NSEC_PER_MSEC;
 	vlp->sample_frequency_hz = max(1u, (uint32_t)MSEC_PER_SEC /
 					   vlp_params.sample_interval_ms);
+	vlp->gain_rt = div_fp(cpu->pstate.max_pstate *
+			      vlp_params.realtime_gain_pml, 1000);
+	vlp->gain_aggr = max(1, div_fp(1000, vlp_params.setpoint_aggr_pml));
+	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
+	vlp->target.p_base = 0;
 	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
 }
 
@@ -1996,6 +2027,132 @@ static const struct vlp_status_sample *get_vlp_status_sample(
 	return last_status;
 }
 
+/**
+ * Calculate the target P-state range for the next update period.
+ * Uses a variably low-pass-filtering controller intended to improve
+ * energy efficiency when a CPU response frequency target is specified
+ * via PM QoS (e.g. under IO-bound conditions).
+ */
+static const struct vlp_target_range *get_vlp_target_range(struct cpudata *cpu)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+	struct vlp_target_range *last_target = &vlp->target;
+
+	/*
+	 * P-state limits in fixed-point as allowed by the policy.
+	 */
+	const int32_t p0 = int_tofp(max(cpu->pstate.min_pstate,
+					cpu->min_perf_ratio));
+	const int32_t p1 = int_tofp(cpu->max_perf_ratio);
+
+	/*
+	 * Observed average P-state during the sampling period.	 The
+	 * conservative path (po_cons) uses the TSC increment as
+	 * denominator which will give the minimum (arguably most
+	 * energy-efficient) P-state able to accomplish the observed
+	 * amount of work during the sampling period.
+	 *
+	 * The downside of that somewhat optimistic estimate is that
+	 * it can give a biased result for intermittent
+	 * latency-sensitive workloads, which may have to be completed
+	 * in a short window of time for the system to achieve maximum
+	 * performance, even if the average CPU utilization is low.
+	 * For that reason the aggressive path (po_aggr) uses the
+	 * MPERF increment as denominator, which is approximately
+	 * optimal under the pessimistic assumption that the CPU work
+	 * cannot be parallelized with any other dependent IO work
+	 * that subsequently keeps the CPU idle (partly in C1+
+	 * states).
+	 */
+	const int32_t po_cons =
+		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
+		       * cpu->pstate.max_pstate_physical,
+		       cpu->sample.tsc);
+	const int32_t po_aggr =
+		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
+		       * cpu->pstate.max_pstate_physical,
+		       (cpu->sample.mperf << cpu->aperf_mperf_shift));
+
+	const struct vlp_status_sample *status =
+		get_vlp_status_sample(cpu, po_cons);
+
+	/* Calculate the target P-state. */
+	const int32_t p_tgt_cons = mul_fp(vlp->gain, po_cons);
+	const int32_t p_tgt_aggr = mul_fp(vlp->gain_aggr, po_aggr);
+	const int32_t p_tgt = max(p0, min(p1, max(p_tgt_cons, p_tgt_aggr)));
+
+	/* Calculate the realtime P-state target lower bound. */
+	const int32_t pm = int_tofp(cpu->pstate.max_pstate);
+	const int32_t p_tgt_rt = min(pm, mul_fp(vlp->gain_rt,
+						status->realtime_avg));
+
+	/*
+	 * Low-pass filter the P-state estimate above by exponential
+	 * averaging.  For an oscillating workload (e.g. submitting
+	 * work repeatedly to a device like a soundcard or GPU) this
+	 * will approximate the minimum P-state that would be able to
+	 * accomplish the observed amount of work during the averaging
+	 * period, which is also the optimally energy-efficient one,
+	 * under the assumptions that:
+	 *
+	 *  - The power curve of the system is convex throughout the
+	 *    range of P-states allowed by the policy. I.e. energy
+	 *    efficiency is steadily decreasing with frequency past p0
+	 *    (which is typically close to the maximum-efficiency
+	 *    ratio).  In practice for the lower range of P-states
+	 *    this may only be approximately true due to the
+	 *    interaction between different components of the system.
+	 *
+	 *  - Parallelism constraints of the workload don't prevent it
+	 *    from achieving the same throughput at the lower P-state.
+	 *    This will happen in cases where the application is
+	 *    designed in a way that doesn't allow for dependent CPU
+	 *    and IO jobs to be pipelined, leading to alternating full
+	 *    and zero utilization of the CPU and IO device.  This
+	 *    will give an average IO device utilization lower than
+	 *    100% regardless of the CPU frequency, which should
+	 *    prevent the device driver from requesting a response
+	 *    frequency bound, so the filtered P-state calculated
+	 *    below won't have an influence on the controller
+	 *    response.
+	 *
+	 *  - The period of the oscillating workload is significantly
+	 *    shorter than the time constant of the exponential
+	 *    average (1s / last_response_frequency_hz).  Otherwise for
+	 *    more slowly oscillating workloads the controller
+	 *    response will roughly follow the oscillation, leading to
+	 *    decreased energy efficiency.
+	 *
+	 *  - The behavior of the workload doesn't change
+	 *    qualitatively during the next update interval.  This is
+	 *    only true in the steady state, and could possibly lead
+	 *    to a transitory period in which the controller response
+	 *    deviates from the most energy-efficient ratio until the
+	 *    workload reaches a steady state again.
+	 */
+	const int32_t alpha = get_last_sample_avg_weight(
+		cpu, vlp->stats.last_response_frequency_hz);
+
+	last_target->p_base = p_tgt + mul_fp(alpha,
+					     last_target->p_base - p_tgt);
+
+	/*
+	 * Use the low-pass-filtered controller response for better
+	 * energy efficiency unless we have reasons to believe that
+	 * some of the optimality assumptions discussed above may not
+	 * hold.
+	 */
+	if ((status->value & VLP_BOTTLENECK_IO)) {
+		last_target->value[0] = rnd_fp(p0);
+		last_target->value[1] = rnd_fp(last_target->p_base);
+	} else {
+		last_target->value[0] = rnd_fp(p_tgt_rt);
+		last_target->value[1] = rnd_fp(p1);
+	}
+
+	return last_target;
+}
+
 /**
  * Collect some scheduling and PM statistics in response to an
  * update_util() call.
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread
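
The filtered target computed at the end of get_vlp_target_range() above is a
plain exponential moving average whose weight is derived from the PM QoS
response frequency.  A minimal user-space model of the same update in C, under
the assumption (the helper isn't shown in this excerpt) that
get_last_sample_avg_weight() approximates exp(-delta_t * f_avg):

  #include <math.h>
  #include <stdio.h>

  /*
   * Illustrative model of the VLP low-pass filter: "f_avg_hz" plays the
   * role of vlp->stats.last_response_frequency_hz and "delta_t_s" is the
   * time elapsed since the previous sample (names invented for this
   * sketch; the kernel code works in fixed point).
   */
  static double filter_target(double p_base, double p_tgt,
                              double delta_t_s, double f_avg_hz)
  {
          /* Assumed form of get_last_sample_avg_weight(): the previous
           * estimate decays exponentially with the number of response
           * periods elapsed since the last update. */
          const double alpha = exp(-delta_t_s * f_avg_hz);

          /* Same update as "p_base = p_tgt + alpha * (p_base - p_tgt)". */
          return p_tgt + alpha * (p_base - p_tgt);
  }

  int main(void)
  {
          /* E.g. 10 ms samples filtered at a 10 Hz response frequency:
           * alpha = exp(-0.1) ~= 0.9, so a single-sample spike in the
           * instantaneous target only moves the estimate by ~10% of the
           * gap. */
          printf("%f\n", filter_target(20.0, 30.0, 0.01, 10.0));
          return 0;
  }

With those numbers the filtered target moves from 20 to roughly 21, which is
the intended behaviour for workloads oscillating much faster than the
requested response frequency.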

* [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (5 preceding siblings ...)
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation Francisco Jerez
@ 2020-03-10 21:42 ` Francisco Jerez
  2020-03-17 23:59   ` Pandruvada, Srinivas
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID Francisco Jerez
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:42 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

This implements a simple variably low-pass-filtering (VLP) governor
that drives the HWP MIN/MAX PERF range based on the previously
introduced get_vlp_target_range().  See "cpufreq: intel_pstate:
Implement VLP controller target P-state range estimation." for the
rationale.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 79 +++++++++++++++++++++++++++++++++-
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cecadfec8bc1..a01eed40d897 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1905,6 +1905,20 @@ static void intel_pstate_reset_vlp(struct cpudata *cpu)
 	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
 	vlp->target.p_base = 0;
 	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
+
+	if (hwp_active) {
+		const uint32_t p0 = max(cpu->pstate.min_pstate,
+					cpu->min_perf_ratio);
+		const uint32_t p1 = max_t(uint32_t, p0, cpu->max_perf_ratio);
+		const uint64_t hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
+					  ~(HWP_MAX_PERF(~0L) |
+					    HWP_MIN_PERF(~0L) |
+					    HWP_DESIRED_PERF(~0L))) |
+					 HWP_MIN_PERF(p0) | HWP_MAX_PERF(p1);
+
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
+		cpu->hwp_req_cached = hwp_req;
+	}
 }
 
 /**
@@ -2222,6 +2236,46 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
+					     const unsigned int range[])
+{
+	const int from = cpu->hwp_req_cached;
+	unsigned int p0, p1, p_min, p_max;
+	struct sample *sample;
+	uint64_t hwp_req;
+
+	update_turbo_state();
+
+	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
+	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
+	p_min = clamp_t(unsigned int, range[0], p0, p1);
+	p_max = clamp_t(unsigned int, range[1], p0, p1);
+
+	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu->cpu);
+
+	hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
+		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
+		     HWP_DESIRED_PERF(~0L))) |
+		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 : p_min) |
+		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 : p_max);
+
+	if (hwp_req != cpu->hwp_req_cached) {
+		wrmsrl(MSR_HWP_REQUEST, hwp_req);
+		cpu->hwp_req_cached = hwp_req;
+	}
+
+	sample = &cpu->sample;
+	trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
+			    fp_toint(sample->busy_scaled),
+			    from,
+			    hwp_req,
+			    sample->mperf,
+			    sample->aperf,
+			    sample->tsc,
+			    get_avg_frequency(cpu),
+			    fp_toint(cpu->iowait_boost * 100));
+}
+
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 				     unsigned int flags)
 {
@@ -2260,6 +2314,22 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 		intel_pstate_adjust_pstate(cpu);
 }
 
+/**
+ * Implementation of the cpufreq update_util hook based on the VLP
+ * controller (see get_vlp_target_range()).
+ */
+static void intel_pstate_update_util_hwp_vlp(struct update_util_data *data,
+					     u64 time, unsigned int flags)
+{
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+
+	if (update_vlp_sample(cpu, time, flags)) {
+		const struct vlp_target_range *target =
+			get_vlp_target_range(cpu);
+		intel_pstate_adjust_pstate_range(cpu, target->value);
+	}
+}
+
 static struct pstate_funcs core_funcs = {
 	.get_max = core_get_max_pstate,
 	.get_max_physical = core_get_max_pstate_physical,
@@ -2389,6 +2459,9 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
 
 	intel_pstate_get_cpu_pstates(cpu);
 
+	if (pstate_funcs.update_util == intel_pstate_update_util_hwp_vlp)
+		intel_pstate_reset_vlp(cpu);
+
 	pr_debug("controlling: cpu %d\n", cpunum);
 
 	return 0;
@@ -2398,7 +2471,8 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
 	struct cpudata *cpu = all_cpu_data[cpu_num];
 
-	if (hwp_active && !hwp_boost)
+	if (hwp_active && !hwp_boost &&
+	    pstate_funcs.update_util != intel_pstate_update_util_hwp_vlp)
 		return;
 
 	if (cpu->update_util_set)
@@ -2526,7 +2600,8 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
 		 * was turned off, in that case we need to clear the
 		 * update util hook.
 		 */
-		if (!hwp_boost)
+		if (!hwp_boost && pstate_funcs.update_util !=
+				  intel_pstate_update_util_hwp_vlp)
 			intel_pstate_clear_update_util_hook(policy->cpu);
 		intel_pstate_hwp_set(policy->cpu);
 	}
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread
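
For reference, the MIN/MAX/DESIRED fields rewritten by intel_pstate_reset_vlp()
and intel_pstate_adjust_pstate_range() above live in the low bits of the
IA32_HWP_REQUEST MSR (0x774).  A rough, self-contained sketch of the
composition, with the field layout taken from the Intel SDM and the kernel
macros paraphrased rather than copied verbatim:

  #include <stdint.h>

  /* IA32_HWP_REQUEST field layout (per the Intel SDM):
   *   bits  7:0   minimum performance
   *   bits 15:8   maximum performance
   *   bits 23:16  desired performance
   *   bits 31:24  energy/performance preference (EPP)
   * Paraphrased equivalents of the HWP_*_PERF() macros used above: */
  #define HWP_MIN_PERF(x)      ((uint64_t)((x) & 0xff))
  #define HWP_MAX_PERF(x)      ((uint64_t)((x) & 0xff) << 8)
  #define HWP_DESIRED_PERF(x)  ((uint64_t)((x) & 0xff) << 16)

  /* Compose a new HWP_REQUEST value the way the patch does: clear the
   * three performance fields of the cached value and insert the clamped
   * [p_min, p_max] bounds, leaving EPP and the remaining bits intact. */
  static uint64_t compose_hwp_req(uint64_t cached, unsigned int p_min,
                                  unsigned int p_max)
  {
          return (cached & ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
                             HWP_DESIRED_PERF(~0L))) |
                 HWP_MIN_PERF(p_min) | HWP_MAX_PERF(p_max);
  }

Only the three performance fields change while EPP and the rest of the
register are preserved, which is why the cached copy (hwp_req_cached) can be
compared against the new value to skip redundant wrmsrl() calls.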

* [Intel-gfx] [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (6 preceding siblings ...)
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts Francisco Jerez
@ 2020-03-10 21:42 ` Francisco Jerez
  2020-03-19 11:20   ` Rafael J. Wysocki
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status Francisco Jerez
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:42 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

For the moment the VLP controller is only enabled on ICL platforms
whose ACPI FADT profile is not a server profile, in order to reduce
the validation effort of the initial submission.  It should work on
any other processor that supports HWP though (and soon enough on
non-HWP parts too).  To override the default behavior (e.g. in order
to test on other platforms), the VLP controller can be forcibly
enabled or disabled by passing "intel_pstate=vlp" or
"intel_pstate=no_vlp" respectively on the kernel command line.

v2: Handle HWP VLP controller.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 .../admin-guide/kernel-parameters.txt         |  5 ++++
 Documentation/admin-guide/pm/intel_pstate.rst |  7 ++++++
 drivers/cpufreq/intel_pstate.c                | 25 +++++++++++++++++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0c9894247015..9bc55fc2752e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1828,6 +1828,11 @@
 			per_cpu_perf_limits
 			  Allow per-logical-CPU P-State performance control limits using
 			  cpufreq sysfs interface
+			vlp
+			  Force use of VLP P-state controller.  Overrides selection
+			  derived from ACPI FADT profile.
+			no_vlp
+			  Prevent use of VLP P-state controller (see "vlp" parameter).
 
 	intremap=	[X86-64, Intel-IOMMU]
 			on	enable Interrupt Remapping (default)
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index 67e414e34f37..da6b64812848 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -669,6 +669,13 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
 	Use per-logical-CPU P-State limits (see `Coordination of P-state
 	Limits`_ for details).
 
+``vlp``
+	Force use of VLP P-state controller.  Overrides selection derived
+	from ACPI FADT profile.
+
+``no_vlp``
+	Prevent use of VLP P-state controller (see "vlp" parameter).
+
 
 Diagnostics and Tuning
 ======================
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index a01eed40d897..050cc8f03c26 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -3029,6 +3029,7 @@ static int intel_pstate_update_status(const char *buf, size_t size)
 
 static int no_load __initdata;
 static int no_hwp __initdata;
+static int vlp __initdata = -1;
 static int hwp_only __initdata;
 static unsigned int force_load __initdata;
 
@@ -3193,6 +3194,7 @@ static inline void intel_pstate_request_control_from_smm(void) {}
 #endif /* CONFIG_ACPI */
 
 #define INTEL_PSTATE_HWP_BROADWELL	0x01
+#define INTEL_PSTATE_HWP_VLP		0x02
 
 #define ICPU_HWP(model, hwp_mode) \
 	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_HWP, hwp_mode }
@@ -3200,12 +3202,15 @@ static inline void intel_pstate_request_control_from_smm(void) {}
 static const struct x86_cpu_id hwp_support_ids[] __initconst = {
 	ICPU_HWP(INTEL_FAM6_BROADWELL_X, INTEL_PSTATE_HWP_BROADWELL),
 	ICPU_HWP(INTEL_FAM6_BROADWELL_D, INTEL_PSTATE_HWP_BROADWELL),
+	ICPU_HWP(INTEL_FAM6_ICELAKE, INTEL_PSTATE_HWP_VLP),
+	ICPU_HWP(INTEL_FAM6_ICELAKE_L, INTEL_PSTATE_HWP_VLP),
 	ICPU_HWP(X86_MODEL_ANY, 0),
 	{}
 };
 
 static int __init intel_pstate_init(void)
 {
+	bool use_vlp = vlp == 1;
 	const struct x86_cpu_id *id;
 	int rc;
 
@@ -3222,8 +3227,19 @@ static int __init intel_pstate_init(void)
 			pstate_funcs.update_util = intel_pstate_update_util;
 		} else {
 			hwp_active++;
-			pstate_funcs.update_util = intel_pstate_update_util_hwp;
-			hwp_mode_bdw = id->driver_data;
+
+			if (vlp < 0 && !intel_pstate_acpi_pm_profile_server() &&
+			    (id->driver_data & INTEL_PSTATE_HWP_VLP)) {
+				/* Enable VLP controller by default. */
+				use_vlp = true;
+			}
+
+			pstate_funcs.update_util = use_vlp ?
+				intel_pstate_update_util_hwp_vlp :
+				intel_pstate_update_util_hwp;
+
+			hwp_mode_bdw = (id->driver_data &
+					INTEL_PSTATE_HWP_BROADWELL);
 			intel_pstate.attr = hwp_cpufreq_attrs;
 			goto hwp_cpu_matched;
 		}
@@ -3301,6 +3317,11 @@ static int __init intel_pstate_setup(char *str)
 	if (!strcmp(str, "per_cpu_perf_limits"))
 		per_cpu_limits = true;
 
+	if (!strcmp(str, "vlp"))
+		vlp = 1;
+	if (!strcmp(str, "no_vlp"))
+		vlp = 0;
+
 #ifdef CONFIG_ACPI
 	if (!strcmp(str, "support_acpi_ppc"))
 		acpi_ppc = true;
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread
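
The "server FADT profiles" referred to above come from the ACPI FADT preferred
PM profile field.  The intel_pstate_acpi_pm_profile_server() helper used by
the patch isn't shown in this excerpt; a plausible shape of it, assuming it
simply keys off acpi_gbl_FADT.preferred_profile (illustration only, kernel
context assumed):

  #ifdef CONFIG_ACPI
  /* Treat the enterprise/performance server preferred PM profiles as
   * "server" so the VLP controller stays off by default on them. */
  static bool intel_pstate_acpi_pm_profile_server(void)
  {
          return acpi_gbl_FADT.preferred_profile == PM_ENTERPRISE_SERVER ||
                 acpi_gbl_FADT.preferred_profile == PM_PERFORMANCE_SERVER;
  }
  #else
  static inline bool intel_pstate_acpi_pm_profile_server(void)
  {
          return false;
  }
  #endif

Either way, "intel_pstate=vlp" or "intel_pstate=no_vlp" on the kernel command
line takes precedence over whatever the FADT-based default selects.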

* [Intel-gfx] [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (7 preceding siblings ...)
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID Francisco Jerez
@ 2020-03-10 21:42 ` Francisco Jerez
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs Francisco Jerez
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:42 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c |  9 ++++++---
 include/trace/events/power.h   | 13 +++++++++----
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 050cc8f03c26..c4558a131660 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2233,7 +2233,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		fp_toint(cpu->iowait_boost * 100),
+		cpu->vlp.status.value);
 }
 
 static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
@@ -2273,7 +2274,8 @@ static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
 			    sample->aperf,
 			    sample->tsc,
 			    get_avg_frequency(cpu),
-			    fp_toint(cpu->iowait_boost * 100));
+			    fp_toint(cpu->iowait_boost * 100),
+			    cpu->vlp.status.value);
 }
 
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
@@ -2782,7 +2784,8 @@ static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, in
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		fp_toint(cpu->iowait_boost * 100),
+		0);
 }
 
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 7e4b52e8ca3a..e94d5e618175 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -72,7 +72,8 @@ TRACE_EVENT(pstate_sample,
 		u64 aperf,
 		u64 tsc,
 		u32 freq,
-		u32 io_boost
+		u32 io_boost,
+		u32 vlp_status
 		),
 
 	TP_ARGS(core_busy,
@@ -83,7 +84,8 @@ TRACE_EVENT(pstate_sample,
 		aperf,
 		tsc,
 		freq,
-		io_boost
+		io_boost,
+		vlp_status
 		),
 
 	TP_STRUCT__entry(
@@ -96,6 +98,7 @@ TRACE_EVENT(pstate_sample,
 		__field(u64, tsc)
 		__field(u32, freq)
 		__field(u32, io_boost)
+		__field(u32, vlp_status)
 		),
 
 	TP_fast_assign(
@@ -108,9 +111,10 @@ TRACE_EVENT(pstate_sample,
 		__entry->tsc = tsc;
 		__entry->freq = freq;
 		__entry->io_boost = io_boost;
+		__entry->vlp_status = vlp_status;
 		),
 
-	TP_printk("core_busy=%lu scaled=%lu from=%lu to=%lu mperf=%llu aperf=%llu tsc=%llu freq=%lu io_boost=%lu",
+	TP_printk("core_busy=%lu scaled=%lu from=%lu to=%lu mperf=%llu aperf=%llu tsc=%llu freq=%lu io_boost=%lu vlp=%lu",
 		(unsigned long)__entry->core_busy,
 		(unsigned long)__entry->scaled_busy,
 		(unsigned long)__entry->from,
@@ -119,7 +123,8 @@ TRACE_EVENT(pstate_sample,
 		(unsigned long long)__entry->aperf,
 		(unsigned long long)__entry->tsc,
 		(unsigned long)__entry->freq,
-		(unsigned long)__entry->io_boost
+		(unsigned long)__entry->io_boost,
+		(unsigned long)__entry->vlp_status
 		)
 
 );
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Intel-gfx] [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (8 preceding siblings ...)
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status Francisco Jerez
@ 2020-03-10 21:42 ` Francisco Jerez
  2020-03-11  2:35 ` [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Francisco Jerez @ 2020-03-10 21:42 UTC (permalink / raw)
  To: linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki, Julia Lawall, Pandruvada,
	Srinivas, Fengguang Wu

This is not required for the controller to work but has proven very
useful for debugging and for testing alternative heuristic parameters,
which may offer a better trade-off between energy efficiency and
latency.  A warning is printed out (which also taints the kernel) the
first time a parameter is modified through this interface, so that any
non-standard calibration of the heuristic is obvious in bug reports.

v2: Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
    for debugfs files (Julia).  Add realtime statistic threshold and
    averaging frequency parameters.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
---
 drivers/cpufreq/intel_pstate.c | 92 ++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index c4558a131660..ab893a211746 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1030,6 +1030,94 @@ static void intel_pstate_update_limits(unsigned int cpu)
 	mutex_unlock(&intel_pstate_driver_lock);
 }
 
+/************************** debugfs begin ************************/
+static void intel_pstate_reset_vlp(struct cpudata *cpu);
+
+static int vlp_param_set(void *data, u64 val)
+{
+	unsigned int cpu;
+
+	*(u32 *)data = val;
+	for_each_possible_cpu(cpu) {
+		if (all_cpu_data[cpu])
+			intel_pstate_reset_vlp(all_cpu_data[cpu]);
+	}
+
+	WARN_ONCE(1, "Unsupported P-state VLP parameter update via debugging interface");
+
+	return 0;
+}
+
+static int vlp_param_get(void *data, u64 *val)
+{
+	*val = *(u32 *)data;
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(fops_vlp_param, vlp_param_get, vlp_param_set,
+			 "%llu\n");
+
+static struct dentry *debugfs_parent;
+
+struct vlp_param {
+	char *name;
+	void *value;
+	struct dentry *dentry;
+};
+
+static struct vlp_param vlp_files[] = {
+	{"vlp_sample_interval_ms", &vlp_params.sample_interval_ms, },
+	{"vlp_setpoint_0_pml", &vlp_params.setpoint_0_pml, },
+	{"vlp_setpoint_aggr_pml", &vlp_params.setpoint_aggr_pml, },
+	{"vlp_avg_hz", &vlp_params.avg_hz, },
+	{"vlp_realtime_gain_pml", &vlp_params.realtime_gain_pml, },
+	{"vlp_debug", &vlp_params.debug, },
+	{NULL, NULL, }
+};
+
+static void intel_pstate_update_util_hwp_vlp(struct update_util_data *data,
+					     u64 time, unsigned int flags);
+
+static void intel_pstate_debug_expose_params(void)
+{
+	int i;
+
+	if (pstate_funcs.update_util != intel_pstate_update_util_hwp_vlp)
+		return;
+
+	debugfs_parent = debugfs_create_dir("pstate_snb", NULL);
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; vlp_files[i].name; i++) {
+		struct dentry *dentry;
+
+		dentry = debugfs_create_file_unsafe(vlp_files[i].name, 0660,
+						    debugfs_parent,
+						    vlp_files[i].value,
+						    &fops_vlp_param);
+		if (!IS_ERR(dentry))
+			vlp_files[i].dentry = dentry;
+	}
+}
+
+static void intel_pstate_debug_hide_params(void)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; vlp_files[i].name; i++) {
+		debugfs_remove(vlp_files[i].dentry);
+		vlp_files[i].dentry = NULL;
+	}
+
+	debugfs_remove(debugfs_parent);
+	debugfs_parent = NULL;
+}
+
+/************************** debugfs end ************************/
+
 /************************** sysfs begin ************************/
 #define show_one(file_name, object)					\
 	static ssize_t show_##file_name					\
@@ -2970,6 +3058,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	intel_pstate_debug_expose_params();
+
 	return 0;
 }
 
@@ -2978,6 +3068,8 @@ static int intel_pstate_unregister_driver(void)
 	if (hwp_active)
 		return -EBUSY;
 
+	intel_pstate_debug_hide_params();
+
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load Francisco Jerez
@ 2020-03-10 22:26   ` Chris Wilson
  2020-03-11  0:34     ` Francisco Jerez
  2020-03-11 10:00     ` Tvrtko Ursulin
  0 siblings, 2 replies; 44+ messages in thread
From: Chris Wilson @ 2020-03-10 22:26 UTC (permalink / raw)
  To: Francisco Jerez, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

Quoting Francisco Jerez (2020-03-10 21:41:55)
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index b9b3f78f1324..a5d7a80b826d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>         /* we need to manually load the submit queue */
>         if (execlists->ctrl_reg)
>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
> +
> +       if (execlists_num_ports(execlists) > 1 &&
pending[1] is always defined, the minimum submission is one slot, with
pending[1] as the sentinel NULL.

> +           execlists->pending[1] &&
> +           !atomic_xchg(&execlists->overload, 1))
> +               intel_gt_pm_active_begin(&engine->i915->gt);

engine->gt

>  }
>  
>  static bool ctx_single_port_submission(const struct intel_context *ce)
> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>  
>         WRITE_ONCE(execlists->active, execlists->inflight);
> +
> +       if (atomic_xchg(&execlists->overload, 0)) {
> +               struct intel_engine_cs *engine =
> +                       container_of(execlists, typeof(*engine), execlists);
> +               intel_gt_pm_active_end(&engine->i915->gt);
> +       }
>  }
>  
>  static inline void
> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>                         /* port0 completed, advanced to port1 */
>                         trace_ports(execlists, "completed", execlists->active);
>  
> +                       if (atomic_xchg(&execlists->overload, 0))
> +                               intel_gt_pm_active_end(&engine->i915->gt);

So this loses track if we preempt a dual-ELSP submission with a
single-ELSP submission (and never go back to dual).

If you move this to the end of the loop and check

if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
	intel_gt_pm_active_end(engine->gt);

so that it covers both preemption/promotion and completion.

However, that will fluctuate quite rapidly. (And runs the risk of
exceeding the sentinel.)

An alternative approach would be to couple along
schedule_in/schedule_out

atomic_set(overload, -1);

__execlists_schedule_in:
	if (!atomic_fetch_inc(overload)
		intel_gt_pm_active_begin(engine->gt);
__execlists_schedule_out:
	if (!atomic_dec_return(overload)
		intel_gt_pm_active_end(engine->gt);

which would mean we are overloaded as soon as we try to submit an
overlapping ELSP.


The metric feels very multiple client (game + display server, or
saturated transcode) centric. In the endless kernel world, we expect
100% engine utilisation from a single context, and never a dual-ELSP
submission. They are also likely to want to avoid being throttled to
conserve TDP for the CPU.

Should we also reduce the overload for the number of clients who are
waiting for interrupts from the GPU, so that their wakeup latency is not
impacted?
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread
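
For clarity, a slightly fleshed-out version of the schedule_in/schedule_out
counting suggested above, with the counter biased to -1 so that only the
transition to a second context in flight flips the overload hint (an
illustration of the idea, not the final patch):

  /* Bias the counter so a single resident context keeps the hint off;
   * the 1 -> 2 transition raises it and the 2 -> 1 transition drops it. */
  static void execlists_init_overload(struct intel_engine_execlists *el)
  {
          atomic_set(&el->overload, -1);
  }

  static void execlists_schedule_in_overload(struct intel_engine_cs *engine)
  {
          /* Returns the old value: -1 means this is the first context in
           * flight (no hint); 0 means a second context just arrived, so
           * raise the hint. */
          if (!atomic_fetch_inc(&engine->execlists.overload))
                  intel_gt_pm_active_begin(engine->gt);
  }

  static void execlists_schedule_out_overload(struct intel_engine_cs *engine)
  {
          /* Returns the new value: reaching 0 means only one context is
           * left in flight, so drop the hint again. */
          if (!atomic_dec_return(&engine->execlists.overload))
                  intel_gt_pm_active_end(engine->gt);
  }

This keeps the hint asserted across preemptions as long as more than one
context remains in flight, which covers the dual-to-single-ELSP corner case
pointed out above.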

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-10 22:26   ` Chris Wilson
@ 2020-03-11  0:34     ` Francisco Jerez
  2020-03-18 19:42       ` Francisco Jerez
  2020-03-11 10:00     ` Tvrtko Ursulin
  1 sibling, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-11  0:34 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas


[-- Attachment #1.1.1: Type: text/plain, Size: 5531 bytes --]

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Francisco Jerez (2020-03-10 21:41:55)
>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> index b9b3f78f1324..a5d7a80b826d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>         /* we need to manually load the submit queue */
>>         if (execlists->ctrl_reg)
>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>> +
>> +       if (execlists_num_ports(execlists) > 1 &&
> pending[1] is always defined, the minimum submission is one slot, with
> pending[1] as the sentinel NULL.
>
>> +           execlists->pending[1] &&
>> +           !atomic_xchg(&execlists->overload, 1))
>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>
> engine->gt
>

Applied your suggestions above locally, will probably wait to have a few
more changes batched up before sending a v2.

>>  }
>>  
>>  static bool ctx_single_port_submission(const struct intel_context *ce)
>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>  
>>         WRITE_ONCE(execlists->active, execlists->inflight);
>> +
>> +       if (atomic_xchg(&execlists->overload, 0)) {
>> +               struct intel_engine_cs *engine =
>> +                       container_of(execlists, typeof(*engine), execlists);
>> +               intel_gt_pm_active_end(&engine->i915->gt);
>> +       }
>>  }
>>  
>>  static inline void
>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>                         /* port0 completed, advanced to port1 */
>>                         trace_ports(execlists, "completed", execlists->active);
>>  
>> +                       if (atomic_xchg(&execlists->overload, 0))
>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>
> So this loses track if we preempt a dual-ELSP submission with a
> single-ELSP submission (and never go back to dual).
>

Yes, good point.  You're right that if a dual-ELSP submission gets
preempted by a single-ELSP submission "overload" will remain signaled
until the first completion interrupt arrives (e.g. from the preempting
submission).

> If you move this to the end of the loop and check
>
> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
> 	intel_gt_pm_active_end(engine->gt);
>
> so that it covers both preemption/promotion and completion.
>

That sounds reasonable.

> However, that will fluctuate quite rapidly. (And runs the risk of
> exceeding the sentinel.)
>
> An alternative approach would be to couple along
> schedule_in/schedule_out
>
> atomic_set(overload, -1);
>
> __execlists_schedule_in:
> 	if (!atomic_fetch_inc(overload)
> 		intel_gt_pm_active_begin(engine->gt);
> __execlists_schedule_out:
> 	if (!atomic_dec_return(overload)
> 		intel_gt_pm_active_end(engine->gt);
>
> which would mean we are overloaded as soon as we try to submit an
> overlapping ELSP.
>

That sounds good to me too, and AFAICT would have roughly the same
behavior as this metric except for the preemption corner case you
mention above.  I'll try this and verify that I get approximately the
same performance numbers.

>
> The metric feels very multiple client (game + display server, or
> saturated transcode) centric. In the endless kernel world, we expect
> 100% engine utilisation from a single context, and never a dual-ELSP
> submission. They are also likely to want to avoid being throttled to
> conserve TDP for the CPU.
>
Yes, this metric is fairly conservative: it won't trigger in all cases
that would potentially benefit from the energy efficiency optimization,
only where we can be reasonably certain that CPU latency is not critical
in order to keep the GPU busy (e.g. because the CS has an additional
ELSP port pending execution that will immediately kick in as soon as the
current one completes).

My original approach was to call intel_gt_pm_active_begin() directly as
soon as the first ELSP is submitted to the GPU, which was somewhat more
effective at improving the energy efficiency of the system than waiting
for the second port to be in use, but it involved a slight execlists
submission latency cost that led to some regressions.  It would
certainly cover the single-context case you have in mind though.  I'll
get some updated numbers with my previous approach so we can decide
which one provides a better trade-off.

> Should we also reduce the overload for the number of clients who are
> waiting for interrupts from the GPU, so that their wakeup latency is not
> impacted?

A number of clients waiting doesn't necessarily indicate that wake-up
latency is a concern.  It frequently indicates the opposite: that the
GPU has a bottleneck which will only be exacerbated by attempting to
reduce the ramp-up latency of the CPU.  IOW, I think we should only care
about reducing the CPU wake-up latency in cases where the client is
unable to keep the GPU fully utilized with the latency target that
allows the GPU to run at maximum throughput -- if the client is unable
to, that will already cause the GPU utilization to drop, so the PM QoS
request will be removed whether it is waiting or not.

> -Chris

Thanks!


[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (9 preceding siblings ...)
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs Francisco Jerez
@ 2020-03-11  2:35 ` Pandruvada, Srinivas
  2020-03-11  3:55   ` Francisco Jerez
  2020-03-11  4:25 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for " Patchwork
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Pandruvada, Srinivas @ 2020-03-11  2:35 UTC (permalink / raw)
  To: linux-pm, currojerez, intel-gfx; +Cc: peterz, rjw

On Tue, 2020-03-10 at 14:41 -0700, Francisco Jerez wrote:
> 

[...]

> Thanks in advance for any review feed-back and test reports.
> 
> [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS
> limit.
> [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU
> load.
> [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters
> via debugfs.
> [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from
> pstate_funcs"
> [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller
> statistics and status calculation.
> [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target
> P-state range estimation.
> [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP
> parts.
> [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on
> ACPI FADT profile and CPUID.
> [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP
> controller status.
> [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller
> parameters via debugfs.
> 
Do you have a debug patch (you don't need to submit it as a patch) which
will allow me to dynamically disable/enable all these changes? I want to
compare and do some measurements.

Thanks,
Srinivas 

> [1] https://marc.info/?l=linux-pm&m=152221943320908&w=2
> [2] 
> https://github.com/curro/linux/commits/intel_pstate-vlp-v2-hwp-only
> [3] https://github.com/curro/linux/commits/intel_pstate-vlp-v2
> [4] 
> http://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL.log
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
  2020-03-11  2:35 ` [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
@ 2020-03-11  3:55   ` Francisco Jerez
  0 siblings, 0 replies; 44+ messages in thread
From: Francisco Jerez @ 2020-03-11  3:55 UTC (permalink / raw)
  To: Pandruvada, Srinivas, linux-pm, intel-gfx; +Cc: peterz, rjw


[-- Attachment #1.1.1: Type: text/plain, Size: 1540 bytes --]

"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:

> On Tue, 2020-03-10 at 14:41 -0700, Francisco Jerez wrote:
>> 
>
> [...]
>
>> Thanks in advance for any review feed-back and test reports.
>> 
>> [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS
>> limit.
>> [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU
>> load.
>> [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters
>> via debugfs.
>> [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from
>> pstate_funcs"
>> [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller
>> statistics and status calculation.
>> [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target
>> P-state range estimation.
>> [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP
>> parts.
>> [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on
>> ACPI FADT profile and CPUID.
>> [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP
>> controller status.
>> [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller
>> parameters via debugfs.
>> 
> Do you have a debug patch (you don't need to submit it as a patch) which
> will allow me to dynamically disable/enable all these changes? I want to
> compare and do some measurements.
>

Something like this (fully untested) patch?  It should prevent the VLP
controller from running if you do:

echo 16 > /sys/kernel/debug/pstate_snb/vlp_debug

> Thanks,
> Srinivas 
>
>>[...]


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: 0001-DEBUG.patch --]
[-- Type: text/x-patch, Size: 590 bytes --]

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index ab893a211746..8749b4a14447 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2411,6 +2411,9 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 static void intel_pstate_update_util_hwp_vlp(struct update_util_data *data,
 					     u64 time, unsigned int flags)
 {
 	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
 
+	if (vlp_params.debug & 16)
+		return;
+
 	if (update_vlp_sample(cpu, time, flags)) {


[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (10 preceding siblings ...)
  2020-03-11  2:35 ` [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
@ 2020-03-11  4:25 ` Patchwork
  2020-03-12  2:31 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for GPU-bound energy efficiency improvements for the intel_pstate driver (v2). (rev2) Patchwork
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2020-03-11  4:25 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: intel-gfx

== Series Details ==

Series: GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
URL   : https://patchwork.freedesktop.org/series/74540/
State : failure

== Summary ==

Applying: PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
error: sha1 information is lacking or useless (include/linux/pm_qos.h).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0001 PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-10 22:26   ` Chris Wilson
  2020-03-11  0:34     ` Francisco Jerez
@ 2020-03-11 10:00     ` Tvrtko Ursulin
  2020-03-11 10:21       ` Chris Wilson
  2020-03-11 19:54       ` Francisco Jerez
  1 sibling, 2 replies; 44+ messages in thread
From: Tvrtko Ursulin @ 2020-03-11 10:00 UTC (permalink / raw)
  To: Chris Wilson, Francisco Jerez, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas


On 10/03/2020 22:26, Chris Wilson wrote:
> Quoting Francisco Jerez (2020-03-10 21:41:55)
>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> index b9b3f78f1324..a5d7a80b826d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>          /* we need to manually load the submit queue */
>>          if (execlists->ctrl_reg)
>>                  writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>> +
>> +       if (execlists_num_ports(execlists) > 1 &&
> pending[1] is always defined, the minimum submission is one slot, with
> pending[1] as the sentinel NULL.
> 
>> +           execlists->pending[1] &&
>> +           !atomic_xchg(&execlists->overload, 1))
>> +               intel_gt_pm_active_begin(&engine->i915->gt);
> 
> engine->gt
> 
>>   }
>>   
>>   static bool ctx_single_port_submission(const struct intel_context *ce)
>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>          clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>   
>>          WRITE_ONCE(execlists->active, execlists->inflight);
>> +
>> +       if (atomic_xchg(&execlists->overload, 0)) {
>> +               struct intel_engine_cs *engine =
>> +                       container_of(execlists, typeof(*engine), execlists);
>> +               intel_gt_pm_active_end(&engine->i915->gt);
>> +       }
>>   }
>>   
>>   static inline void
>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>                          /* port0 completed, advanced to port1 */
>>                          trace_ports(execlists, "completed", execlists->active);
>>   
>> +                       if (atomic_xchg(&execlists->overload, 0))
>> +                               intel_gt_pm_active_end(&engine->i915->gt);
> 
> So this loses track if we preempt a dual-ELSP submission with a
> single-ELSP submission (and never go back to dual).
> 
> If you move this to the end of the loop and check
> 
> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
> 	intel_gt_pm_active_end(engine->gt);
> 
> so that it covers both preemption/promotion and completion.
> 
> However, that will fluctuate quite rapidly. (And runs the risk of
> exceeding the sentinel.)
> 
> An alternative approach would be to couple along
> schedule_in/schedule_out
> 
> atomic_set(overload, -1);
> 
> __execlists_schedule_in:
> 	if (!atomic_fetch_inc(overload)
> 		intel_gt_pm_active_begin(engine->gt);
> __execlists_schedule_out:
> 	if (!atomic_dec_return(overload)
> 		intel_gt_pm_active_end(engine->gt);
> 
> which would mean we are overloaded as soon as we try to submit an
> overlapping ELSP.

Putting it this low-level into submission code also would not work well 
with GuC.

How about we try to keep some accounting one level higher, as the i915 
scheduler is passing requests on to the backend for execution?

Or number of runnable contexts, if the distinction between contexts and 
requests is better for this purpose.

The problematic bit about going one level higher, though, is that the
exit point is less precisely coupled to the actual state. Or maybe with
the aggressive engine retirement we have nowadays it wouldn't be a
problem.

Regards,

Tvrtko

> 
> 
> The metric feels very multiple client (game + display server, or
> saturated transcode) centric. In the endless kernel world, we expect
> 100% engine utilisation from a single context, and never a dual-ELSP
> submission. They are also likely to want to avoid being throttled to
> conserve TDP for the CPU.
> 
> Should we also reduce the overload for the number of clients who are
> waiting for interrupts from the GPU, so that their wakeup latency is not
> impacted?
> -Chris
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-11 10:00     ` Tvrtko Ursulin
@ 2020-03-11 10:21       ` Chris Wilson
  2020-03-11 19:54       ` Francisco Jerez
  1 sibling, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2020-03-11 10:21 UTC (permalink / raw)
  To: Francisco Jerez, Tvrtko Ursulin, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

Quoting Tvrtko Ursulin (2020-03-11 10:00:41)
> 
> On 10/03/2020 22:26, Chris Wilson wrote:
> > Quoting Francisco Jerez (2020-03-10 21:41:55)
> >>   static inline void
> >> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
> >>                          /* port0 completed, advanced to port1 */
> >>                          trace_ports(execlists, "completed", execlists->active);
> >>   
> >> +                       if (atomic_xchg(&execlists->overload, 0))
> >> +                               intel_gt_pm_active_end(&engine->i915->gt);
> > 
> > So this loses track if we preempt a dual-ELSP submission with a
> > single-ELSP submission (and never go back to dual).
> > 
> > If you move this to the end of the loop and check
> > 
> > if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
> >       intel_gt_pm_active_end(engine->gt);
> > 
> > so that it covers both preemption/promotion and completion.
> > 
> > However, that will fluctuate quite rapidly. (And runs the risk of
> > exceeding the sentinel.)
> > 
> > An alternative approach would be to couple along
> > schedule_in/schedule_out
> > 
> > atomic_set(overload, -1);
> > 
> > __execlists_schedule_in:
> >       if (!atomic_fetch_inc(overload)
> >               intel_gt_pm_active_begin(engine->gt);
> > __execlists_schedule_out:
> >       if (!atomic_dec_return(overload)
> >               intel_gt_pm_active_end(engine->gt);
> > 
> > which would mean we are overloaded as soon as we try to submit an
> > overlapping ELSP.
> 
> Putting it this low-level into submission code also would not work well 
> with GuC.

We can cross that bridge when it is built. [The GuC is also likely to
not want to play with us anyway, and just use SLPC.]

Now, I suspect we may want to use an engine utilisation (busy-stats or
equivalent) metric, but honestly if we can finally land this work it
brings huge benefit for GPU bound TDP constrained workloads. (p-state
loves to starve the GPU even when it provides no extra benefit for the
CPU.) We can raise the bar, establish expected behaviour and then work
to maintain and keep on improving.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit Francisco Jerez
@ 2020-03-11 12:42   ` Peter Zijlstra
  2020-03-11 19:23     ` Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2020-03-11 12:42 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: intel-gfx, Rafael J. Wysocki, Pandruvada, Srinivas, linux-pm

On Tue, Mar 10, 2020 at 02:41:54PM -0700, Francisco Jerez wrote:
> +static void cpu_response_frequency_qos_apply(struct pm_qos_request *req,
> +					     enum pm_qos_req_action action,
> +					     s32 value)
> +{
> +	int ret = pm_qos_update_target(req->qos, &req->node, action, value);
> +
> +	if (ret > 0)
> +		wake_up_all_idle_cpus();
> +}

That's a pretty horrific thing to do; how often do we expect to call
this?
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Intel-gfx] [PATCHv2 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
  2020-03-11 19:23     ` Francisco Jerez
@ 2020-03-11 19:23       ` Francisco Jerez
  2020-03-19 10:25         ` Rafael J. Wysocki
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-11 19:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, Rafael J. Wysocki, Pandruvada, Srinivas, linux-pm

The purpose of this PM QoS limit is to give device drivers additional
control over the latency/energy efficiency trade-off made by the PM
subsystem (particularly the CPUFREQ governor).  It allows device
drivers to set a lower bound on the response latency of PM (defined as
the time it takes from wake-up to the CPU reaching a certain
steady-state level of performance [e.g. the nominal frequency] in
response to a step-function load).  It reports to PM the minimum
ramp-up latency considered of use to the application, and explicitly
requests PM to filter out oscillations faster than the specified
frequency.  It is somewhat complementary to the current
CPU_DMA_LATENCY PM QoS class which can be understood as specifying an
upper latency bound on the CPU wake-up time, instead of a lower bound
on the CPU frequency ramp-up time.

Note that even though this provides a latency constraint it's
represented as its reciprocal in Hz units for computational efficiency
(since it would take a 64-bit division to compute the number of cycles
elapsed from a time increment in nanoseconds and a time bound, while a
frequency can simply be multiplied with the time increment).

This implements a MAX constraint so that the strictest (highest
response frequency) request is honored.  This means that PM won't
provide any guarantee that frequencies greater than the specified
bound will be filtered, since that might be incompatible with the
constraints specified by another, more latency-sensitive application
(a more fine-grained result could be achieved with a scheduling-based
interface).  The default value needs to be equal to zero (best effort)
for it to behave as the identity of the MAX operation.

v2: Drop wake_up_all_idle_cpus() call from
    cpu_response_frequency_qos_apply() (Peter).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 include/linux/pm_qos.h       |   9 +++
 include/trace/events/power.h |  33 +++++----
 kernel/power/qos.c           | 138 ++++++++++++++++++++++++++++++++++-
 3 files changed, 162 insertions(+), 18 deletions(-)

diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 4a69d4af3ff8..b522e2194c05 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -28,6 +28,7 @@ enum pm_qos_flags_status {
 #define PM_QOS_LATENCY_ANY_NS	((s64)PM_QOS_LATENCY_ANY * NSEC_PER_USEC)
 
 #define PM_QOS_CPU_LATENCY_DEFAULT_VALUE	(2000 * USEC_PER_SEC)
+#define PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE 0
 #define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS	PM_QOS_LATENCY_ANY_NS
@@ -162,6 +163,14 @@ static inline void cpu_latency_qos_update_request(struct pm_qos_request *req,
 static inline void cpu_latency_qos_remove_request(struct pm_qos_request *req) {}
 #endif
 
+s32 cpu_response_frequency_qos_limit(void);
+bool cpu_response_frequency_qos_request_active(struct pm_qos_request *req);
+void cpu_response_frequency_qos_add_request(struct pm_qos_request *req,
+					    s32 value);
+void cpu_response_frequency_qos_update_request(struct pm_qos_request *req,
+					       s32 new_value);
+void cpu_response_frequency_qos_remove_request(struct pm_qos_request *req);
+
 #ifdef CONFIG_PM
 enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask);
 enum pm_qos_flags_status dev_pm_qos_flags(struct device *dev, s32 mask);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index af5018aa9517..7e4b52e8ca3a 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -359,45 +359,48 @@ DEFINE_EVENT(power_domain, power_domain_target,
 );
 
 /*
- * CPU latency QoS events used for global CPU latency QoS list updates
+ * CPU latency/response frequency QoS events used for global CPU PM
+ * QoS list updates.
  */
-DECLARE_EVENT_CLASS(cpu_latency_qos_request,
+DECLARE_EVENT_CLASS(pm_qos_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value),
+	TP_ARGS(name, value),
 
 	TP_STRUCT__entry(
+		__string(name,			 name		)
 		__field( s32,                    value          )
 	),
 
 	TP_fast_assign(
+		__assign_str(name, name);
 		__entry->value = value;
 	),
 
-	TP_printk("CPU_DMA_LATENCY value=%d",
-		  __entry->value)
+	TP_printk("pm_qos_class=%s value=%d",
+		  __get_str(name), __entry->value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_add_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_add_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_update_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_update_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_remove_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_remove_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
 /*
diff --git a/kernel/power/qos.c b/kernel/power/qos.c
index 32927682bcc4..49f140aa5aa1 100644
--- a/kernel/power/qos.c
+++ b/kernel/power/qos.c
@@ -271,7 +271,7 @@ void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value)
 		return;
 	}
 
-	trace_pm_qos_add_request(value);
+	trace_pm_qos_add_request("CPU_DMA_LATENCY", value);
 
 	req->qos = &cpu_latency_constraints;
 	cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value);
@@ -297,7 +297,7 @@ void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value)
 		return;
 	}
 
-	trace_pm_qos_update_request(new_value);
+	trace_pm_qos_update_request("CPU_DMA_LATENCY", new_value);
 
 	if (new_value == req->node.prio)
 		return;
@@ -323,7 +323,7 @@ void cpu_latency_qos_remove_request(struct pm_qos_request *req)
 		return;
 	}
 
-	trace_pm_qos_remove_request(PM_QOS_DEFAULT_VALUE);
+	trace_pm_qos_remove_request("CPU_DMA_LATENCY", PM_QOS_DEFAULT_VALUE);
 
 	cpu_latency_qos_apply(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
 	memset(req, 0, sizeof(*req));
@@ -424,6 +424,138 @@ static int __init cpu_latency_qos_init(void)
 late_initcall(cpu_latency_qos_init);
 #endif /* CONFIG_CPU_IDLE */
 
+/* Definitions related to the CPU response frequency QoS. */
+
+static struct pm_qos_constraints cpu_response_frequency_constraints = {
+	.list = PLIST_HEAD_INIT(cpu_response_frequency_constraints.list),
+	.target_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
+	.default_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
+	.no_constraint_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
+	.type = PM_QOS_MAX,
+};
+
+/**
+ * cpu_response_frequency_qos_limit - Return current system-wide CPU
+ *				      response frequency QoS limit.
+ */
+s32 cpu_response_frequency_qos_limit(void)
+{
+	return pm_qos_read_value(&cpu_response_frequency_constraints);
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_limit);
+
+/**
+ * cpu_response_frequency_qos_request_active - Check the given PM QoS request.
+ * @req: PM QoS request to check.
+ *
+ * Return: 'true' if @req has been added to the CPU response frequency
+ * QoS list, 'false' otherwise.
+ */
+bool cpu_response_frequency_qos_request_active(struct pm_qos_request *req)
+{
+	return req->qos == &cpu_response_frequency_constraints;
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_request_active);
+
+static void cpu_response_frequency_qos_apply(struct pm_qos_request *req,
+					     enum pm_qos_req_action action,
+					     s32 value)
+{
+	pm_qos_update_target(req->qos, &req->node, action, value);
+}
+
+/**
+ * cpu_response_frequency_qos_add_request - Add new CPU response
+ *					    frequency QoS request.
+ * @req: Pointer to a preallocated handle.
+ * @value: Requested constraint value.
+ *
+ * Use @value to initialize the request handle pointed to by @req,
+ * insert it as a new entry to the CPU response frequency QoS list and
+ * recompute the effective QoS constraint for that list.
+ *
+ * Callers need to save the handle for later use in updates and removal of the
+ * QoS request represented by it.
+ */
+void cpu_response_frequency_qos_add_request(struct pm_qos_request *req,
+					    s32 value)
+{
+	if (!req)
+		return;
+
+	if (cpu_response_frequency_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for already added request\n",
+		     __func__);
+		return;
+	}
+
+	trace_pm_qos_add_request("CPU_RESPONSE_FREQUENCY", value);
+
+	req->qos = &cpu_response_frequency_constraints;
+	cpu_response_frequency_qos_apply(req, PM_QOS_ADD_REQ, value);
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_add_request);
+
+/**
+ * cpu_response_frequency_qos_update_request - Modify existing CPU
+ *					       response frequency QoS
+ *					       request.
+ * @req : QoS request to update.
+ * @new_value: New requested constraint value.
+ *
+ * Use @new_value to update the QoS request represented by @req in the
+ * CPU response frequency QoS list along with updating the effective
+ * constraint value for that list.
+ */
+void cpu_response_frequency_qos_update_request(struct pm_qos_request *req,
+					       s32 new_value)
+{
+	if (!req)
+		return;
+
+	if (!cpu_response_frequency_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_update_request("CPU_RESPONSE_FREQUENCY", new_value);
+
+	if (new_value == req->node.prio)
+		return;
+
+	cpu_response_frequency_qos_apply(req, PM_QOS_UPDATE_REQ, new_value);
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_update_request);
+
+/**
+ * cpu_response_frequency_qos_remove_request - Remove existing CPU
+ *					       response frequency QoS
+ *					       request.
+ * @req: QoS request to remove.
+ *
+ * Remove the CPU response frequency QoS request represented by @req
+ * from the CPU response frequency QoS list along with updating the
+ * effective constraint value for that list.
+ */
+void cpu_response_frequency_qos_remove_request(struct pm_qos_request *req)
+{
+	if (!req)
+		return;
+
+	if (!cpu_response_frequency_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_remove_request("CPU_RESPONSE_FREQUENCY",
+				    PM_QOS_DEFAULT_VALUE);
+
+	cpu_response_frequency_qos_apply(req, PM_QOS_REMOVE_REQ,
+					 PM_QOS_DEFAULT_VALUE);
+	memset(req, 0, sizeof(*req));
+}
+EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_remove_request);
+
 /* Definitions related to the frequency QoS below. */
 
 /**
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
  2020-03-11 12:42   ` Peter Zijlstra
@ 2020-03-11 19:23     ` Francisco Jerez
  2020-03-11 19:23       ` [Intel-gfx] [PATCHv2 " Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-11 19:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, Rafael J. Wysocki, Pandruvada, Srinivas, linux-pm



Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Mar 10, 2020 at 02:41:54PM -0700, Francisco Jerez wrote:
>> +static void cpu_response_frequency_qos_apply(struct pm_qos_request *req,
>> +					     enum pm_qos_req_action action,
>> +					     s32 value)
>> +{
>> +	int ret = pm_qos_update_target(req->qos, &req->node, action, value);
>> +
>> +	if (ret > 0)
>> +		wake_up_all_idle_cpus();
>> +}
>
> That's a pretty horrific thing to do; how often do we expect to call
> this?

Dropped.  It sneaked in while copy-pasting cpu_latency_qos_apply(), but
it's not necessary for our use-case.
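
FWIW, in case it helps review, this is roughly how I expect a driver
to use the new interface -- purely hypothetical consumer code, not
part of this series, and the value passed in is just a placeholder
(the units are defined by the new QoS class, a response frequency):

#include <linux/pm_qos.h>

/* Handle must stay allocated for as long as the request is active. */
static struct pm_qos_request foo_rf_qos;

static void foo_gpu_bound_begin(void)
{
	/* Start (or refresh) our response frequency request. */
	if (!cpu_response_frequency_qos_request_active(&foo_rf_qos))
		cpu_response_frequency_qos_add_request(&foo_rf_qos, 30);
	else
		cpu_response_frequency_qos_update_request(&foo_rf_qos, 30);
}

static void foo_gpu_bound_end(void)
{
	/* Drop the request so the default (no constraint) applies again. */
	if (cpu_response_frequency_qos_request_active(&foo_rf_qos))
		cpu_response_frequency_qos_remove_request(&foo_rf_qos);
}

Since the constraint list uses PM_QOS_MAX, the limit seen by the
governor through cpu_response_frequency_qos_limit() is the largest
value among the currently active requests (or the default when there
are none).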


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-11 10:00     ` Tvrtko Ursulin
  2020-03-11 10:21       ` Chris Wilson
@ 2020-03-11 19:54       ` Francisco Jerez
  2020-03-12 11:52         ` Tvrtko Ursulin
  1 sibling, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-11 19:54 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas



Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:

> On 10/03/2020 22:26, Chris Wilson wrote:
>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> index b9b3f78f1324..a5d7a80b826d 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>          /* we need to manually load the submit queue */
>>>          if (execlists->ctrl_reg)
>>>                  writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>> +
>>> +       if (execlists_num_ports(execlists) > 1 &&
>> pending[1] is always defined, the minimum submission is one slot, with
>> pending[1] as the sentinel NULL.
>> 
>>> +           execlists->pending[1] &&
>>> +           !atomic_xchg(&execlists->overload, 1))
>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>> 
>> engine->gt
>> 
>>>   }
>>>   
>>>   static bool ctx_single_port_submission(const struct intel_context *ce)
>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>          clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>   
>>>          WRITE_ONCE(execlists->active, execlists->inflight);
>>> +
>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>> +               struct intel_engine_cs *engine =
>>> +                       container_of(execlists, typeof(*engine), execlists);
>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>> +       }
>>>   }
>>>   
>>>   static inline void
>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>                          /* port0 completed, advanced to port1 */
>>>                          trace_ports(execlists, "completed", execlists->active);
>>>   
>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>> 
>> So this loses track if we preempt a dual-ELSP submission with a
>> single-ELSP submission (and never go back to dual).
>> 
>> If you move this to the end of the loop and check
>> 
>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>> 	intel_gt_pm_active_end(engine->gt);
>> 
>> so that it covers both preemption/promotion and completion.
>> 
>> However, that will fluctuate quite rapidly. (And runs the risk of
>> exceeding the sentinel.)
>> 
>> An alternative approach would be to couple along
>> schedule_in/schedule_out
>> 
>> atomic_set(overload, -1);
>> 
>> __execlists_schedule_in:
>> 	if (!atomic_fetch_inc(overload)
>> 		intel_gt_pm_active_begin(engine->gt);
>> __execlists_schedule_out:
>> 	if (!atomic_dec_return(overload)
>> 		intel_gt_pm_active_end(engine->gt);
>> 
>> which would mean we are overloaded as soon as we try to submit an
>> overlapping ELSP.
>
> Putting it this low-level into submission code also would not work well 
> with GuC.
>

I wrote a patch at some point that added calls to
intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
submission code in order to obtain a similar effect.  However people
requested me to leave GuC submission alone for the moment in order to
avoid interference with SLPC.  At some point it might make sense to hook
this up in combination with SLPC, because SLPC doesn't provide much of a
CPU energy efficiency advantage in comparison to this series.

> How about we try to keep some accounting one level higher, as the i915 
> scheduler is passing requests on to the backend for execution?
>
> Or number of runnable contexts, if the distinction between contexts and 
> requests is better for this purpose.
>
> Problematic bit in going one level higher though is that the exit point 
> is less precisely coupled to the actual state. Or maybe with aggressive 
> engine retire we have nowadays it wouldn't be a problem.
>

The main advantage of instrumenting the execlists submission code at a
low level is that it gives us visibility over the number of ELSP ports
pending execution, which can cause the performance of the workload to be
substantially more or less latency-sensitive.  GuC submission shouldn't
care about this variable, so it kind of makes sense for its behavior to
be slightly different.

Anyway if we're willing to give up the accuracy of keeping track of this
at a low level (and give GuC submission exactly the same treatment) it
should be possible to move the tracking one level up.

> Regards,
>
> Tvrtko
>

Thank you.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for GPU-bound energy efficiency improvements for the intel_pstate driver (v2). (rev2)
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (11 preceding siblings ...)
  2020-03-11  4:25 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for " Patchwork
@ 2020-03-12  2:31 ` Patchwork
  2020-03-12  2:32 ` Patchwork
  2020-03-23 23:29 ` [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
  14 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2020-03-12  2:31 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: intel-gfx

== Series Details ==

Series: GPU-bound energy efficiency improvements for the intel_pstate driver (v2). (rev2)
URL   : https://patchwork.freedesktop.org/series/74540/
State : failure

== Summary ==

Applying: PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
error: sha1 information is lacking or useless (include/linux/pm_qos.h).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0001 PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


^ permalink raw reply	[flat|nested] 44+ messages in thread


* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-11 19:54       ` Francisco Jerez
@ 2020-03-12 11:52         ` Tvrtko Ursulin
  2020-03-13  7:39           ` Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Tvrtko Ursulin @ 2020-03-12 11:52 UTC (permalink / raw)
  To: Francisco Jerez, Chris Wilson, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas


On 11/03/2020 19:54, Francisco Jerez wrote:
> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
> 
>> On 10/03/2020 22:26, Chris Wilson wrote:
>>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> index b9b3f78f1324..a5d7a80b826d 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>>           /* we need to manually load the submit queue */
>>>>           if (execlists->ctrl_reg)
>>>>                   writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>>> +
>>>> +       if (execlists_num_ports(execlists) > 1 &&
>>> pending[1] is always defined, the minimum submission is one slot, with
>>> pending[1] as the sentinel NULL.
>>>
>>>> +           execlists->pending[1] &&
>>>> +           !atomic_xchg(&execlists->overload, 1))
>>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>>
>>> engine->gt
>>>
>>>>    }
>>>>    
>>>>    static bool ctx_single_port_submission(const struct intel_context *ce)
>>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>>           clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>>    
>>>>           WRITE_ONCE(execlists->active, execlists->inflight);
>>>> +
>>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>>> +               struct intel_engine_cs *engine =
>>>> +                       container_of(execlists, typeof(*engine), execlists);
>>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>>> +       }
>>>>    }
>>>>    
>>>>    static inline void
>>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>>                           /* port0 completed, advanced to port1 */
>>>>                           trace_ports(execlists, "completed", execlists->active);
>>>>    
>>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>>
>>> So this loses track if we preempt a dual-ELSP submission with a
>>> single-ELSP submission (and never go back to dual).
>>>
>>> If you move this to the end of the loop and check
>>>
>>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>> 	intel_gt_pm_active_end(engine->gt);
>>>
>>> so that it covers both preemption/promotion and completion.
>>>
>>> However, that will fluctuate quite rapidly. (And runs the risk of
>>> exceeding the sentinel.)
>>>
>>> An alternative approach would be to couple along
>>> schedule_in/schedule_out
>>>
>>> atomic_set(overload, -1);
>>>
>>> __execlists_schedule_in:
>>> 	if (!atomic_fetch_inc(overload)
>>> 		intel_gt_pm_active_begin(engine->gt);
>>> __execlists_schedule_out:
>>> 	if (!atomic_dec_return(overload)
>>> 		intel_gt_pm_active_end(engine->gt);
>>>
>>> which would mean we are overloaded as soon as we try to submit an
>>> overlapping ELSP.
>>
>> Putting it this low-level into submission code also would not work well
>> with GuC.
>>
> 
> I wrote a patch at some point that added calls to
> intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
> submission code in order to obtain a similar effect.  However people
> requested me to leave GuC submission alone for the moment in order to
> avoid interference with SLPC.  At some point it might make sense to hook
> this up in combination with SLPC, because SLPC doesn't provide much of a
> CPU energy efficiency advantage in comparison to this series.
> 
>> How about we try to keep some accounting one level higher, as the i915
>> scheduler is passing requests on to the backend for execution?
>>
>> Or number of runnable contexts, if the distinction between contexts and
>> requests is better for this purpose.
>>
>> Problematic bit in going one level higher though is that the exit point
>> is less precisely coupled to the actual state. Or maybe with aggressive
>> engine retire we have nowadays it wouldn't be a problem.
>>
> 
> The main advantage of instrumenting the execlists submission code at a
> low level is that it gives us visibility over the number of ELSP ports
> pending execution, which can cause the performance of the workload to be
> substantially more or less latency-sensitive.  GuC submission shouldn't
> care about this variable, so it kind of makes sense for its behavior to
> be slightly different.
> 
> Anyway if we're willing to give up the accuracy of keeping track of this
> at a low level (and give GuC submission exactly the same treatment) it
> should be possible to move the tracking one level up.

The results you got are certainly extremely attractive and the approach 
and code looks tidy and mature - just so you don't get me wrong that I 
am not objecting to the idea.

What I'd like to see is an easier to read breakdown of results, at 
minimum with separate perf and perf-per-Watt results. A graph with 
sorted results and error bars would also be nice.

Secondly, in the commit message of this particular patch I'd like to 
read some more thoughts about why ELSP[1] occupancy is thought to be 
the desired signal. Why, for instance, shouldn't a deep ELSP[0] benefit 
from more TDP budget towards the GPU, and similar.

Also a description of what the control processing "rf_qos" function 
does with this signal. What and why.

Some time ago we entertained the idea of GPU "load average", where that 
was defined as a count of runnable requests (so batch buffers). How 
that, more generic metric, would behave here if used as an input signal 
really intrigues me. Sadly I don't have a patch ready to give to you and 
ask to please test it.

Or maybe the key is count of runnable contexts as opposed to requests, 
which would more match the ELSP[1] idea.

But this is secondary, I primarily think we need to see a better 
presentation of the result and the theory of operation explained better 
in the commit message.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-12 11:52         ` Tvrtko Ursulin
@ 2020-03-13  7:39           ` Francisco Jerez
  2020-03-16 20:54             ` Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-13  7:39 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas



Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:

> On 11/03/2020 19:54, Francisco Jerez wrote:
>> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
>> 
>>> On 10/03/2020 22:26, Chris Wilson wrote:
>>>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>>> index b9b3f78f1324..a5d7a80b826d 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>>>           /* we need to manually load the submit queue */
>>>>>           if (execlists->ctrl_reg)
>>>>>                   writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>>>> +
>>>>> +       if (execlists_num_ports(execlists) > 1 &&
>>>> pending[1] is always defined, the minimum submission is one slot, with
>>>> pending[1] as the sentinel NULL.
>>>>
>>>>> +           execlists->pending[1] &&
>>>>> +           !atomic_xchg(&execlists->overload, 1))
>>>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>>>
>>>> engine->gt
>>>>
>>>>>    }
>>>>>    
>>>>>    static bool ctx_single_port_submission(const struct intel_context *ce)
>>>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>>>           clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>>>    
>>>>>           WRITE_ONCE(execlists->active, execlists->inflight);
>>>>> +
>>>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>>>> +               struct intel_engine_cs *engine =
>>>>> +                       container_of(execlists, typeof(*engine), execlists);
>>>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>>>> +       }
>>>>>    }
>>>>>    
>>>>>    static inline void
>>>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>>>                           /* port0 completed, advanced to port1 */
>>>>>                           trace_ports(execlists, "completed", execlists->active);
>>>>>    
>>>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>>>
>>>> So this loses track if we preempt a dual-ELSP submission with a
>>>> single-ELSP submission (and never go back to dual).
>>>>
>>>> If you move this to the end of the loop and check
>>>>
>>>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>>> 	intel_gt_pm_active_end(engine->gt);
>>>>
>>>> so that it covers both preemption/promotion and completion.
>>>>
>>>> However, that will fluctuate quite rapidly. (And runs the risk of
>>>> exceeding the sentinel.)
>>>>
>>>> An alternative approach would be to couple along
>>>> schedule_in/schedule_out
>>>>
>>>> atomic_set(overload, -1);
>>>>
>>>> __execlists_schedule_in:
>>>> 	if (!atomic_fetch_inc(overload)
>>>> 		intel_gt_pm_active_begin(engine->gt);
>>>> __execlists_schedule_out:
>>>> 	if (!atomic_dec_return(overload)
>>>> 		intel_gt_pm_active_end(engine->gt);
>>>>
>>>> which would mean we are overloaded as soon as we try to submit an
>>>> overlapping ELSP.
>>>
>>> Putting it this low-level into submission code also would not work well
>>> with GuC.
>>>
>> 
>> I wrote a patch at some point that added calls to
>> intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
>> submission code in order to obtain a similar effect.  However people
>> requested me to leave GuC submission alone for the moment in order to
>> avoid interference with SLPC.  At some point it might make sense to hook
>> this up in combination with SLPC, because SLPC doesn't provide much of a
>> CPU energy efficiency advantage in comparison to this series.
>> 
>>> How about we try to keep some accounting one level higher, as the i915
>>> scheduler is passing requests on to the backend for execution?
>>>
>>> Or number of runnable contexts, if the distinction between contexts and
>>> requests is better for this purpose.
>>>
>>> Problematic bit in going one level higher though is that the exit point
>>> is less precisely coupled to the actual state. Or maybe with aggressive
>>> engine retire we have nowadays it wouldn't be a problem.
>>>
>> 
>> The main advantage of instrumenting the execlists submission code at a
>> low level is that it gives us visibility over the number of ELSP ports
>> pending execution, which can cause the performance of the workload to be
>> substantially more or less latency-sensitive.  GuC submission shouldn't
>> care about this variable, so it kind of makes sense for its behavior to
>> be slightly different.
>> 
>> Anyway if we're willing to give up the accuracy of keeping track of this
>> at a low level (and give GuC submission exactly the same treatment) it
>> should be possible to move the tracking one level up.
>
> The results you got are certainly extremely attractive and the approach 
> and code looks tidy and mature - just so you don't get me wrong that I 
> am not objecting to the idea.
>
> What I'd like to see is an easier to read breakdown of results, at 
> minimum with separate perf and perf-per-Watt results. A graph with 
> sorted results and error bars would also be nice.
>

I just plotted the same results from the cover letter in separate
performance and energy efficiency graphs:

https://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL-perf.svg
https://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL-perf-per-watt.svg

> Secondly, in the commit message of this particular patch I'd like to 
> read some more thoughts about why ELSP[1] occupancy is thought to be 
> the desired signal. Why, for instance, shouldn't a deep ELSP[0] benefit 
> from more TDP budget towards the GPU, and similar.
>
> Also a description of what the control processing "rf_qos" function 
> does with this signal. What and why.
>

I'll work on a better commit message for v2.

> Some time ago we entertained the idea of GPU "load average", where that 
> was defined as a count of runnable requests (so batch buffers). How 
> that, more generic metric, would behave here if used as an input signal 
> really intrigues me. Sadly I don't have a patch ready to give to you and 
> ask to please test it.
>
> Or maybe the key is count of runnable contexts as opposed to requests, 
> which would more match the ELSP[1] idea.
>

Ultimately what we're trying to determine here is whether the
performance of the graphics workload is sensitive to the latency of the
CPU -- If it is we don't want to place a response latency constraint.
If the two ELSP ports are in use somewhere close to 100% of the time,
we know that for most of the run-time of the workload the completion of
one request leads to the immediate execution of another, which means
that the GPU can be kept busy without the execlists submission code
rushing to submit a new request, so latency isn't a concern.

Looking at the number of runnable contexts is very close but not exactly
equivalent to that, since the workload may still be latency-sensitive if
the multiple contexts are only being submitted to a single port.

In the GuC submission case the CPU doesn't need to get involved to
submit the next piece of work (unless there is some cyclical dependency
between CPU and GPU work, that is), so it should be sufficient to look
at whether at least one port is active.  Also, even while using
execlists, there are applications which are able to keep some GPU
engine busy nearly 100% of the time (meaning that their performance
won't increase with decreasing latency since the engine can hardly do
more work than that), but are unable to keep the two ELSP ports busy
for any significant fraction of that time, so it would be more accurate
for them to use the single-port utilization as heuristic (which, yeah,
is also roughly equivalent to the fraction of time that at least one
runnable context or request was pending execution), at the cost of
neglecting the applications that are actually sensitive to the ELSP
submission latency.
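
In terms of the code from the quoted patch, the difference boils down
to when the GPU-bound state gets flagged.  This series only does it
while an overlapping (dual-ELSP) submission is pending -- sketch,
simplified from the hunk quoted above, with the engine->gt fixup Chris
suggested:

	if (execlists_num_ports(execlists) > 1 &&
	    execlists->pending[1] &&
	    !atomic_xchg(&execlists->overload, 1))
		intel_gt_pm_active_begin(engine->gt);

whereas the more aggressive alternative would flag it whenever any
request at all is in flight on the engine, which is roughly the
alternative I'm trying to get numbers for.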

This patch takes the rather conservative approach of limiting the
application of the response frequency PM QoS request to the more
restrictive set of cases where we are most certain that CPU latency
shouldn't be an issue, in order to avoid regressions.  But it might be
that you find the additional energy efficiency benefit from the more
aggressive approach to be worth the cost to a few execlists submission
latency-sensitive applications.  I'm trying to get some numbers
comparing the two approaches now, will post them here once I have
results so we can make a more informed trade-off.

> But this is secondary, I primarily think we need to see a better 
> presentation of the result and the theory of operation explained better 
> in the commit message.
>

Sure, I'll see what I can come up with.

> Regards,
>
> Tvrtko

Thank you.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-13  7:39           ` Francisco Jerez
@ 2020-03-16 20:54             ` Francisco Jerez
  0 siblings, 0 replies; 44+ messages in thread
From: Francisco Jerez @ 2020-03-16 20:54 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas, chris.p.wilson



Francisco Jerez <currojerez@riseup.net> writes:

> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
>[...]
>> Some time ago we entertained the idea of GPU "load average", where that 
>> was defined as a count of runnable requests (so batch buffers). How 
>> that, more generic metric, would behave here if used as an input signal 
>> really intrigues me. Sadly I don't have a patch ready to give to you and 
>> ask to please test it.
>>
>> Or maybe the key is count of runnable contexts as opposed to requests, 
>> which would more match the ELSP[1] idea.
>>
>[..]
> This patch takes the rather conservative approach of limiting the
> application of the response frequency PM QoS request to the more
> restrictive set of cases where we are most certain that CPU latency
> shouldn't be an issue, in order to avoid regressions.  But it might be
> that you find the additional energy efficiency benefit from the more
> aggressive approach to be worth the cost to a few execlists submission
> latency-sensitive applications.  I'm trying to get some numbers
> comparing the two approaches now, will post them here once I have
> results so we can make a more informed trade-off.
>

I got some results from the promised comparison between the dual-ELSP
utilization approach used in this series and the more obvious
alternative of keeping track of the time that any request (or context)
is in flight.  As expected there are quite a few performance
improvements (numbers relative to this approach), however most of them
are either synthetic benchmarks or off-screen variants of benchmarks
(the corresponding on-screen variant of each benchmark below doesn't
show a significant improvement):

 synmark/OglCSDof:                                                                      XXX ±0.15% x18 ->   XXX ±0.22% x12          d=1.15% ±0.18%       p=0.00%
 synmark/OglDeferred:                                                                   XXX ±0.31% x18 ->   XXX ±0.15% x12          d=1.16% ±0.26%       p=0.00%
 synmark/OglTexFilterAniso:                                                             XXX ±0.18% x18 ->   XXX ±0.21% x12          d=1.25% ±0.19%       p=0.00%
 synmark/OglPSPhong:                                                                    XXX ±0.43% x18 ->   XXX ±0.29% x12          d=1.28% ±0.38%       p=0.00%
 synmark/OglBatch0:                                                                     XXX ±0.40% x18 ->   XXX ±0.53% x12          d=1.29% ±0.46%       p=0.00%
 synmark/OglVSDiffuse8:                                                                 XXX ±0.49% x17 ->   XXX ±0.25% x12          d=1.30% ±0.41%       p=0.00%
 synmark/OglVSTangent:                                                                  XXX ±0.53% x18 ->   XXX ±0.31% x12          d=1.31% ±0.46%       p=0.00%
 synmark/OglGeomPoint:                                                                  XXX ±0.56% x18 ->   XXX ±0.15% x12          d=1.48% ±0.44%       p=0.00%
 gputest/plot3d:                                                                        XXX ±0.16% x18 ->   XXX ±0.11% x12          d=1.50% ±0.14%       p=0.00%
 gputest/tess_x32:                                                                      XXX ±0.15% x18 ->   XXX ±0.06% x12          d=1.59% ±0.13%       p=0.00%
 synmark/OglTexFilterTri:                                                               XXX ±0.15% x18 ->   XXX ±0.19% x12          d=1.62% ±0.17%       p=0.00%
 synmark/OglBatch3:                                                                     XXX ±0.57% x18 ->   XXX ±0.33% x12          d=1.70% ±0.49%       p=0.00%
 synmark/OglBatch1:                                                                     XXX ±0.41% x18 ->   XXX ±0.34% x12          d=1.81% ±0.38%       p=0.00%
 synmark/OglShMapVsm:                                                                   XXX ±0.53% x18 ->   XXX ±0.38% x12          d=1.81% ±0.48%       p=0.00%
 synmark/OglTexMem128:                                                                  XXX ±0.62% x18 ->   XXX ±0.29% x12          d=1.87% ±0.52%       p=0.00%
 phoronix/x11perf/test=Scrolling 500 x 500 px:                                           XXX ±0.35% x6 ->   XXX ±0.56% x12          d=2.23% ±0.52%       p=0.00%
 phoronix/x11perf/test=500px Copy From Window To Window:                                 XXX ±0.00% x3 ->   XXX ±0.74% x12          d=2.41% ±0.70%       p=0.01%
 gfxbench/gl_trex_off:                                                                   XXX ±0.04% x3 ->   XXX ±0.34% x12          d=2.59% ±0.32%       p=0.00%
 synmark/OglBatch2:                                                                     XXX ±0.85% x18 ->   XXX ±0.21% x12          d=2.87% ±0.67%       p=0.00%
 glbenchmark/GLB27_EgyptHD_inherited_C24Z16_FixedTime_Offscreen:                         XXX ±0.35% x3 ->   XXX ±0.84% x12          d=3.03% ±0.81%       p=0.01%
 glbenchmark/GLB27_TRex_C24Z16_Offscreen:                                                XXX ±0.23% x3 ->   XXX ±0.32% x12          d=3.09% ±0.32%       p=0.00%
 synmark/OglCSCloth:                                                                    XXX ±0.60% x18 ->   XXX ±0.29% x12          d=3.76% ±0.50%       p=0.00%
 phoronix/x11perf/test=Copy 500x500 From Pixmap To Pixmap:                               XXX ±0.44% x3 ->   XXX ±0.70% x12          d=4.31% ±0.69%       p=0.00%

There aren't as many regressions (numbers relative to the upstream
linux-next kernel); they're mostly 2D test-cases, but they are
substantially worse in absolute value:

 phoronix/jxrendermark/rendering-test=12pt Text LCD/rendering-size=128x128:              XXX ±0.30% x26 ->  XXX ±5.71% x26        d=-23.15% ±3.11%       p=0.00%
 phoronix/jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:      XXX ±0.30% x26 ->  XXX ±4.32% x26        d=-21.34% ±2.41%       p=0.00%
 phoronix/x11perf/test=500px Compositing From Pixmap To Window:                         XXX ±15.46% x26 -> XXX ±12.76% x26       d=-19.05% ±13.15%       p=0.00%
 phoronix/jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±0.20% x26 ->  XXX ±3.82% x27         d=-5.07% ±2.57%       p=0.00%
 phoronix/gtkperf/gtk-test=GtkDrawingArea - Pixbufs:                                     XXX ±2.81% x26 ->  XXX ±2.10% x26         d=-3.59% ±2.45%       p=0.00%
 warsow/benchsow:                                                                        XXX ±0.61% x26 ->  XXX ±1.41% x27         d=-2.45% ±1.07%       p=0.00%
 synmark/OglTerrainFlyInst:                                                              XXX ±0.44% x25 ->  XXX ±0.74% x25         d=-1.24% ±0.60%       p=0.00%

There are some things we might be able to do to get some of the
additional improvement we can see above without hurting
latency-sensitive workloads, but it's going to take more effort; the
present approach of using the dual-ELSP utilization seems like a good
compromise to me for starters.

>[...]


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts Francisco Jerez
@ 2020-03-17 23:59   ` Pandruvada, Srinivas
  2020-03-18 19:51     ` Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Pandruvada, Srinivas @ 2020-03-17 23:59 UTC (permalink / raw)
  To: linux-pm, currojerez, intel-gfx; +Cc: peterz, rjw

On Tue, 2020-03-10 at 14:42 -0700, Francisco Jerez wrote:
> This implements a simple variably low-pass-filtering governor in
> control of the HWP MIN/MAX PERF range based on the previously
> introduced get_vlp_target_range().  See "cpufreq: intel_pstate:
> Implement VLP controller target P-state range estimation." for the
> rationale.

I just gave it a try on a pretty idle system with just systemd
processes and usual background tasks with nomodeset.

I see that the HWP min is getting changed between 4-8.  Why are you
changing the HWP dynamic range even on an idle system running nowhere
close to TDP?

Thanks,
Srinivas


> 
> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> ---
>  drivers/cpufreq/intel_pstate.c | 79
> +++++++++++++++++++++++++++++++++-
>  1 file changed, 77 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c
> b/drivers/cpufreq/intel_pstate.c
> index cecadfec8bc1..a01eed40d897 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1905,6 +1905,20 @@ static void intel_pstate_reset_vlp(struct
> cpudata *cpu)
>  	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
>  	vlp->target.p_base = 0;
>  	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
> +
> +	if (hwp_active) {
> +		const uint32_t p0 = max(cpu->pstate.min_pstate,
> +					cpu->min_perf_ratio);
> +		const uint32_t p1 = max_t(uint32_t, p0, cpu-
> >max_perf_ratio);
> +		const uint64_t hwp_req = (READ_ONCE(cpu-
> >hwp_req_cached) &
> +					  ~(HWP_MAX_PERF(~0L) |
> +					    HWP_MIN_PERF(~0L) |
> +					    HWP_DESIRED_PERF(~0L))) |
> +					 HWP_MIN_PERF(p0) |
> HWP_MAX_PERF(p1);
> +
> +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
> +		cpu->hwp_req_cached = hwp_req;
> +	}
>  }
>  
>  /**
> @@ -2222,6 +2236,46 @@ static void intel_pstate_adjust_pstate(struct
> cpudata *cpu)
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>  
> +static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
> +					     const unsigned int
> range[])
> +{
> +	const int from = cpu->hwp_req_cached;
> +	unsigned int p0, p1, p_min, p_max;
> +	struct sample *sample;
> +	uint64_t hwp_req;
> +
> +	update_turbo_state();
> +
> +	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
> +	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
> +	p_min = clamp_t(unsigned int, range[0], p0, p1);
> +	p_max = clamp_t(unsigned int, range[1], p0, p1);
> +
> +	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu->cpu);
> +
> +	hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
> +		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
> +		     HWP_DESIRED_PERF(~0L))) |
> +		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 : p_min) |
> +		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 : p_max);
> +
> +	if (hwp_req != cpu->hwp_req_cached) {
> +		wrmsrl(MSR_HWP_REQUEST, hwp_req);
> +		cpu->hwp_req_cached = hwp_req;
> +	}
> +
> +	sample = &cpu->sample;
> +	trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
> +			    fp_toint(sample->busy_scaled),
> +			    from,
> +			    hwp_req,
> +			    sample->mperf,
> +			    sample->aperf,
> +			    sample->tsc,
> +			    get_avg_frequency(cpu),
> +			    fp_toint(cpu->iowait_boost * 100));
> +}
> +
>  static void intel_pstate_update_util(struct update_util_data *data,
> u64 time,
>  				     unsigned int flags)
>  {
> @@ -2260,6 +2314,22 @@ static void intel_pstate_update_util(struct
> update_util_data *data, u64 time,
>  		intel_pstate_adjust_pstate(cpu);
>  }
>  
> +/**
> + * Implementation of the cpufreq update_util hook based on the VLP
> + * controller (see get_vlp_target_range()).
> + */
> +static void intel_pstate_update_util_hwp_vlp(struct update_util_data
> *data,
> +					     u64 time, unsigned int
> flags)
> +{
> +	struct cpudata *cpu = container_of(data, struct cpudata,
> update_util);
> +
> +	if (update_vlp_sample(cpu, time, flags)) {
> +		const struct vlp_target_range *target =
> +			get_vlp_target_range(cpu);
> +		intel_pstate_adjust_pstate_range(cpu, target->value);
> +	}
> +}
> +
>  static struct pstate_funcs core_funcs = {
>  	.get_max = core_get_max_pstate,
>  	.get_max_physical = core_get_max_pstate_physical,
> @@ -2389,6 +2459,9 @@ static int intel_pstate_init_cpu(unsigned int
> cpunum)
>  
>  	intel_pstate_get_cpu_pstates(cpu);
>  
> +	if (pstate_funcs.update_util ==
> intel_pstate_update_util_hwp_vlp)
> +		intel_pstate_reset_vlp(cpu);
> +
>  	pr_debug("controlling: cpu %d\n", cpunum);
>  
>  	return 0;
> @@ -2398,7 +2471,8 @@ static void
> intel_pstate_set_update_util_hook(unsigned int cpu_num)
>  {
>  	struct cpudata *cpu = all_cpu_data[cpu_num];
>  
> -	if (hwp_active && !hwp_boost)
> +	if (hwp_active && !hwp_boost &&
> +	    pstate_funcs.update_util !=
> intel_pstate_update_util_hwp_vlp)
>  		return;
>  
>  	if (cpu->update_util_set)
> @@ -2526,7 +2600,8 @@ static int intel_pstate_set_policy(struct
> cpufreq_policy *policy)
>  		 * was turned off, in that case we need to clear the
>  		 * update util hook.
>  		 */
> -		if (!hwp_boost)
> +		if (!hwp_boost && pstate_funcs.update_util !=
> +				  intel_pstate_update_util_hwp_vlp)
>  			intel_pstate_clear_update_util_hook(policy-
> >cpu);
>  		intel_pstate_hwp_set(policy->cpu);
>  	}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-11  0:34     ` Francisco Jerez
@ 2020-03-18 19:42       ` Francisco Jerez
  2020-03-20  2:46         ` Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-18 19:42 UTC (permalink / raw)
  To: chris.p.wilson, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas



Francisco Jerez <currojerez@riseup.net> writes:

> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> index b9b3f78f1324..a5d7a80b826d 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>         /* we need to manually load the submit queue */
>>>         if (execlists->ctrl_reg)
>>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>> +
>>> +       if (execlists_num_ports(execlists) > 1 &&
>> pending[1] is always defined, the minimum submission is one slot, with
>> pending[1] as the sentinel NULL.
>>
>>> +           execlists->pending[1] &&
>>> +           !atomic_xchg(&execlists->overload, 1))
>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>
>> engine->gt
>>
>
> Applied your suggestions above locally, will probably wait to have a few
> more changes batched up before sending a v2.
>
>>>  }
>>>  
>>>  static bool ctx_single_port_submission(const struct intel_context *ce)
>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>  
>>>         WRITE_ONCE(execlists->active, execlists->inflight);
>>> +
>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>> +               struct intel_engine_cs *engine =
>>> +                       container_of(execlists, typeof(*engine), execlists);
>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>> +       }
>>>  }
>>>  
>>>  static inline void
>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>                         /* port0 completed, advanced to port1 */
>>>                         trace_ports(execlists, "completed", execlists->active);
>>>  
>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>
>> So this loses track if we preempt a dual-ELSP submission with a
>> single-ELSP submission (and never go back to dual).
>>
>
> Yes, good point.  You're right that if a dual-ELSP submission gets
> preempted by a single-ELSP submission "overload" will remain signaled
> until the first completion interrupt arrives (e.g. from the preempting
> submission).
>
>> If you move this to the end of the loop and check
>>
>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>> 	intel_gt_pm_active_end(engine->gt);
>>
>> so that it covers both preemption/promotion and completion.
>>
>
> That sounds reasonable.
>
>> However, that will fluctuate quite rapidly. (And runs the risk of
>> exceeding the sentinel.)
>>
>> An alternative approach would be to couple along
>> schedule_in/schedule_out
>>
>> atomic_set(overload, -1);
>>
>> __execlists_schedule_in:
>> 	if (!atomic_fetch_inc(overload)
>> 		intel_gt_pm_active_begin(engine->gt);
>> __execlists_schedule_out:
>> 	if (!atomic_dec_return(overload)
>> 		intel_gt_pm_active_end(engine->gt);
>>
>> which would mean we are overloaded as soon as we try to submit an
>> overlapping ELSP.
>>
>
> That sounds good to me too, and AFAICT would have roughly the same
> behavior as this metric except for the preemption corner case you
> mention above.  I'll try this and verify that I get approximately the
> same performance numbers.
>

This suggestion seems to lead to some minor regressions; I'm
investigating the issue.  Will send a v2 as soon as I have something
along the lines of what you suggested running with performance
equivalent to v1.
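
Just so we're on the same page, what I'm experimenting with is roughly
your suggestion above wrapped into helpers -- sketch only, assuming
execlists->overload is initialized to -1 as in your snippet:

static void overload_schedule_in(struct intel_engine_cs *engine)
{
	/* Second context in flight -> declare the engine overloaded. */
	if (!atomic_fetch_inc(&engine->execlists.overload))
		intel_gt_pm_active_begin(engine->gt);
}

static void overload_schedule_out(struct intel_engine_cs *engine)
{
	/* Back down to a single context in flight -> drop the hint. */
	if (!atomic_dec_return(&engine->execlists.overload))
		intel_gt_pm_active_end(engine->gt);
}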


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
  2020-03-17 23:59   ` Pandruvada, Srinivas
@ 2020-03-18 19:51     ` Francisco Jerez
  2020-03-18 20:10       ` Pandruvada, Srinivas
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-18 19:51 UTC (permalink / raw)
  To: Pandruvada, Srinivas, linux-pm, intel-gfx; +Cc: peterz, rjw



"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:

> On Tue, 2020-03-10 at 14:42 -0700, Francisco Jerez wrote:
>> This implements a simple variably low-pass-filtering governor in
>> control of the HWP MIN/MAX PERF range based on the previously
>> introduced get_vlp_target_range().  See "cpufreq: intel_pstate:
>> Implement VLP controller target P-state range estimation." for the
>> rationale.
>
> I just gave it a try on a pretty idle system with just systemd
> processes and usual background tasks with nomodeset.
>
> I see that the HWP min is getting changed between 4-8.  Why are you
> changing the HWP dynamic range even on an idle system running nowhere
> close to TDP?
>

The HWP request range is clamped to the frequency range specified by the
CPUFREQ policy and to the cpu->pstate.min_pstate bound.
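
Concretely, the clamping done by intel_pstate_adjust_pstate_range() in
the patch you quoted boils down to (sketch of just the relevant lines):

	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
	p_min = clamp_t(unsigned int, range[0], p0, p1); /* -> HWP_MIN_PERF */
	p_max = clamp_t(unsigned int, range[1], p0, p1); /* -> HWP_MAX_PERF */

so neither bound of the programmed HWP range can drop below the larger
of cpu->pstate.min_pstate and the ratio derived from the policy
minimum.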

If you see the HWP minimum fluctuating above that it's likely a sign of
your system not being completely idle -- If that's the case it's likely
to go away after you do:

 echo 0 > /sys/kernel/debug/pstate_snb/vlp_realtime_gain_pml

> Thanks,
> Srinivas
>
>
>> 
>> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
>> ---
>>  drivers/cpufreq/intel_pstate.c | 79
>> +++++++++++++++++++++++++++++++++-
>>  1 file changed, 77 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/cpufreq/intel_pstate.c
>> b/drivers/cpufreq/intel_pstate.c
>> index cecadfec8bc1..a01eed40d897 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -1905,6 +1905,20 @@ static void intel_pstate_reset_vlp(struct
>> cpudata *cpu)
>>  	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
>>  	vlp->target.p_base = 0;
>>  	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
>> +
>> +	if (hwp_active) {
>> +		const uint32_t p0 = max(cpu->pstate.min_pstate,
>> +					cpu->min_perf_ratio);
>> +		const uint32_t p1 = max_t(uint32_t, p0, cpu-
>> >max_perf_ratio);
>> +		const uint64_t hwp_req = (READ_ONCE(cpu-
>> >hwp_req_cached) &
>> +					  ~(HWP_MAX_PERF(~0L) |
>> +					    HWP_MIN_PERF(~0L) |
>> +					    HWP_DESIRED_PERF(~0L))) |
>> +					 HWP_MIN_PERF(p0) |
>> HWP_MAX_PERF(p1);
>> +
>> +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
>> +		cpu->hwp_req_cached = hwp_req;
>> +	}
>>  }
>>  
>>  /**
>> @@ -2222,6 +2236,46 @@ static void intel_pstate_adjust_pstate(struct
>> cpudata *cpu)
>>  		fp_toint(cpu->iowait_boost * 100));
>>  }
>>  
>> +static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
>> +					     const unsigned int
>> range[])
>> +{
>> +	const int from = cpu->hwp_req_cached;
>> +	unsigned int p0, p1, p_min, p_max;
>> +	struct sample *sample;
>> +	uint64_t hwp_req;
>> +
>> +	update_turbo_state();
>> +
>> +	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
>> +	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
>> +	p_min = clamp_t(unsigned int, range[0], p0, p1);
>> +	p_max = clamp_t(unsigned int, range[1], p0, p1);
>> +
>> +	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu->cpu);
>> +
>> +	hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
>> +		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
>> +		     HWP_DESIRED_PERF(~0L))) |
>> +		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 : p_min) |
>> +		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 : p_max);
>> +
>> +	if (hwp_req != cpu->hwp_req_cached) {
>> +		wrmsrl(MSR_HWP_REQUEST, hwp_req);
>> +		cpu->hwp_req_cached = hwp_req;
>> +	}
>> +
>> +	sample = &cpu->sample;
>> +	trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
>> +			    fp_toint(sample->busy_scaled),
>> +			    from,
>> +			    hwp_req,
>> +			    sample->mperf,
>> +			    sample->aperf,
>> +			    sample->tsc,
>> +			    get_avg_frequency(cpu),
>> +			    fp_toint(cpu->iowait_boost * 100));
>> +}
>> +
>>  static void intel_pstate_update_util(struct update_util_data *data,
>> u64 time,
>>  				     unsigned int flags)
>>  {
>> @@ -2260,6 +2314,22 @@ static void intel_pstate_update_util(struct
>> update_util_data *data, u64 time,
>>  		intel_pstate_adjust_pstate(cpu);
>>  }
>>  
>> +/**
>> + * Implementation of the cpufreq update_util hook based on the VLP
>> + * controller (see get_vlp_target_range()).
>> + */
>> +static void intel_pstate_update_util_hwp_vlp(struct update_util_data
>> *data,
>> +					     u64 time, unsigned int
>> flags)
>> +{
>> +	struct cpudata *cpu = container_of(data, struct cpudata,
>> update_util);
>> +
>> +	if (update_vlp_sample(cpu, time, flags)) {
>> +		const struct vlp_target_range *target =
>> +			get_vlp_target_range(cpu);
>> +		intel_pstate_adjust_pstate_range(cpu, target->value);
>> +	}
>> +}
>> +
>>  static struct pstate_funcs core_funcs = {
>>  	.get_max = core_get_max_pstate,
>>  	.get_max_physical = core_get_max_pstate_physical,
>> @@ -2389,6 +2459,9 @@ static int intel_pstate_init_cpu(unsigned int
>> cpunum)
>>  
>>  	intel_pstate_get_cpu_pstates(cpu);
>>  
>> +	if (pstate_funcs.update_util ==
>> intel_pstate_update_util_hwp_vlp)
>> +		intel_pstate_reset_vlp(cpu);
>> +
>>  	pr_debug("controlling: cpu %d\n", cpunum);
>>  
>>  	return 0;
>> @@ -2398,7 +2471,8 @@ static void
>> intel_pstate_set_update_util_hook(unsigned int cpu_num)
>>  {
>>  	struct cpudata *cpu = all_cpu_data[cpu_num];
>>  
>> -	if (hwp_active && !hwp_boost)
>> +	if (hwp_active && !hwp_boost &&
>> +	    pstate_funcs.update_util !=
>> intel_pstate_update_util_hwp_vlp)
>>  		return;
>>  
>>  	if (cpu->update_util_set)
>> @@ -2526,7 +2600,8 @@ static int intel_pstate_set_policy(struct
>> cpufreq_policy *policy)
>>  		 * was turned off, in that case we need to clear the
>>  		 * update util hook.
>>  		 */
>> -		if (!hwp_boost)
>> +		if (!hwp_boost && pstate_funcs.update_util !=
>> +				  intel_pstate_update_util_hwp_vlp)
>>  			intel_pstate_clear_update_util_hook(policy-
>> >cpu);
>>  		intel_pstate_hwp_set(policy->cpu);
>>  	}


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
  2020-03-18 19:51     ` Francisco Jerez
@ 2020-03-18 20:10       ` Pandruvada, Srinivas
  2020-03-18 20:22         ` Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Pandruvada, Srinivas @ 2020-03-18 20:10 UTC (permalink / raw)
  To: linux-pm, currojerez, intel-gfx; +Cc: peterz, rjw

On Wed, 2020-03-18 at 12:51 -0700, Francisco Jerez wrote:
> "Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:
> 
> > On Tue, 2020-03-10 at 14:42 -0700, Francisco Jerez wrote:
> > > This implements a simple variably low-pass-filtering governor in
> > > control of the HWP MIN/MAX PERF range based on the previously
> > > introduced get_vlp_target_range().  See "cpufreq: intel_pstate:
> > > Implement VLP controller target P-state range estimation." for
> > > the
> > > rationale.
> > 
> > I just gave it a try on a pretty idle system with just systemd
> > processes and usual background tasks with nomodeset.
> > 
> > I see that the HWP min is getting changed between 4-8.  Why are you
> > changing the HWP dynamic range even on an idle system running
> > nowhere close to TDP?
> > 
> 
> The HWP request range is clamped to the frequency range specified by
> the
> CPUFREQ policy and to the cpu->pstate.min_pstate bound.
> 
> If you see the HWP minimum fluctuating above that it's likely a sign
> of
> your system not being completely idle -- If that's the case it's
> likely
> to go away after you do:
> 
>  echo 0 > /sys/kernel/debug/pstate_snb/vlp_realtime_gain_pml
> 
The objective, I thought, was to improve performance of GPU workloads
limited by TDP, since P-states ramping up results in less power for
the GPU to complete a task.

HWP takes decisions based not just on the load on a CPU but on several
other factors like total SoC power and scalability.  We don't want to
disturb the HWP algorithms when there are no TDP limitations.  If
writing 0 causes this behavior then that should be the default.

Thanks,
Srinivas





> > Thanks,
> > Srinivas
> > 
> > 
> > > Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> > > ---
> > >  drivers/cpufreq/intel_pstate.c | 79
> > > +++++++++++++++++++++++++++++++++-
> > >  1 file changed, 77 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/cpufreq/intel_pstate.c
> > > b/drivers/cpufreq/intel_pstate.c
> > > index cecadfec8bc1..a01eed40d897 100644
> > > --- a/drivers/cpufreq/intel_pstate.c
> > > +++ b/drivers/cpufreq/intel_pstate.c
> > > @@ -1905,6 +1905,20 @@ static void intel_pstate_reset_vlp(struct
> > > cpudata *cpu)
> > >  	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
> > >  	vlp->target.p_base = 0;
> > >  	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
> > > +
> > > +	if (hwp_active) {
> > > +		const uint32_t p0 = max(cpu->pstate.min_pstate,
> > > +					cpu->min_perf_ratio);
> > > +		const uint32_t p1 = max_t(uint32_t, p0, cpu-
> > > > max_perf_ratio);
> > > +		const uint64_t hwp_req = (READ_ONCE(cpu-
> > > > hwp_req_cached) &
> > > +					  ~(HWP_MAX_PERF(~0L) |
> > > +					    HWP_MIN_PERF(~0L) |
> > > +					    HWP_DESIRED_PERF(~0L))) |
> > > +					 HWP_MIN_PERF(p0) |
> > > HWP_MAX_PERF(p1);
> > > +
> > > +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
> > > +		cpu->hwp_req_cached = hwp_req;
> > > +	}
> > >  }
> > >  
> > >  /**
> > > @@ -2222,6 +2236,46 @@ static void
> > > intel_pstate_adjust_pstate(struct
> > > cpudata *cpu)
> > >  		fp_toint(cpu->iowait_boost * 100));
> > >  }
> > >  
> > > +static void intel_pstate_adjust_pstate_range(struct cpudata
> > > *cpu,
> > > +					     const unsigned int
> > > range[])
> > > +{
> > > +	const int from = cpu->hwp_req_cached;
> > > +	unsigned int p0, p1, p_min, p_max;
> > > +	struct sample *sample;
> > > +	uint64_t hwp_req;
> > > +
> > > +	update_turbo_state();
> > > +
> > > +	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
> > > +	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
> > > +	p_min = clamp_t(unsigned int, range[0], p0, p1);
> > > +	p_max = clamp_t(unsigned int, range[1], p0, p1);
> > > +
> > > +	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu->cpu);
> > > +
> > > +	hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
> > > +		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
> > > +		     HWP_DESIRED_PERF(~0L))) |
> > > +		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 : p_min) |
> > > +		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 : p_max);
> > > +
> > > +	if (hwp_req != cpu->hwp_req_cached) {
> > > +		wrmsrl(MSR_HWP_REQUEST, hwp_req);
> > > +		cpu->hwp_req_cached = hwp_req;
> > > +	}
> > > +
> > > +	sample = &cpu->sample;
> > > +	trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
> > > +			    fp_toint(sample->busy_scaled),
> > > +			    from,
> > > +			    hwp_req,
> > > +			    sample->mperf,
> > > +			    sample->aperf,
> > > +			    sample->tsc,
> > > +			    get_avg_frequency(cpu),
> > > +			    fp_toint(cpu->iowait_boost * 100));
> > > +}
> > > +
> > >  static void intel_pstate_update_util(struct update_util_data
> > > *data,
> > > u64 time,
> > >  				     unsigned int flags)
> > >  {
> > > @@ -2260,6 +2314,22 @@ static void
> > > intel_pstate_update_util(struct
> > > update_util_data *data, u64 time,
> > >  		intel_pstate_adjust_pstate(cpu);
> > >  }
> > >  
> > > +/**
> > > + * Implementation of the cpufreq update_util hook based on the
> > > VLP
> > > + * controller (see get_vlp_target_range()).
> > > + */
> > > +static void intel_pstate_update_util_hwp_vlp(struct
> > > update_util_data
> > > *data,
> > > +					     u64 time, unsigned int
> > > flags)
> > > +{
> > > +	struct cpudata *cpu = container_of(data, struct cpudata,
> > > update_util);
> > > +
> > > +	if (update_vlp_sample(cpu, time, flags)) {
> > > +		const struct vlp_target_range *target =
> > > +			get_vlp_target_range(cpu);
> > > +		intel_pstate_adjust_pstate_range(cpu, target->value);
> > > +	}
> > > +}
> > > +
> > >  static struct pstate_funcs core_funcs = {
> > >  	.get_max = core_get_max_pstate,
> > >  	.get_max_physical = core_get_max_pstate_physical,
> > > @@ -2389,6 +2459,9 @@ static int intel_pstate_init_cpu(unsigned
> > > int
> > > cpunum)
> > >  
> > >  	intel_pstate_get_cpu_pstates(cpu);
> > >  
> > > +	if (pstate_funcs.update_util ==
> > > intel_pstate_update_util_hwp_vlp)
> > > +		intel_pstate_reset_vlp(cpu);
> > > +
> > >  	pr_debug("controlling: cpu %d\n", cpunum);
> > >  
> > >  	return 0;
> > > @@ -2398,7 +2471,8 @@ static void
> > > intel_pstate_set_update_util_hook(unsigned int cpu_num)
> > >  {
> > >  	struct cpudata *cpu = all_cpu_data[cpu_num];
> > >  
> > > -	if (hwp_active && !hwp_boost)
> > > +	if (hwp_active && !hwp_boost &&
> > > +	    pstate_funcs.update_util !=
> > > intel_pstate_update_util_hwp_vlp)
> > >  		return;
> > >  
> > >  	if (cpu->update_util_set)
> > > @@ -2526,7 +2600,8 @@ static int intel_pstate_set_policy(struct
> > > cpufreq_policy *policy)
> > >  		 * was turned off, in that case we need to clear the
> > >  		 * update util hook.
> > >  		 */
> > > -		if (!hwp_boost)
> > > +		if (!hwp_boost && pstate_funcs.update_util !=
> > > +				  intel_pstate_update_util_hwp_vlp)
> > >  			intel_pstate_clear_update_util_hook(policy->cpu);
> > >  		intel_pstate_hwp_set(policy->cpu);
> > >  	}
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
  2020-03-18 20:10       ` Pandruvada, Srinivas
@ 2020-03-18 20:22         ` Francisco Jerez
  2020-03-23 20:13           ` Pandruvada, Srinivas
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-18 20:22 UTC (permalink / raw)
  To: Pandruvada, Srinivas, linux-pm, intel-gfx; +Cc: peterz, rjw


[-- Attachment #1.1.1: Type: text/plain, Size: 7881 bytes --]

"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:

> On Wed, 2020-03-18 at 12:51 -0700, Francisco Jerez wrote:
>> "Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:
>> 
>> > On Tue, 2020-03-10 at 14:42 -0700, Francisco Jerez wrote:
>> > > This implements a simple variably low-pass-filtering governor in
>> > > control of the HWP MIN/MAX PERF range based on the previously
>> > > introduced get_vlp_target_range().  See "cpufreq: intel_pstate:
>> > > Implement VLP controller target P-state range estimation." for
>> > > the
>> > > rationale.
>> > 
>> > I just gave it a try on a pretty idle system with just systemd
>> > processes and the usual background tasks with nomodeset.
>> > 
>> > I see that the HWP min is getting changed between 4 and 8. Why are we
>> > changing the HWP dynamic range even on an idle system running nowhere
>> > close to TDP?
>> > 
>> 
>> The HWP request range is clamped to the frequency range specified by
>> the CPUFREQ policy and to the cpu->pstate.min_pstate bound.
>> 
>> If you see the HWP minimum fluctuating above that, it's likely a sign
>> of your system not being completely idle; if that's the case it's
>> likely to go away after you do:
>> 
>>  echo 0 > /sys/kernel/debug/pstate_snb/vlp_realtime_gain_pml
>> 
> The objective, as I understood it, was to improve the performance of GPU
> workloads limited by TDP, where P-states ramping up leave less power for
> the GPU to complete a task.
> 
> HWP takes its decisions based not just on the load of a CPU but on
> several other factors like total SoC power and scalability. We don't
> want to disturb the HWP algorithms when there are no TDP limitations.
> If writing 0 causes this behavior, then that should be the default.
>

The heuristic disabled by that debugfs file is there to avoid
regressions in latency-sensitive workloads, as you can probably gather
from the comments.  However, ISTR those regressions were specific to
non-HWP systems, so I wouldn't mind disabling it for the moment (or
punting it to the non-HWP series if you like).  But first I need to
verify that there are no performance regressions on HWP systems after
changing that.  Can you confirm that the debugfs write above prevents
the behavior you'd like to avoid?

> Thanks,
> Srinivas
>
>
>
>
>
>> > Thanks,
>> > Srinivas
>> > 
>> > 
>> > > Signed-off-by: Francisco Jerez <currojerez@riseup.net>
>> > > ---
>> > >  drivers/cpufreq/intel_pstate.c | 79
>> > > +++++++++++++++++++++++++++++++++-
>> > >  1 file changed, 77 insertions(+), 2 deletions(-)
>> > > 
>> > > diff --git a/drivers/cpufreq/intel_pstate.c
>> > > b/drivers/cpufreq/intel_pstate.c
>> > > index cecadfec8bc1..a01eed40d897 100644
>> > > --- a/drivers/cpufreq/intel_pstate.c
>> > > +++ b/drivers/cpufreq/intel_pstate.c
>> > > @@ -1905,6 +1905,20 @@ static void intel_pstate_reset_vlp(struct
>> > > cpudata *cpu)
>> > >  	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
>> > >  	vlp->target.p_base = 0;
>> > >  	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
>> > > +
>> > > +	if (hwp_active) {
>> > > +		const uint32_t p0 = max(cpu->pstate.min_pstate,
>> > > +					cpu->min_perf_ratio);
>> > > +		const uint32_t p1 = max_t(uint32_t, p0, cpu->max_perf_ratio);
>> > > +		const uint64_t hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
>> > > +					  ~(HWP_MAX_PERF(~0L) |
>> > > +					    HWP_MIN_PERF(~0L) |
>> > > +					    HWP_DESIRED_PERF(~0L))) |
>> > > +					 HWP_MIN_PERF(p0) |
>> > > HWP_MAX_PERF(p1);
>> > > +
>> > > +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
>> > > +		cpu->hwp_req_cached = hwp_req;
>> > > +	}
>> > >  }
>> > >  
>> > >  /**
>> > > @@ -2222,6 +2236,46 @@ static void
>> > > intel_pstate_adjust_pstate(struct
>> > > cpudata *cpu)
>> > >  		fp_toint(cpu->iowait_boost * 100));
>> > >  }
>> > >  
>> > > +static void intel_pstate_adjust_pstate_range(struct cpudata
>> > > *cpu,
>> > > +					     const unsigned int
>> > > range[])
>> > > +{
>> > > +	const int from = cpu->hwp_req_cached;
>> > > +	unsigned int p0, p1, p_min, p_max;
>> > > +	struct sample *sample;
>> > > +	uint64_t hwp_req;
>> > > +
>> > > +	update_turbo_state();
>> > > +
>> > > +	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
>> > > +	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
>> > > +	p_min = clamp_t(unsigned int, range[0], p0, p1);
>> > > +	p_max = clamp_t(unsigned int, range[1], p0, p1);
>> > > +
>> > > +	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu->cpu);
>> > > +
>> > > +	hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
>> > > +		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
>> > > +		     HWP_DESIRED_PERF(~0L))) |
>> > > +		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 : p_min) |
>> > > +		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 : p_max);
>> > > +
>> > > +	if (hwp_req != cpu->hwp_req_cached) {
>> > > +		wrmsrl(MSR_HWP_REQUEST, hwp_req);
>> > > +		cpu->hwp_req_cached = hwp_req;
>> > > +	}
>> > > +
>> > > +	sample = &cpu->sample;
>> > > +	trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
>> > > +			    fp_toint(sample->busy_scaled),
>> > > +			    from,
>> > > +			    hwp_req,
>> > > +			    sample->mperf,
>> > > +			    sample->aperf,
>> > > +			    sample->tsc,
>> > > +			    get_avg_frequency(cpu),
>> > > +			    fp_toint(cpu->iowait_boost * 100));
>> > > +}
>> > > +
>> > >  static void intel_pstate_update_util(struct update_util_data
>> > > *data,
>> > > u64 time,
>> > >  				     unsigned int flags)
>> > >  {
>> > > @@ -2260,6 +2314,22 @@ static void
>> > > intel_pstate_update_util(struct
>> > > update_util_data *data, u64 time,
>> > >  		intel_pstate_adjust_pstate(cpu);
>> > >  }
>> > >  
>> > > +/**
>> > > + * Implementation of the cpufreq update_util hook based on the
>> > > VLP
>> > > + * controller (see get_vlp_target_range()).
>> > > + */
>> > > +static void intel_pstate_update_util_hwp_vlp(struct
>> > > update_util_data
>> > > *data,
>> > > +					     u64 time, unsigned int
>> > > flags)
>> > > +{
>> > > +	struct cpudata *cpu = container_of(data, struct cpudata,
>> > > update_util);
>> > > +
>> > > +	if (update_vlp_sample(cpu, time, flags)) {
>> > > +		const struct vlp_target_range *target =
>> > > +			get_vlp_target_range(cpu);
>> > > +		intel_pstate_adjust_pstate_range(cpu, target->value);
>> > > +	}
>> > > +}
>> > > +
>> > >  static struct pstate_funcs core_funcs = {
>> > >  	.get_max = core_get_max_pstate,
>> > >  	.get_max_physical = core_get_max_pstate_physical,
>> > > @@ -2389,6 +2459,9 @@ static int intel_pstate_init_cpu(unsigned
>> > > int
>> > > cpunum)
>> > >  
>> > >  	intel_pstate_get_cpu_pstates(cpu);
>> > >  
>> > > +	if (pstate_funcs.update_util ==
>> > > intel_pstate_update_util_hwp_vlp)
>> > > +		intel_pstate_reset_vlp(cpu);
>> > > +
>> > >  	pr_debug("controlling: cpu %d\n", cpunum);
>> > >  
>> > >  	return 0;
>> > > @@ -2398,7 +2471,8 @@ static void
>> > > intel_pstate_set_update_util_hook(unsigned int cpu_num)
>> > >  {
>> > >  	struct cpudata *cpu = all_cpu_data[cpu_num];
>> > >  
>> > > -	if (hwp_active && !hwp_boost)
>> > > +	if (hwp_active && !hwp_boost &&
>> > > +	    pstate_funcs.update_util !=
>> > > intel_pstate_update_util_hwp_vlp)
>> > >  		return;
>> > >  
>> > >  	if (cpu->update_util_set)
>> > > @@ -2526,7 +2600,8 @@ static int intel_pstate_set_policy(struct
>> > > cpufreq_policy *policy)
>> > >  		 * was turned off, in that case we need to clear the
>> > >  		 * update util hook.
>> > >  		 */
>> > > -		if (!hwp_boost)
>> > > +		if (!hwp_boost && pstate_funcs.update_util !=
>> > > +				  intel_pstate_update_util_hwp_vlp)
>> > >  			intel_pstate_clear_update_util_hook(policy-
>> > > > cpu);
>> > >  		intel_pstate_hwp_set(policy->cpu);
>> > >  	}

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCHv2 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
  2020-03-11 19:23       ` [Intel-gfx] [PATCHv2 " Francisco Jerez
@ 2020-03-19 10:25         ` Rafael J. Wysocki
  0 siblings, 0 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2020-03-19 10:25 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: Peter Zijlstra, intel-gfx, Pandruvada, Srinivas, linux-pm

On Wednesday, March 11, 2020 8:23:19 PM CET Francisco Jerez wrote:
> The purpose of this PM QoS limit is to give device drivers additional
> control over the latency/energy efficiency trade-off made by the PM
> subsystem (particularly the CPUFREQ governor).  It allows device
> drivers to set a lower bound on the response latency of PM (defined as
> the time it takes from wake-up to the CPU reaching a certain
> steady-state level of performance [e.g. the nominal frequency] in
> response to a step-function load).  It reports to PM the minimum
> ramp-up latency considered of use to the application, and explicitly
> requests PM to filter out oscillations faster than the specified
> frequency.  It is somewhat complementary to the current
> CPU_DMA_LATENCY PM QoS class which can be understood as specifying an
> upper latency bound on the CPU wake-up time, instead of a lower bound
> on the CPU frequency ramp-up time.
> 
> Note that even though this provides a latency constraint it's
> represented as its reciprocal in Hz units for computational efficiency
> (since it would take a 64-bit division to compute the number of cycles
> elapsed from a time increment in nanoseconds and a time bound, while a
> frequency can simply be multiplied with the time increment).
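
For illustration, the practical difference amounts to something like the
following (hypothetical helpers, not part of the patch; the second form
matches the multiply-and-shift approximation used later in the series,
with 2^30 standing in for NSEC_PER_SEC):

	/* Latency bound in ns: computing the elapsed fraction of the
	 * bound takes a 64-bit division per sample. */
	static u64 elapsed_frac_latency(u64 delta_ns, u64 latency_ns)
	{
		return div64_u64(delta_ns << 16, latency_ns);
	}

	/* Reciprocal bound in Hz: a multiply and a shift suffice. */
	static u64 elapsed_frac_frequency(u64 delta_ns, u32 freq_hz)
	{
		return ((u64)freq_hz * delta_ns) >> (30 - 16);
	}
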
> 
> This implements a MAX constraint so that the strictest (highest
> response frequency) request is honored.  This means that PM won't
> provide any guarantee that frequencies greater than the specified
> bound will be filtered, since that might be incompatible with the
> constraints specified by another more latency-sensitive application (A
> more fine-grained result could be achieved with a scheduling-based
> interface).  The default value needs to be equal to zero (best effort)
> for it to behave as identity of the MAX operation.
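
(So, for instance, with two active requests of 2 Hz and 10 Hz the
effective limit returned by cpu_response_frequency_qos_limit() would be
10 Hz, and with no requests at all it falls back to the 0 "best effort"
default.)
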
> 
> v2: Drop wake_up_all_idle_cpus() call from
>     cpu_response_frequency_qos_apply() (Peter).
> 
> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> ---
>  include/linux/pm_qos.h       |   9 +++
>  include/trace/events/power.h |  33 +++++----
>  kernel/power/qos.c           | 138 ++++++++++++++++++++++++++++++++++-

First, the documentation (Documentation/power/pm_qos_interface.rst) needs to be
updated too to cover the new QoS category.
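
It would also help to spell out the expected call sequence from a device
driver's point of view, roughly (illustrative values; function names as
in this v2, see the renaming suggested below):

	static struct pm_qos_request gpu_qos_req;

	/* Device has become the bottleneck: ask CPUFREQ to filter out
	 * frequency oscillations faster than roughly 2 Hz. */
	cpu_response_frequency_qos_add_request(&gpu_qos_req, 2);

	/* Adjust the constraint as conditions change... */
	cpu_response_frequency_qos_update_request(&gpu_qos_req, 4);

	/* ...and drop it once the bottleneck goes away. */
	cpu_response_frequency_qos_remove_request(&gpu_qos_req);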

>  3 files changed, 162 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
> index 4a69d4af3ff8..b522e2194c05 100644
> --- a/include/linux/pm_qos.h
> +++ b/include/linux/pm_qos.h
> @@ -28,6 +28,7 @@ enum pm_qos_flags_status {
>  #define PM_QOS_LATENCY_ANY_NS	((s64)PM_QOS_LATENCY_ANY * NSEC_PER_USEC)
>  
>  #define PM_QOS_CPU_LATENCY_DEFAULT_VALUE	(2000 * USEC_PER_SEC)
> +#define PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE 0

I would call this PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE and all of the
API pieces accordingly.

>  #define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE	PM_QOS_LATENCY_ANY
>  #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT	PM_QOS_LATENCY_ANY
>  #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS	PM_QOS_LATENCY_ANY_NS
> @@ -162,6 +163,14 @@ static inline void cpu_latency_qos_update_request(struct pm_qos_request *req,
>  static inline void cpu_latency_qos_remove_request(struct pm_qos_request *req) {}
>  #endif
>  
> +s32 cpu_response_frequency_qos_limit(void);

For example

cpu_scaling_response_qos_limit()

> +bool cpu_response_frequency_qos_request_active(struct pm_qos_request *req);

cpu_scaling_response_qos_request_active()

and so on.

> +void cpu_response_frequency_qos_add_request(struct pm_qos_request *req,
> +					    s32 value);
> +void cpu_response_frequency_qos_update_request(struct pm_qos_request *req,
> +					       s32 new_value);
> +void cpu_response_frequency_qos_remove_request(struct pm_qos_request *req);
> +
>  #ifdef CONFIG_PM
>  enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask);
>  enum pm_qos_flags_status dev_pm_qos_flags(struct device *dev, s32 mask);
> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> index af5018aa9517..7e4b52e8ca3a 100644
> --- a/include/trace/events/power.h
> +++ b/include/trace/events/power.h
> @@ -359,45 +359,48 @@ DEFINE_EVENT(power_domain, power_domain_target,
>  );
>  
>  /*
> - * CPU latency QoS events used for global CPU latency QoS list updates
> + * CPU latency/response frequency QoS events used for global CPU PM
> + * QoS list updates.
>   */
> -DECLARE_EVENT_CLASS(cpu_latency_qos_request,
> +DECLARE_EVENT_CLASS(pm_qos_request,
>  
> -	TP_PROTO(s32 value),
> +	TP_PROTO(const char *name, s32 value),
>  
> -	TP_ARGS(value),
> +	TP_ARGS(name, value),
>  
>  	TP_STRUCT__entry(
> +		__string(name,			 name		)
>  		__field( s32,                    value          )
>  	),
>  
>  	TP_fast_assign(
> +		__assign_str(name, name);
>  		__entry->value = value;
>  	),
>  
> -	TP_printk("CPU_DMA_LATENCY value=%d",
> -		  __entry->value)
> +	TP_printk("pm_qos_class=%s value=%d",
> +		  __get_str(name), __entry->value)
>  );
>  
> -DEFINE_EVENT(cpu_latency_qos_request, pm_qos_add_request,
> +DEFINE_EVENT(pm_qos_request, pm_qos_add_request,
>  
> -	TP_PROTO(s32 value),
> +	TP_PROTO(const char *name, s32 value),
>  
> -	TP_ARGS(value)
> +	TP_ARGS(name, value)
>  );
>  
> -DEFINE_EVENT(cpu_latency_qos_request, pm_qos_update_request,
> +DEFINE_EVENT(pm_qos_request, pm_qos_update_request,
>  
> -	TP_PROTO(s32 value),
> +	TP_PROTO(const char *name, s32 value),
>  
> -	TP_ARGS(value)
> +	TP_ARGS(name, value)
>  );
>  
> -DEFINE_EVENT(cpu_latency_qos_request, pm_qos_remove_request,
> +DEFINE_EVENT(pm_qos_request, pm_qos_remove_request,
>  
> -	TP_PROTO(s32 value),
> +	TP_PROTO(const char *name, s32 value),
>  
> -	TP_ARGS(value)
> +	TP_ARGS(name, value)
>  );
>  
>  /*
> diff --git a/kernel/power/qos.c b/kernel/power/qos.c
> index 32927682bcc4..49f140aa5aa1 100644
> --- a/kernel/power/qos.c
> +++ b/kernel/power/qos.c
> @@ -271,7 +271,7 @@ void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value)
>  		return;
>  	}
>  
> -	trace_pm_qos_add_request(value);
> +	trace_pm_qos_add_request("CPU_DMA_LATENCY", value);
>  
>  	req->qos = &cpu_latency_constraints;
>  	cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value);
> @@ -297,7 +297,7 @@ void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value)
>  		return;
>  	}
>  
> -	trace_pm_qos_update_request(new_value);
> +	trace_pm_qos_update_request("CPU_DMA_LATENCY", new_value);
>  
>  	if (new_value == req->node.prio)
>  		return;
> @@ -323,7 +323,7 @@ void cpu_latency_qos_remove_request(struct pm_qos_request *req)
>  		return;
>  	}
>  
> -	trace_pm_qos_remove_request(PM_QOS_DEFAULT_VALUE);
> +	trace_pm_qos_remove_request("CPU_DMA_LATENCY", PM_QOS_DEFAULT_VALUE);
>  
>  	cpu_latency_qos_apply(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
>  	memset(req, 0, sizeof(*req));
> @@ -424,6 +424,138 @@ static int __init cpu_latency_qos_init(void)
>  late_initcall(cpu_latency_qos_init);
>  #endif /* CONFIG_CPU_IDLE */
>  
> +/* Definitions related to the CPU response frequency QoS. */
> +
> +static struct pm_qos_constraints cpu_response_frequency_constraints = {
> +	.list = PLIST_HEAD_INIT(cpu_response_frequency_constraints.list),
> +	.target_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
> +	.default_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
> +	.no_constraint_value = PM_QOS_CPU_RESPONSE_FREQUENCY_DEFAULT_VALUE,
> +	.type = PM_QOS_MAX,
> +};
> +
> +/**
> + * cpu_response_frequency_qos_limit - Return current system-wide CPU
> + *				      response frequency QoS limit.
> + */
> +s32 cpu_response_frequency_qos_limit(void)
> +{
> +	return pm_qos_read_value(&cpu_response_frequency_constraints);
> +}
> +EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_limit);
> +
> +/**
> + * cpu_response_frequency_qos_request_active - Check the given PM QoS request.
> + * @req: PM QoS request to check.
> + *
> + * Return: 'true' if @req has been added to the CPU response frequency
> + * QoS list, 'false' otherwise.
> + */
> +bool cpu_response_frequency_qos_request_active(struct pm_qos_request *req)
> +{
> +	return req->qos == &cpu_response_frequency_constraints;
> +}
> +EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_request_active);
> +
> +static void cpu_response_frequency_qos_apply(struct pm_qos_request *req,
> +					     enum pm_qos_req_action action,
> +					     s32 value)
> +{
> +	pm_qos_update_target(req->qos, &req->node, action, value);
> +}
> +
> +/**
> + * cpu_response_frequency_qos_add_request - Add new CPU response
> + *					    frequency QoS request.
> + * @req: Pointer to a preallocated handle.
> + * @value: Requested constraint value.
> + *
> + * Use @value to initialize the request handle pointed to by @req,
> + * insert it as a new entry to the CPU response frequency QoS list and
> + * recompute the effective QoS constraint for that list.
> + *
> + * Callers need to save the handle for later use in updates and removal of the
> + * QoS request represented by it.
> + */
> +void cpu_response_frequency_qos_add_request(struct pm_qos_request *req,
> +					    s32 value)
> +{
> +	if (!req)
> +		return;
> +
> +	if (cpu_response_frequency_qos_request_active(req)) {
> +		WARN(1, KERN_ERR "%s called for already added request\n",
> +		     __func__);
> +		return;
> +	}
> +
> +	trace_pm_qos_add_request("CPU_RESPONSE_FREQUENCY", value);
> +
> +	req->qos = &cpu_response_frequency_constraints;
> +	cpu_response_frequency_qos_apply(req, PM_QOS_ADD_REQ, value);
> +}
> +EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_add_request);
> +
> +/**
> + * cpu_response_frequency_qos_update_request - Modify existing CPU
> + *					       response frequency QoS
> + *					       request.
> + * @req : QoS request to update.
> + * @new_value: New requested constraint value.
> + *
> + * Use @new_value to update the QoS request represented by @req in the
> + * CPU response frequency QoS list along with updating the effective
> + * constraint value for that list.
> + */
> +void cpu_response_frequency_qos_update_request(struct pm_qos_request *req,
> +					       s32 new_value)
> +{
> +	if (!req)
> +		return;
> +
> +	if (!cpu_response_frequency_qos_request_active(req)) {
> +		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
> +		return;
> +	}
> +
> +	trace_pm_qos_update_request("CPU_RESPONSE_FREQUENCY", new_value);
> +
> +	if (new_value == req->node.prio)
> +		return;
> +
> +	cpu_response_frequency_qos_apply(req, PM_QOS_UPDATE_REQ, new_value);
> +}
> +EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_update_request);
> +
> +/**
> + * cpu_response_frequency_qos_remove_request - Remove existing CPU
> + *					       response frequency QoS
> + *					       request.
> + * @req: QoS request to remove.
> + *
> + * Remove the CPU response frequency QoS request represented by @req
> + * from the CPU response frequency QoS list along with updating the
> + * effective constraint value for that list.
> + */
> +void cpu_response_frequency_qos_remove_request(struct pm_qos_request *req)
> +{
> +	if (!req)
> +		return;
> +
> +	if (!cpu_response_frequency_qos_request_active(req)) {
> +		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
> +		return;
> +	}
> +
> +	trace_pm_qos_remove_request("CPU_RESPONSE_FREQUENCY",
> +				    PM_QOS_DEFAULT_VALUE);
> +
> +	cpu_response_frequency_qos_apply(req, PM_QOS_REMOVE_REQ,
> +					 PM_QOS_DEFAULT_VALUE);
> +	memset(req, 0, sizeof(*req));
> +}
> +EXPORT_SYMBOL_GPL(cpu_response_frequency_qos_remove_request);
> +
>  /* Definitions related to the frequency QoS below. */
>  
>  /**
> 




_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
@ 2020-03-19 10:45   ` Rafael J. Wysocki
  0 siblings, 0 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2020-03-19 10:45 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: Peter Zijlstra, intel-gfx, Pandruvada, Srinivas, linux-pm

On Tuesday, March 10, 2020 10:41:57 PM CET Francisco Jerez wrote:
> This reverts commit c4f3f70cacba2fa19545389a12d09b606d2ad1cf.  A
> future commit will introduce a new update_util implementation, so the
> pstate_funcs table entry is going to be useful.

This basically means that you want to introduce a new scaling algorithm.

In my view that needs to be exposed via scaling_governor so users can
switch over between this and the existing ones (powersave and performance).

That would require the cpufreq core to be updated somewhat to recognize
an additional CPUFREQ_POLICY_ value, but that should be perfectly doable.
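
The driver side could then simply key the update_util selection off the
policy; a rough sketch (CPUFREQ_POLICY_VLP is hypothetical and assumes
the policy type is made visible to intel_pstate_set_update_util_hook()):

	util_hook = policy_type == CPUFREQ_POLICY_VLP ?
		    intel_pstate_update_util_hwp_vlp :
		    pstate_funcs.update_util;
	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util, util_hook);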

And ->

> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> ---
>  drivers/cpufreq/intel_pstate.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 7fa869004cf0..8cb5bf419b40 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -277,6 +277,7 @@ static struct cpudata **all_cpu_data;
>   * @get_scaling:	Callback to get frequency scaling factor
>   * @get_val:		Callback to convert P state to actual MSR write value
>   * @get_vid:		Callback to get VID data for Atom platforms
> + * @update_util:	Active mode utilization update callback.
>   *
>   * Core and Atom CPU models have different way to get P State limits. This
>   * structure is used to store those callbacks.
> @@ -290,6 +291,8 @@ struct pstate_funcs {
>  	int (*get_aperf_mperf_shift)(void);
>  	u64 (*get_val)(struct cpudata*, int pstate);
>  	void (*get_vid)(struct cpudata *);
> +	void (*update_util)(struct update_util_data *data, u64 time,
> +			    unsigned int flags);
>  };
>  
>  static struct pstate_funcs pstate_funcs __read_mostly;
> @@ -1877,6 +1880,7 @@ static struct pstate_funcs core_funcs = {
>  	.get_turbo = core_get_turbo_pstate,
>  	.get_scaling = core_get_scaling,
>  	.get_val = core_get_val,
> +	.update_util = intel_pstate_update_util,
>  };
>  
>  static const struct pstate_funcs silvermont_funcs = {
> @@ -1887,6 +1891,7 @@ static const struct pstate_funcs silvermont_funcs = {
>  	.get_val = atom_get_val,
>  	.get_scaling = silvermont_get_scaling,
>  	.get_vid = atom_get_vid,
> +	.update_util = intel_pstate_update_util,
>  };
>  
>  static const struct pstate_funcs airmont_funcs = {
> @@ -1897,6 +1902,7 @@ static const struct pstate_funcs airmont_funcs = {
>  	.get_val = atom_get_val,
>  	.get_scaling = airmont_get_scaling,
>  	.get_vid = atom_get_vid,
> +	.update_util = intel_pstate_update_util,
>  };
>  
>  static const struct pstate_funcs knl_funcs = {
> @@ -1907,6 +1913,7 @@ static const struct pstate_funcs knl_funcs = {
>  	.get_aperf_mperf_shift = knl_get_aperf_mperf_shift,
>  	.get_scaling = core_get_scaling,
>  	.get_val = core_get_val,
> +	.update_util = intel_pstate_update_util,
>  };
>  
>  #define ICPU(model, policy) \
> @@ -2013,9 +2020,7 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
>  	/* Prevent intel_pstate_update_util() from using stale data. */
>  	cpu->sample.time = 0;
>  	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
> -				     (hwp_active ?
> -				      intel_pstate_update_util_hwp :
> -				      intel_pstate_update_util));

-> it should be possible to extend this code to install an update_util matching
the scaling algo chosen by the user.

> +				     pstate_funcs.update_util);
>  	cpu->update_util_set = true;
>  }
>  
> @@ -2584,6 +2589,7 @@ static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
>  	pstate_funcs.get_scaling = funcs->get_scaling;
>  	pstate_funcs.get_val   = funcs->get_val;
>  	pstate_funcs.get_vid   = funcs->get_vid;
> +	pstate_funcs.update_util = funcs->update_util;
>  	pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
>  }
>  
> @@ -2750,8 +2756,11 @@ static int __init intel_pstate_init(void)
>  	id = x86_match_cpu(hwp_support_ids);
>  	if (id) {
>  		copy_cpu_funcs(&core_funcs);
> -		if (!no_hwp) {
> +		if (no_hwp) {
> +			pstate_funcs.update_util = intel_pstate_update_util;
> +		} else {
>  			hwp_active++;
> +			pstate_funcs.update_util = intel_pstate_update_util_hwp;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
>  			goto hwp_cpu_matched;
> 




_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation.
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation Francisco Jerez
@ 2020-03-19 11:06   ` Rafael J. Wysocki
  0 siblings, 0 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2020-03-19 11:06 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: Peter Zijlstra, intel-gfx, Pandruvada, Srinivas, linux-pm

On Tuesday, March 10, 2020 10:41:58 PM CET Francisco Jerez wrote:
> The goal of the helper code introduced here is to compute two
> informational data structures: struct vlp_input_stats aggregating
> various scheduling and PM statistics gathered in every call of the
> update_util() hook, and struct vlp_status_sample which contains status
> information derived from the former indicating whether the system is
> likely to have an IO or CPU bottleneck.  This will be used as the main
> heuristic input by the new variably low-pass filtering controller (AKA
> VLP)

I'm not sure how widely used this is.

It would be good to provide a pointer to a definition of it where all of
the maths is described and the foundation of it is explained.  Alternatively,
document it in the kernel source.
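
For reference, the core of the maths as implemented later in the series
(patches 05/06) appears to be a single-pole low-pass filter of the
estimated target P-state:

	p_base(t) = p_tgt(t) + alpha * (p_base(t-1) - p_tgt(t))
	alpha ~= 2^(-f_avg * dt)

where f_avg is the averaging frequency in Hz (last_response_frequency_hz),
dt is the time elapsed since the last update, and exp2n() provides a
piecewise-linear approximation of the 2^-x term.  Something along these
lines would make a good starting point for the in-source documentation.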

> that will assist the HWP at finding a reasonably energy-efficient
> P-state given the additional information available to the kernel about
> I/O utilization and scheduling behavior.
> 
> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> ---
>  drivers/cpufreq/intel_pstate.c | 230 +++++++++++++++++++++++++++++++++
>  1 file changed, 230 insertions(+)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 8cb5bf419b40..12ee350db2a9 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -19,6 +19,7 @@
>  #include <linux/list.h>
>  #include <linux/cpu.h>
>  #include <linux/cpufreq.h>
> +#include <linux/debugfs.h>
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
>  #include <linux/fs.h>
> @@ -33,6 +34,8 @@
>  #include <asm/cpufeature.h>
>  #include <asm/intel-family.h>
>  
> +#include "../../kernel/sched/sched.h"
> +
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>  
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> @@ -59,6 +62,11 @@ static inline int32_t mul_fp(int32_t x, int32_t y)
>  	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
>  }
>  
> +static inline int rnd_fp(int32_t x)

What does rnd stand for?

> +{
> +	return (x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
> +}
> +
>  static inline int32_t div_fp(s64 x, s64 y)
>  {
>  	return div64_s64((int64_t)x << FRAC_BITS, y);
> @@ -169,6 +177,49 @@ struct vid_data {
>  	int32_t ratio;
>  };
>  
> +/**
> + * Scheduling and PM statistics gathered by update_vlp_sample() at
> + * every call of the VLP update_state() hook, used as heuristic
> + * inputs.
> + */
> +struct vlp_input_stats {
> +	int32_t realtime_count;
> +	int32_t io_wait_count;
> +	uint32_t max_response_frequency_hz;
> +	uint32_t last_response_frequency_hz;
> +};
> +
> +enum vlp_status {
> +	VLP_BOTTLENECK_IO = 1 << 0,
> +	/*
> +	 * XXX - Add other status bits here indicating a CPU or TDP
> +	 * bottleneck.
> +	 */
> +};
> +
> +/**
> + * Heuristic status information calculated by get_vlp_status_sample()
> + * from struct vlp_input_stats above, indicating whether the system
> + * has a potential IO or latency bottleneck.
> + */
> +struct vlp_status_sample {
> +	enum vlp_status value;
> +	int32_t realtime_avg;
> +};
> +
> +/**
> + * struct vlp_data - VLP controller parameters and state.
> + * @sample_interval_ns:	 Update interval in ns.
> + * @sample_frequency_hz: Reciprocal of the update interval in Hz.
> + */
> +struct vlp_data {
> +	s64 sample_interval_ns;
> +	int32_t sample_frequency_hz;
> +
> +	struct vlp_input_stats stats;
> +	struct vlp_status_sample status;
> +};
> +
>  /**
>   * struct global_params - Global parameters, mostly tunable via sysfs.
>   * @no_turbo:		Whether or not to use turbo P-states.
> @@ -239,6 +290,7 @@ struct cpudata {
>  
>  	struct pstate_data pstate;
>  	struct vid_data vid;
> +	struct vlp_data vlp;
>  
>  	u64	last_update;
>  	u64	last_sample_time;
> @@ -268,6 +320,18 @@ struct cpudata {
>  
>  static struct cpudata **all_cpu_data;
>  
> +/**
> + * struct vlp_params - VLP controller static configuration
> + * @sample_interval_ms:	     Update interval in ms.
> + * @avg*_hz:		     Exponential averaging frequencies of the various
> + *			     low-pass filters as an integer in Hz.
> + */
> +struct vlp_params {
> +	int sample_interval_ms;
> +	int avg_hz;
> +	int debug;
> +};
> +
>  /**
>   * struct pstate_funcs - Per CPU model specific callbacks
>   * @get_max:		Callback to get maximum non turbo effective P state
> @@ -296,6 +360,11 @@ struct pstate_funcs {
>  };
>  
>  static struct pstate_funcs pstate_funcs __read_mostly;
> +static struct vlp_params vlp_params __read_mostly = {
> +	.sample_interval_ms = 10,
> +	.avg_hz = 2,
> +	.debug = 0,
> +};
>  
>  static int hwp_active __read_mostly;
>  static int hwp_mode_bdw __read_mostly;
> @@ -1793,6 +1862,167 @@ static inline int32_t get_target_pstate(struct cpudata *cpu)
>  	return target;
>  }
>  
> +/**
> + * Initialize the struct vlp_data of the specified CPU to the defaults
> + * calculated from @vlp_params.
> + */

Nit: All of the function header comments need to be in the canonical kerneldoc
format, i.e. with the arguments listed etc.

> +static void intel_pstate_reset_vlp(struct cpudata *cpu)
> +{
> +	struct vlp_data *vlp = &cpu->vlp;
> +
> +	vlp->sample_interval_ns = vlp_params.sample_interval_ms * NSEC_PER_MSEC;
> +	vlp->sample_frequency_hz = max(1u, (uint32_t)MSEC_PER_SEC /
> +					   vlp_params.sample_interval_ms);
> +	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
> +}
> +
> +/**
> + * Fixed point representation with twice the usual number of
> + * fractional bits.
> + */
> +#define DFRAC_BITS 16
> +#define DFRAC_ONE (1 << DFRAC_BITS)
> +#define DFRAC_MAX_INT (0u - (uint32_t)DFRAC_ONE)
> +
> +/**
> + * Fast but rather inaccurate piecewise-linear approximation of a
> + * fixed-point inverse exponential:
> + *
> + *  exp2n(p) = int_tofp(1) * 2 ^ (-p / DFRAC_ONE) + O(1)
> + *
> + * The error term should be lower in magnitude than 0.044.
> + */
> +static int32_t exp2n(uint32_t p)
> +{
> +	if (p < 32 * DFRAC_ONE) {
> +		/* Interpolate between 2^-floor(p) and 2^-ceil(p). */
> +		const uint32_t floor_p = p >> DFRAC_BITS;
> +		const uint32_t ceil_p = (p + DFRAC_ONE - 1) >> DFRAC_BITS;
> +		const uint64_t frac_p = p - (floor_p << DFRAC_BITS);
> +
> +		return ((int_tofp(1) >> floor_p) * (DFRAC_ONE - frac_p) +
> +			(ceil_p >= 32 ? 0 : int_tofp(1) >> ceil_p) * frac_p) >>
> +			DFRAC_BITS;
> +	}
> +
> +	/* Short-circuit to avoid overflow. */
> +	return 0;
> +}
> +
> +/**
> + * Calculate the exponential averaging weight for a new sample based
> + * on the requested averaging frequency @hz and the delay since the
> + * last update.
> + */
> +static int32_t get_last_sample_avg_weight(struct cpudata *cpu, unsigned int hz)
> +{
> +	/*
> +	 * Approximate, but saves several 64-bit integer divisions
> +	 * below and should be fully evaluated at compile-time.
> +	 * Causes the exponential averaging to have an effective base
> +	 * of 1.90702343749, which has little functional implications
> +	 * as long as the hz parameter is scaled accordingly.
> +	 */
> +	const uint32_t ns_per_s_shift = order_base_2(NSEC_PER_SEC);
> +	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
> +
> +	return exp2n(min((uint64_t)DFRAC_MAX_INT,
> +			 (hz * delta_ns) >> (ns_per_s_shift - DFRAC_BITS)));
> +}
> +
> +/**
> + * Calculate some status information heuristically based on the struct
> + * vlp_input_stats statistics gathered by the update_state() hook.
> + */
> +static const struct vlp_status_sample *get_vlp_status_sample(
> +	struct cpudata *cpu, const int32_t po)
> +{
> +	struct vlp_data *vlp = &cpu->vlp;
> +	struct vlp_input_stats *stats = &vlp->stats;
> +	struct vlp_status_sample *last_status = &vlp->status;
> +
> +	/*
> +	 * Calculate the VLP_BOTTLENECK_IO state bit, which indicates
> +	 * whether some IO device driver has requested a PM response
> +	 * frequency bound, typically due to the device being under
> +	 * close to full utilization, which should cause the
> +	 * controller to make a more conservative trade-off between
> +	 * latency and energy usage, since performance isn't
> +	 * guaranteed to scale further with increasing CPU frequency
> +	 * whenever the system is close to IO-bound.
> +	 *
> +	 * Note that the maximum achievable response frequency is
> +	 * limited by the sampling frequency of the controller,
> +	 * response frequency requests greater than that will be
> +	 * promoted to infinity (i.e. no low-pass filtering) in order
> +	 * to avoid violating the response frequency constraint
> +	 * provided via PM QoS.
> +	 */
> +	const bool bottleneck_io = stats->max_response_frequency_hz <
> +				   vlp->sample_frequency_hz;
> +
> +	/*
> +	 * Calculate the realtime statistic that tracks the
> +	 * exponentially-averaged rate of occurrence of
> +	 * latency-sensitive events (like wake-ups from IO wait).
> +	 */
> +	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
> +	const int32_t realtime_sample =
> +		div_fp((uint64_t)(stats->realtime_count +
> +				  (bottleneck_io ? 0 : stats->io_wait_count)) *
> +		       NSEC_PER_SEC,
> +		       100 * delta_ns);
> +	const int32_t alpha = get_last_sample_avg_weight(cpu,
> +							 vlp_params.avg_hz);
> +	const int32_t realtime_avg = realtime_sample +
> +		mul_fp(alpha, last_status->realtime_avg - realtime_sample);
> +
> +	/* Consume the input statistics. */
> +	stats->io_wait_count = 0;
> +	stats->realtime_count = 0;
> +	if (bottleneck_io)
> +		stats->last_response_frequency_hz =
> +			stats->max_response_frequency_hz;
> +	stats->max_response_frequency_hz = 0;
> +
> +	/* Update the state of the controller. */
> +	last_status->realtime_avg = realtime_avg;
> +	last_status->value = (bottleneck_io ? VLP_BOTTLENECK_IO : 0);
> +
> +	/* Update state used for tracing. */
> +	cpu->sample.busy_scaled = int_tofp(stats->max_response_frequency_hz);
> +	cpu->iowait_boost = realtime_avg;
> +
> +	return last_status;
> +}
> +
> +/**
> + * Collect some scheduling and PM statistics in response to an
> + * update_state() call.
> + */
> +static bool update_vlp_sample(struct cpudata *cpu, u64 time, unsigned int flags)
> +{
> +	struct vlp_input_stats *stats = &cpu->vlp.stats;
> +
> +	/* Update PM QoS request. */
> +	const uint32_t resp_hz = cpu_response_frequency_qos_limit();
> +
> +	stats->max_response_frequency_hz = !resp_hz ? UINT_MAX :
> +		max(stats->max_response_frequency_hz, resp_hz);
> +
> +	/* Update scheduling statistics. */
> +	if ((flags & SCHED_CPUFREQ_IOWAIT))
> +		stats->io_wait_count++;
> +
> +	if (cpu_rq(cpu->cpu)->rt.rt_nr_running)
> +		stats->realtime_count++;
> +
> +	/* Return whether a P-state update is due. */
> +	return smp_processor_id() == cpu->cpu &&
> +		time - cpu->sample.time >= cpu->vlp.sample_interval_ns &&
> +		intel_pstate_sample(cpu, time);
> +}
> +
>  static int intel_pstate_prepare_request(struct cpudata *cpu, int pstate)
>  {
>  	int min_pstate = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
> 




_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation.
  2020-03-10 21:41 ` [Intel-gfx] [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation Francisco Jerez
@ 2020-03-19 11:12   ` Rafael J. Wysocki
  0 siblings, 0 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2020-03-19 11:12 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: Peter Zijlstra, intel-gfx, Pandruvada, Srinivas, linux-pm

On Tuesday, March 10, 2020 10:41:59 PM CET Francisco Jerez wrote:
> The function introduced here calculates a P-state range derived from
> the statistics computed in the previous patch which will be used to
> drive the HWP P-state range or (if HWP is not available) as basis for
> some additional kernel-side frequency selection mechanism which will
> choose a single P-state from the range.  This is meant to provide a
> variably low-pass filtering effect that will damp oscillations faster than
> a frequency threshold that can be specified by device drivers via PM QoS
> in order to achieve energy-efficient behavior in cases where the
> system has an IO bottleneck.
> 
> Signed-off-by: Francisco Jerez <currojerez@riseup.net>

The separation of this patch from the other one appears to be artificial
and it actually makes reviewing them both harder from my perspective.

What would be wrong with merging them together?

> ---
>  drivers/cpufreq/intel_pstate.c | 157 +++++++++++++++++++++++++++++++++
>  1 file changed, 157 insertions(+)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 12ee350db2a9..cecadfec8bc1 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -207,17 +207,34 @@ struct vlp_status_sample {
>  	int32_t realtime_avg;
>  };
>  
> +/**
> + * VLP controller state used for the estimation of the target P-state
> + * range, computed by get_vlp_target_range() from the heuristic status
> + * information defined above in struct vlp_status_sample.
> + */
> +struct vlp_target_range {
> +	unsigned int value[2];
> +	int32_t p_base;
> +};
> +
>  /**
>   * struct vlp_data - VLP controller parameters and state.
>   * @sample_interval_ns:	 Update interval in ns.
>   * @sample_frequency_hz: Reciprocal of the update interval in Hz.
> + * @gain*:		 Response factor of the controller relative to each
> + *			 one of its linear input variables as fixed-point
> + *			 fraction.
>   */
>  struct vlp_data {
>  	s64 sample_interval_ns;
>  	int32_t sample_frequency_hz;
> +	int32_t gain_aggr;
> +	int32_t gain_rt;
> +	int32_t gain;
>  
>  	struct vlp_input_stats stats;
>  	struct vlp_status_sample status;
> +	struct vlp_target_range target;
>  };
>  
>  /**
> @@ -323,12 +340,18 @@ static struct cpudata **all_cpu_data;
>  /**
>   * struct vlp_params - VLP controller static configuration
>   * @sample_interval_ms:	     Update interval in ms.
> + * @setpoint_*_pml:	     Target CPU utilization at which the controller is
> + *			     expected to leave the current P-state untouched,
> + *			     as an integer per mille.
>   * @avg*_hz:		     Exponential averaging frequencies of the various
>   *			     low-pass filters as an integer in Hz.
>   */
>  struct vlp_params {
>  	int sample_interval_ms;
> +	int setpoint_0_pml;
> +	int setpoint_aggr_pml;
>  	int avg_hz;
> +	int realtime_gain_pml;
>  	int debug;
>  };
>  
> @@ -362,7 +385,10 @@ struct pstate_funcs {
>  static struct pstate_funcs pstate_funcs __read_mostly;
>  static struct vlp_params vlp_params __read_mostly = {
>  	.sample_interval_ms = 10,
> +	.setpoint_0_pml = 900,
> +	.setpoint_aggr_pml = 1500,
>  	.avg_hz = 2,
> +	.realtime_gain_pml = 12000,
>  	.debug = 0,
>  };
>  
> @@ -1873,6 +1899,11 @@ static void intel_pstate_reset_vlp(struct cpudata *cpu)
>  	vlp->sample_interval_ns = vlp_params.sample_interval_ms * NSEC_PER_MSEC;
>  	vlp->sample_frequency_hz = max(1u, (uint32_t)MSEC_PER_SEC /
>  					   vlp_params.sample_interval_ms);
> +	vlp->gain_rt = div_fp(cpu->pstate.max_pstate *
> +			      vlp_params.realtime_gain_pml, 1000);
> +	vlp->gain_aggr = max(1, div_fp(1000, vlp_params.setpoint_aggr_pml));
> +	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
> +	vlp->target.p_base = 0;
>  	vlp->stats.last_response_frequency_hz = vlp_params.avg_hz;
>  }
>  
> @@ -1996,6 +2027,132 @@ static const struct vlp_status_sample *get_vlp_status_sample(
>  	return last_status;
>  }
>  
> +/**
> + * Calculate the target P-state range for the next update period.
> + * Uses a variably low-pass-filtering controller intended to improve
> + * energy efficiency when a CPU response frequency target is specified
> + * via PM QoS (e.g. under IO-bound conditions).
> + */
> +static const struct vlp_target_range *get_vlp_target_range(struct cpudata *cpu)
> +{
> +	struct vlp_data *vlp = &cpu->vlp;
> +	struct vlp_target_range *last_target = &vlp->target;
> +
> +	/*
> +	 * P-state limits in fixed-point as allowed by the policy.
> +	 */
> +	const int32_t p0 = int_tofp(max(cpu->pstate.min_pstate,
> +					cpu->min_perf_ratio));
> +	const int32_t p1 = int_tofp(cpu->max_perf_ratio);
> +
> +	/*
> +	 * Observed average P-state during the sampling period.	 The
> +	 * conservative path (po_cons) uses the TSC increment as
> +	 * denominator which will give the minimum (arguably most
> +	 * energy-efficient) P-state able to accomplish the observed
> +	 * amount of work during the sampling period.
> +	 *
> +	 * The downside of that somewhat optimistic estimate is that
> +	 * it can give a biased result for intermittent
> +	 * latency-sensitive workloads, which may have to be completed
> +	 * in a short window of time for the system to achieve maximum
> +	 * performance, even if the average CPU utilization is low.
> +	 * For that reason the aggressive path (po_aggr) uses the
> +	 * MPERF increment as denominator, which is approximately
> +	 * optimal under the pessimistic assumption that the CPU work
> +	 * cannot be parallelized with any other dependent IO work
> +	 * that subsequently keeps the CPU idle (partly in C1+
> +	 * states).
> +	 */
> +	const int32_t po_cons =
> +		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
> +		       * cpu->pstate.max_pstate_physical,
> +		       cpu->sample.tsc);
> +	const int32_t po_aggr =
> +		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
> +		       * cpu->pstate.max_pstate_physical,
> +		       (cpu->sample.mperf << cpu->aperf_mperf_shift));
> +
> +	const struct vlp_status_sample *status =
> +		get_vlp_status_sample(cpu, po_cons);
> +
> +	/* Calculate the target P-state. */
> +	const int32_t p_tgt_cons = mul_fp(vlp->gain, po_cons);
> +	const int32_t p_tgt_aggr = mul_fp(vlp->gain_aggr, po_aggr);
> +	const int32_t p_tgt = max(p0, min(p1, max(p_tgt_cons, p_tgt_aggr)));
> +
> +	/* Calculate the realtime P-state target lower bound. */
> +	const int32_t pm = int_tofp(cpu->pstate.max_pstate);
> +	const int32_t p_tgt_rt = min(pm, mul_fp(vlp->gain_rt,
> +						status->realtime_avg));
> +
> +	/*
> +	 * Low-pass filter the P-state estimate above by exponential
> +	 * averaging.  For an oscillating workload (e.g. submitting
> +	 * work repeatedly to a device like a soundcard or GPU) this
> +	 * will approximate the minimum P-state that would be able to
> +	 * accomplish the observed amount of work during the averaging
> +	 * period, which is also the optimally energy-efficient one,
> +	 * under the assumptions that:
> +	 *
> +	 *  - The power curve of the system is convex throughout the
> +	 *    range of P-states allowed by the policy. I.e. energy
> +	 *    efficiency is steadily decreasing with frequency past p0
> +	 *    (which is typically close to the maximum-efficiency
> +	 *    ratio).  In practice for the lower range of P-states
> +	 *    this may only be approximately true due to the
> +	 *    interaction between different components of the system.
> +	 *
> +	 *  - Parallelism constraints of the workload don't prevent it
> +	 *    from achieving the same throughput at the lower P-state.
> +	 *    This will happen in cases where the application is
> +	 *    designed in a way that doesn't allow for dependent CPU
> +	 *    and IO jobs to be pipelined, leading to alternating full
> +	 *    and zero utilization of the CPU and IO device.  This
> +	 *    will give an average IO device utilization lower than
> +	 *    100% regardless of the CPU frequency, which should
> +	 *    prevent the device driver from requesting a response
> +	 *    frequency bound, so the filtered P-state calculated
> +	 *    below won't have an influence on the controller
> +	 *    response.
> +	 *
> +	 *  - The period of the oscillating workload is significantly
> +	 *    shorter than the time constant of the exponential
> +	 *    average (1s / last_response_frequency_hz).  Otherwise for
> +	 *    more slowly oscillating workloads the controller
> +	 *    response will roughly follow the oscillation, leading to
> +	 *    decreased energy efficiency.
> +	 *
> +	 *  - The behavior of the workload doesn't change
> +	 *    qualitatively during the next update interval.  This is
> +	 *    only true in the steady state, and could possibly lead
> +	 *    to a transitory period in which the controller response
> +	 *    deviates from the most energy-efficient ratio until the
> +	 *    workload reaches a steady state again.
> +	 */
> +	const int32_t alpha = get_last_sample_avg_weight(
> +		cpu, vlp->stats.last_response_frequency_hz);
> +
> +	last_target->p_base = p_tgt + mul_fp(alpha,
> +					     last_target->p_base - p_tgt);
> +
> +	/*
> +	 * Use the low-pass-filtered controller response for better
> +	 * energy efficiency unless we have reasons to believe that
> +	 * some of the optimality assumptions discussed above may not
> +	 * hold.
> +	 */
> +	if ((status->value & VLP_BOTTLENECK_IO)) {
> +		last_target->value[0] = rnd_fp(p0);
> +		last_target->value[1] = rnd_fp(last_target->p_base);
> +	} else {
> +		last_target->value[0] = rnd_fp(p_tgt_rt);
> +		last_target->value[1] = rnd_fp(p1);
> +	}
> +
> +	return last_target;
> +}
> +
>  /**
>   * Collect some scheduling and PM statistics in response to an
>   * update_state() call.
> 




_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
  2020-03-10 21:42 ` [Intel-gfx] [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID Francisco Jerez
@ 2020-03-19 11:20   ` Rafael J. Wysocki
  0 siblings, 0 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2020-03-19 11:20 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: Peter Zijlstra, intel-gfx, Pandruvada, Srinivas, linux-pm

On Tuesday, March 10, 2020 10:42:01 PM CET Francisco Jerez wrote:
> For the moment the VLP controller is only enabled on ICL platforms
> other than server FADT profiles in order to reduce the validation
> effort of the initial submission.  It should work on any other
> processors that support HWP though (and soon enough on non-HWP too):
> In order to override the default behavior (e.g. to test on other
> platforms) the VLP controller can be forcefully enabled or disabled by
> passing "intel_pstate=vlp" or "intel_pstate=no_vlp" respectively in
> the kernel command line.
> 
> v2: Handle HWP VLP controller.
> 
> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 ++++
>  Documentation/admin-guide/pm/intel_pstate.rst |  7 ++++++
>  drivers/cpufreq/intel_pstate.c                | 25 +++++++++++++++++--
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 0c9894247015..9bc55fc2752e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1828,6 +1828,11 @@
>  			per_cpu_perf_limits
>  			  Allow per-logical-CPU P-State performance control limits using
>  			  cpufreq sysfs interface
> +			vlp
> +			  Force use of VLP P-state controller.  Overrides selection
> +			  derived from ACPI FADT profile.
> +			no_vlp
> +			  Prevent use of VLP P-state controller (see "vlp" parameter).
>  
>  	intremap=	[X86-64, Intel-IOMMU]
>  			on	enable Interrupt Remapping (default)
> diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
> index 67e414e34f37..da6b64812848 100644
> --- a/Documentation/admin-guide/pm/intel_pstate.rst
> +++ b/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -669,6 +669,13 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
>  	Use per-logical-CPU P-State limits (see `Coordination of P-state
>  	Limits`_ for details).
>  
> +``vlp``
> +	Force use of VLP P-state controller.  Overrides selection derived
> +	from ACPI FADT profile.
> +
> +``no_vlp``
> +	Prevent use of VLP P-state controller (see "vlp" parameter).
> +
>  
>  Diagnostics and Tuning
>  ======================
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index a01eed40d897..050cc8f03c26 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -3029,6 +3029,7 @@ static int intel_pstate_update_status(const char *buf, size_t size)
>  
>  static int no_load __initdata;
>  static int no_hwp __initdata;
> +static int vlp __initdata = -1;
>  static int hwp_only __initdata;
>  static unsigned int force_load __initdata;
>  
> @@ -3193,6 +3194,7 @@ static inline void intel_pstate_request_control_from_smm(void) {}
>  #endif /* CONFIG_ACPI */
>  
>  #define INTEL_PSTATE_HWP_BROADWELL	0x01
> +#define INTEL_PSTATE_HWP_VLP		0x02
>  
>  #define ICPU_HWP(model, hwp_mode) \
>  	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_HWP, hwp_mode }
> @@ -3200,12 +3202,15 @@ static inline void intel_pstate_request_control_from_smm(void) {}
>  static const struct x86_cpu_id hwp_support_ids[] __initconst = {
>  	ICPU_HWP(INTEL_FAM6_BROADWELL_X, INTEL_PSTATE_HWP_BROADWELL),
>  	ICPU_HWP(INTEL_FAM6_BROADWELL_D, INTEL_PSTATE_HWP_BROADWELL),
> +	ICPU_HWP(INTEL_FAM6_ICELAKE, INTEL_PSTATE_HWP_VLP),
> +	ICPU_HWP(INTEL_FAM6_ICELAKE_L, INTEL_PSTATE_HWP_VLP),
>  	ICPU_HWP(X86_MODEL_ANY, 0),
>  	{}
>  };
>  
>  static int __init intel_pstate_init(void)
>  {
> +	bool use_vlp = vlp == 1;
>  	const struct x86_cpu_id *id;
>  	int rc;
>  
> @@ -3222,8 +3227,19 @@ static int __init intel_pstate_init(void)
>  			pstate_funcs.update_util = intel_pstate_update_util;
>  		} else {
>  			hwp_active++;
> -			pstate_funcs.update_util = intel_pstate_update_util_hwp;
> -			hwp_mode_bdw = id->driver_data;
> +
> +			if (vlp < 0 && !intel_pstate_acpi_pm_profile_server() &&
> +			    (id->driver_data & INTEL_PSTATE_HWP_VLP)) {
> +				/* Enable VLP controller by default. */
> +				use_vlp = true;
> +			}
> +
> +			pstate_funcs.update_util = use_vlp ?
> +				intel_pstate_update_util_hwp_vlp :
> +				intel_pstate_update_util_hwp;

This is basically only good for a prototype, in my view.

There is an interface for selecting scaling algorithms in cpufreq already and
in order to avoid confusion, that one needs to be extended instead of adding
extra driver parameters for that.
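
That is, the choice would be made through the usual sysfs interface,
e.g. (the "vlp" governor name here is only illustrative, it doesn't
exist today):

	# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
	performance powersave vlp
	# echo vlp > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor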

I'm also a bit concerned about running all of the heavy computations in
the scheduler context.

> +
> +			hwp_mode_bdw = (id->driver_data &
> +					INTEL_PSTATE_HWP_BROADWELL);
>  			intel_pstate.attr = hwp_cpufreq_attrs;
>  			goto hwp_cpu_matched;
>  		}
> @@ -3301,6 +3317,11 @@ static int __init intel_pstate_setup(char *str)
>  	if (!strcmp(str, "per_cpu_perf_limits"))
>  		per_cpu_limits = true;
>  
> +	if (!strcmp(str, "vlp"))
> +		vlp = 1;
> +	if (!strcmp(str, "no_vlp"))
> +		vlp = 0;
> +
>  #ifdef CONFIG_ACPI
>  	if (!strcmp(str, "support_acpi_ppc"))
>  		acpi_ppc = true;
> 





^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-18 19:42       ` Francisco Jerez
@ 2020-03-20  2:46         ` Francisco Jerez
  2020-03-20 10:06           ` Chris Wilson
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-20  2:46 UTC (permalink / raw)
  To: chris.p.wilson, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas



Francisco Jerez <currojerez@riseup.net> writes:

> Francisco Jerez <currojerez@riseup.net> writes:
>
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>>
>>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> index b9b3f78f1324..a5d7a80b826d 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>>         /* we need to manually load the submit queue */
>>>>         if (execlists->ctrl_reg)
>>>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>>> +
>>>> +       if (execlists_num_ports(execlists) > 1 &&
>>> pending[1] is always defined, the minimum submission is one slot, with
>>> pending[1] as the sentinel NULL.
>>>
>>>> +           execlists->pending[1] &&
>>>> +           !atomic_xchg(&execlists->overload, 1))
>>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>>
>>> engine->gt
>>>
>>
>> Applied your suggestions above locally, will probably wait to have a few
>> more changes batched up before sending a v2.
>>
>>>>  }
>>>>  
>>>>  static bool ctx_single_port_submission(const struct intel_context *ce)
>>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>>  
>>>>         WRITE_ONCE(execlists->active, execlists->inflight);
>>>> +
>>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>>> +               struct intel_engine_cs *engine =
>>>> +                       container_of(execlists, typeof(*engine), execlists);
>>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>>> +       }
>>>>  }
>>>>  
>>>>  static inline void
>>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>>                         /* port0 completed, advanced to port1 */
>>>>                         trace_ports(execlists, "completed", execlists->active);
>>>>  
>>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>>
>>> So this looses track if we preempt a dual-ELSP submission with a
>>> single-ELSP submission (and never go back to dual).
>>>
>>
>> Yes, good point.  You're right that if a dual-ELSP submission gets
>> preempted by a single-ELSP submission "overload" will remain signaled
>> until the first completion interrupt arrives (e.g. from the preempting
>> submission).
>>
>>> If you move this to the end of the loop and check
>>>
>>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>> 	intel_gt_pm_active_end(engine->gt);
>>>
>>> so that it covers both preemption/promotion and completion.
>>>
>>
>> That sounds reasonable.
>>
>>> However, that will fluctuate quite rapidly. (And runs the risk of
>>> exceeding the sentinel.)
>>>
>>> An alternative approach would be to couple along
>>> schedule_in/schedule_out
>>>
>>> atomic_set(overload, -1);
>>>
>>> __execlists_schedule_in:
>>> 	if (!atomic_fetch_inc(overload)
>>> 		intel_gt_pm_active_begin(engine->gt);
>>> __execlists_schedule_out:
>>> 	if (!atomic_dec_return(overload)
>>> 		intel_gt_pm_active_end(engine->gt);
>>>
>>> which would mean we are overloaded as soon as we try to submit an
>>> overlapping ELSP.
>>>
>>
>> That sounds good to me too, and AFAICT would have roughly the same
>> behavior as this metric except for the preemption corner case you
>> mention above.  I'll try this and verify that I get approximately the
>> same performance numbers.
>>
>
> This suggestion seems to lead to some minor regressions, I'm
> investigating the issue.  Will send a v2 as soon as I have something
> along the lines of what you suggested running with equivalent
> performance to v1.

I think I've figured out why both of the alternatives we were talking
about above lead to a couple percent regressions in latency-sensitive
workloads: In some scenarios it's possible for execlists_dequeue() to
execute after the GPU has gone idle, but before we've processed the
corresponding CSB entries, particularly when called from the
submit_queue() path.  In that case __execlists_schedule_in() will think
that the next request is overlapping, and tell CPU power management to
relax, even though the GPU is starving intermittently.

How about we do the same:

|       if (atomic_xchg(&execlists->overload, 0))
|               intel_gt_pm_active_end(engine->gt);

as in this patch from process_csb() in response to each completion CSB
entry, which ensures that the system is considered non-GPU-bound as soon
as the first context completes.  Subsequently if another CSB entry
signals a dual-ELSP active-to-idle transition or a dual-ELSP preemption
we call intel_gt_pm_active_begin() directly from process_csb().  If we
hit a single-ELSP preemption CSB entry we call intel_gt_pm_active_end()
instead, in order to avoid the problem you pointed out in your previous
email.
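
In rough code form, the per-CSB-entry handling I have in mind would be
something like the following sketch ("promote" stands for process_csb()'s
existing promotion/completion distinction, and the details of classifying
each entry are glossed over):

	if (!promote) {
		/* Completion entry: a context just finished, so stop
		 * reporting the GPU as overloaded right away. */
		if (atomic_xchg(&execlists->overload, 0))
			intel_gt_pm_active_end(engine->gt);
	} else if (execlists->active[1]) {
		/* Promotion to a dual-ELSP submission: GPU-bound. */
		if (!atomic_xchg(&execlists->overload, 1))
			intel_gt_pm_active_begin(engine->gt);
	} else {
		/* Promotion to a single-ELSP submission (preemption):
		 * the ELSP overlap is gone. */
		if (atomic_xchg(&execlists->overload, 0))
			intel_gt_pm_active_end(engine->gt);
	}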

How does that sound to you?  [Still need to verify that it has
comparable performance to this patch overall.]

Thanks!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
  2020-03-20  2:46         ` Francisco Jerez
@ 2020-03-20 10:06           ` Chris Wilson
  0 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2020-03-20 10:06 UTC (permalink / raw)
  To: Francisco Jerez, intel-gfx, linux-pm
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas

Quoting Francisco Jerez (2020-03-20 02:46:19)
> Francisco Jerez <currojerez@riseup.net> writes:
> 
> > Francisco Jerez <currojerez@riseup.net> writes:
> >
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >>
> >>> Quoting Francisco Jerez (2020-03-10 21:41:55)
> >>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>> index b9b3f78f1324..a5d7a80b826d 100644
> >>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
> >>>>         /* we need to manually load the submit queue */
> >>>>         if (execlists->ctrl_reg)
> >>>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
> >>>> +
> >>>> +       if (execlists_num_ports(execlists) > 1 &&
> >>> pending[1] is always defined, the minimum submission is one slot, with
> >>> pending[1] as the sentinel NULL.
> >>>
> >>>> +           execlists->pending[1] &&
> >>>> +           !atomic_xchg(&execlists->overload, 1))
> >>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
> >>>
> >>> engine->gt
> >>>
> >>
> >> Applied your suggestions above locally, will probably wait to have a few
> >> more changes batched up before sending a v2.
> >>
> >>>>  }
> >>>>  
> >>>>  static bool ctx_single_port_submission(const struct intel_context *ce)
> >>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
> >>>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
> >>>>  
> >>>>         WRITE_ONCE(execlists->active, execlists->inflight);
> >>>> +
> >>>> +       if (atomic_xchg(&execlists->overload, 0)) {
> >>>> +               struct intel_engine_cs *engine =
> >>>> +                       container_of(execlists, typeof(*engine), execlists);
> >>>> +               intel_gt_pm_active_end(&engine->i915->gt);
> >>>> +       }
> >>>>  }
> >>>>  
> >>>>  static inline void
> >>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
> >>>>                         /* port0 completed, advanced to port1 */
> >>>>                         trace_ports(execlists, "completed", execlists->active);
> >>>>  
> >>>> +                       if (atomic_xchg(&execlists->overload, 0))
> >>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
> >>>
> >>> So this looses track if we preempt a dual-ELSP submission with a
> >>> single-ELSP submission (and never go back to dual).
> >>>
> >>
> >> Yes, good point.  You're right that if a dual-ELSP submission gets
> >> preempted by a single-ELSP submission "overload" will remain signaled
> >> until the first completion interrupt arrives (e.g. from the preempting
> >> submission).
> >>
> >>> If you move this to the end of the loop and check
> >>>
> >>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
> >>>     intel_gt_pm_active_end(engine->gt);
> >>>
> >>> so that it covers both preemption/promotion and completion.
> >>>
> >>
> >> That sounds reasonable.
> >>
> >>> However, that will fluctuate quite rapidly. (And runs the risk of
> >>> exceeding the sentinel.)
> >>>
> >>> An alternative approach would be to couple along
> >>> schedule_in/schedule_out
> >>>
> >>> atomic_set(overload, -1);
> >>>
> >>> __execlists_schedule_in:
> >>>     if (!atomic_fetch_inc(overload)
> >>>             intel_gt_pm_active_begin(engine->gt);
> >>> __execlists_schedule_out:
> >>>     if (!atomic_dec_return(overload)
> >>>             intel_gt_pm_active_end(engine->gt);
> >>>
> >>> which would mean we are overloaded as soon as we try to submit an
> >>> overlapping ELSP.
> >>>
> >>
> >> That sounds good to me too, and AFAICT would have roughly the same
> >> behavior as this metric except for the preemption corner case you
> >> mention above.  I'll try this and verify that I get approximately the
> >> same performance numbers.
> >>
> >
> > This suggestion seems to lead to some minor regressions, I'm
> > investigating the issue.  Will send a v2 as soon as I have something
> > along the lines of what you suggested running with equivalent
> > performance to v1.
> 
> I think I've figured out why both of the alternatives we were talking
> about above lead to a couple percent regressions in latency-sensitive
> workloads: In some scenarios it's possible for execlist_dequeue() to
> execute after the GPU has gone idle, but before we've processed the
> corresponding CSB entries, particularly when called from the
> submit_queue() path.  In that case __execlists_schedule_in() will think
> that the next request is overlapping, and tell CPU power management to
> relax, even though the GPU is starving intermittently.
> 
> How about we do the same:
> 
> |       if (atomic_xchg(&execlists->overload, 0))
> |               intel_gt_pm_active_end(engine->gt);
> 
> as in this patch from process_csb() in response to each completion CSB
> entry, which ensures that the system is considered non-GPU-bound as soon
> as the first context completes.  Subsequently if another CSB entry
> signals a dual-ELSP active-to-idle transition or a dual-ELSP preemption
> we call intel_gt_pm_active_begin() directly from process_csb().  If we
> hit a single-ELSP preemption CSB entry we call intel_gt_pm_active_end()
> instead, in order to avoid the problem you pointed out in your previous
> email.
> 
> How does that sound to you?  [Still need to verify that it has
> comparable performance to this patch overall.]

Sounds like we're trying to compensate for ksoftirqd latency, which is a
killer overall. How about something as simple as

execlists_submit_ports:
	tasklet_hi_schedule(&execlists->tasklet);

which will then be run immediately from local context at the end of the
direct submission... Unless it's already queued on another CPU. Instead
of waiting for that, we may manually try to kick it locally.
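
Something along these lines, maybe -- purely illustrative, glossing over
the locking/ordering details of the direct-submission path, and note the
queued tasklet may still run once more from softirq context afterwards:

	tasklet_hi_schedule(&execlists->tasklet);

	/* Try to claim the tasklet and run it right here on the local
	 * CPU instead of waiting for ksoftirqd; if another CPU already
	 * owns it, leave it to the scheduled run. */
	if (tasklet_trylock(&execlists->tasklet)) {
		execlists->tasklet.func(execlists->tasklet.data);
		tasklet_unlock(&execlists->tasklet);
	}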

As your latency governor is kicked from a worker, iirc, we should still
be executing before it has a chance to process a partial update. I hope.

Anyway, if it is the ksoftirqd latency hurting here, it's not a problem
unique to the governor and I would like to improve it :)
-Chris

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
  2020-03-18 20:22         ` Francisco Jerez
@ 2020-03-23 20:13           ` Pandruvada, Srinivas
  0 siblings, 0 replies; 44+ messages in thread
From: Pandruvada, Srinivas @ 2020-03-23 20:13 UTC (permalink / raw)
  To: linux-pm, currojerez, intel-gfx; +Cc: peterz, rjw

On Wed, 2020-03-18 at 13:22 -0700, Francisco Jerez wrote:
> "Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:
> 
> > On Wed, 2020-03-18 at 12:51 -0700, Francisco Jerez wrote:
> > > "Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:
> > > 
> > > > On Tue, 2020-03-10 at 14:42 -0700, Francisco Jerez wrote:
> > > > > This implements a simple variably low-pass-filtering governor
> > > > > in
> > > > > control of the HWP MIN/MAX PERF range based on the previously
> > > > > introduced get_vlp_target_range().  See "cpufreq:
> > > > > intel_pstate:
> > > > > Implement VLP controller target P-state range estimation."
> > > > > for
> > > > > the
> > > > > rationale.
> > > > 
> > > > I just gave a try on a pretty idle system with just systemd
> > > > processes
> > > > and usual background tasks with nomodset. 
> > > > 
> > > > I see that there HWP min is getting changed between 4-8. Why
> > > > are
> > > > changing HWP dynamic range even on an idle system running no
> > > > where
> > > > close to TDP?
> > > > 
> > > 
> > > The HWP request range is clamped to the frequency range specified
> > > by
> > > the
> > > CPUFREQ policy and to the cpu->pstate.min_pstate bound.
> > > 
> > > If you see the HWP minimum fluctuating above that it's likely a
> > > sign
> > > of
> > > your system not being completely idle -- If that's the case it's
> > > likely
> > > to go away after you do:
> > > 
> > >  echo 0 > /sys/kernel/debug/pstate_snb/vlp_realtime_gain_pml
> > > 
> > The objective which I though was to improve performance of GPU
> > workloads limited by TDP because of P-states ramping up and
> > resulting
> > in less power to GPU to complete a task.
> >  
> > HWP takes decision not on just load on a CPU but several other
> > factors
> > like total SoC power and scalability. We don't want to disturb HWP
> > algorithms when there is no TDP limitations. If writing 0, causes
> > this
> > behavior then that should be the default.
> > 
> 
> The heuristic disabled by that debugfs file is there to avoid
> regressions in latency-sensitive workloads as you can probably get
> from
> the ecomments.  However ISTR those regressions were specific to non-
> HWP
> systems, so I wouldn't mind disabling it for the moment (or punting
> it
> to the non-HWP series if you like)j.  But first I need to verify that
> there are no performance regressions on HWP systems after changing
> that.
> Can you confirm that the debugfs write above prevents the behavior
> you'd
> like to avoid?
It does prevent it. I monitored for 10 minutes and didn't see any
hwp_req update.

Thanks,
Srinivas

> 
> > Thanks,
> > Srinivas
> > 
> > 
> > 
> > 
> > 
> > > > Thanks,
> > > > Srinivas
> > > > 
> > > > 
> > > > > Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> > > > > ---
> > > > >  drivers/cpufreq/intel_pstate.c | 79
> > > > > +++++++++++++++++++++++++++++++++-
> > > > >  1 file changed, 77 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/cpufreq/intel_pstate.c
> > > > > b/drivers/cpufreq/intel_pstate.c
> > > > > index cecadfec8bc1..a01eed40d897 100644
> > > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > > @@ -1905,6 +1905,20 @@ static void
> > > > > intel_pstate_reset_vlp(struct
> > > > > cpudata *cpu)
> > > > >  	vlp->gain = max(1, div_fp(1000,
> > > > > vlp_params.setpoint_0_pml));
> > > > >  	vlp->target.p_base = 0;
> > > > >  	vlp->stats.last_response_frequency_hz =
> > > > > vlp_params.avg_hz;
> > > > > +
> > > > > +	if (hwp_active) {
> > > > > +		const uint32_t p0 = max(cpu->pstate.min_pstate,
> > > > > +					cpu->min_perf_ratio);
> > > > > +		const uint32_t p1 = max_t(uint32_t, p0, cpu-
> > > > > > max_perf_ratio);
> > > > > +		const uint64_t hwp_req = (READ_ONCE(cpu-
> > > > > > hwp_req_cached) &
> > > > > +					  ~(HWP_MAX_PERF(~0L) |
> > > > > +					    HWP_MIN_PERF(~0L) |
> > > > > +					    HWP_DESIRED_PERF(~0
> > > > > L))) |
> > > > > +					 HWP_MIN_PERF(p0) |
> > > > > HWP_MAX_PERF(p1);
> > > > > +
> > > > > +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST,
> > > > > hwp_req);
> > > > > +		cpu->hwp_req_cached = hwp_req;
> > > > > +	}
> > > > >  }
> > > > >  
> > > > >  /**
> > > > > @@ -2222,6 +2236,46 @@ static void
> > > > > intel_pstate_adjust_pstate(struct
> > > > > cpudata *cpu)
> > > > >  		fp_toint(cpu->iowait_boost * 100));
> > > > >  }
> > > > >  
> > > > > +static void intel_pstate_adjust_pstate_range(struct cpudata
> > > > > *cpu,
> > > > > +					     const unsigned int
> > > > > range[])
> > > > > +{
> > > > > +	const int from = cpu->hwp_req_cached;
> > > > > +	unsigned int p0, p1, p_min, p_max;
> > > > > +	struct sample *sample;
> > > > > +	uint64_t hwp_req;
> > > > > +
> > > > > +	update_turbo_state();
> > > > > +
> > > > > +	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
> > > > > +	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
> > > > > +	p_min = clamp_t(unsigned int, range[0], p0, p1);
> > > > > +	p_max = clamp_t(unsigned int, range[1], p0, p1);
> > > > > +
> > > > > +	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu-
> > > > > >cpu);
> > > > > +
> > > > > +	hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
> > > > > +		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
> > > > > +		     HWP_DESIRED_PERF(~0L))) |
> > > > > +		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 :
> > > > > p_min) |
> > > > > +		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 :
> > > > > p_max);
> > > > > +
> > > > > +	if (hwp_req != cpu->hwp_req_cached) {
> > > > > +		wrmsrl(MSR_HWP_REQUEST, hwp_req);
> > > > > +		cpu->hwp_req_cached = hwp_req;
> > > > > +	}
> > > > > +
> > > > > +	sample = &cpu->sample;
> > > > > +	trace_pstate_sample(mul_ext_fp(100, sample-
> > > > > >core_avg_perf),
> > > > > +			    fp_toint(sample->busy_scaled),
> > > > > +			    from,
> > > > > +			    hwp_req,
> > > > > +			    sample->mperf,
> > > > > +			    sample->aperf,
> > > > > +			    sample->tsc,
> > > > > +			    get_avg_frequency(cpu),
> > > > > +			    fp_toint(cpu->iowait_boost * 100));
> > > > > +}
> > > > > +
> > > > >  static void intel_pstate_update_util(struct update_util_data
> > > > > *data,
> > > > > u64 time,
> > > > >  				     unsigned int flags)
> > > > >  {
> > > > > @@ -2260,6 +2314,22 @@ static void
> > > > > intel_pstate_update_util(struct
> > > > > update_util_data *data, u64 time,
> > > > >  		intel_pstate_adjust_pstate(cpu);
> > > > >  }
> > > > >  
> > > > > +/**
> > > > > + * Implementation of the cpufreq update_util hook based on
> > > > > the
> > > > > VLP
> > > > > + * controller (see get_vlp_target_range()).
> > > > > + */
> > > > > +static void intel_pstate_update_util_hwp_vlp(struct
> > > > > update_util_data
> > > > > *data,
> > > > > +					     u64 time, unsigned
> > > > > int
> > > > > flags)
> > > > > +{
> > > > > +	struct cpudata *cpu = container_of(data, struct
> > > > > cpudata,
> > > > > update_util);
> > > > > +
> > > > > +	if (update_vlp_sample(cpu, time, flags)) {
> > > > > +		const struct vlp_target_range *target =
> > > > > +			get_vlp_target_range(cpu);
> > > > > +		intel_pstate_adjust_pstate_range(cpu, target-
> > > > > >value);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > >  static struct pstate_funcs core_funcs = {
> > > > >  	.get_max = core_get_max_pstate,
> > > > >  	.get_max_physical = core_get_max_pstate_physical,
> > > > > @@ -2389,6 +2459,9 @@ static int
> > > > > intel_pstate_init_cpu(unsigned
> > > > > int
> > > > > cpunum)
> > > > >  
> > > > >  	intel_pstate_get_cpu_pstates(cpu);
> > > > >  
> > > > > +	if (pstate_funcs.update_util ==
> > > > > intel_pstate_update_util_hwp_vlp)
> > > > > +		intel_pstate_reset_vlp(cpu);
> > > > > +
> > > > >  	pr_debug("controlling: cpu %d\n", cpunum);
> > > > >  
> > > > >  	return 0;
> > > > > @@ -2398,7 +2471,8 @@ static void
> > > > > intel_pstate_set_update_util_hook(unsigned int cpu_num)
> > > > >  {
> > > > >  	struct cpudata *cpu = all_cpu_data[cpu_num];
> > > > >  
> > > > > -	if (hwp_active && !hwp_boost)
> > > > > +	if (hwp_active && !hwp_boost &&
> > > > > +	    pstate_funcs.update_util !=
> > > > > intel_pstate_update_util_hwp_vlp)
> > > > >  		return;
> > > > >  
> > > > >  	if (cpu->update_util_set)
> > > > > @@ -2526,7 +2600,8 @@ static int
> > > > > intel_pstate_set_policy(struct
> > > > > cpufreq_policy *policy)
> > > > >  		 * was turned off, in that case we need to
> > > > > clear the
> > > > >  		 * update util hook.
> > > > >  		 */
> > > > > -		if (!hwp_boost)
> > > > > +		if (!hwp_boost && pstate_funcs.update_util !=
> > > > > +				  intel_pstate_update_util_hwp_
> > > > > vlp)
> > > > >  			intel_pstate_clear_update_util_hook(pol
> > > > > icy-
> > > > > > cpu);
> > > > >  		intel_pstate_hwp_set(policy->cpu);
> > > > >  	}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
  2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
                   ` (13 preceding siblings ...)
  2020-03-12  2:32 ` Patchwork
@ 2020-03-23 23:29 ` Pandruvada, Srinivas
  2020-03-24  0:23   ` Francisco Jerez
  14 siblings, 1 reply; 44+ messages in thread
From: Pandruvada, Srinivas @ 2020-03-23 23:29 UTC (permalink / raw)
  To: Brown, Len, linux-pm, currojerez, intel-gfx; +Cc: peterz, rjw

Hi Francisco,

On Tue, 2020-03-10 at 14:41 -0700, Francisco Jerez wrote:
> This is my second take on improving the energy efficiency of the
> intel_pstate driver under IO-bound conditions.  The problem and
> approach to solve it are roughly the same as in my previous series
> [1]
> at a high level:
> 
> In IO-bound scenarios (by definition) the throughput of the system
> doesn't improve with increasing CPU frequency beyond the threshold
> value at which the IO device becomes the bottleneck, however with the
> current governors (whether HWP is in use or not) the CPU frequency
> tends to oscillate with the load, often with an amplitude far into
> the
> turbo range, leading to severely reduced energy efficiency, which is
> particularly problematic when a limited TDP budget is shared among a
> number of cores running some multithreaded workload, or among a CPU
> core and an integrated GPU.
> 
> Improving the energy efficiency of the CPU improves the throughput of
> the system in such TDP-limited conditions.  See [4] for some
> preliminary benchmark results from a Razer Blade Stealth 13 Late
> 2019/LY320 laptop with an Intel ICL processor and integrated
> graphics,
> including throughput results that range up to a ~15% improvement and
> performance-per-watt results up to a ~43% improvement (estimated via
> RAPL).  Particularly the throughput results may vary substantially
> from one platform to another depending on the TDP budget and the
> balance of load between CPU and GPU.
> 

You changed the EPP to 0, intentionally or unintentionally. We know that
all energy optimizations will be disabled by this change.
This test was done on an ICL system.


Basically without your patches on top of linux-next: EPP = 0x80
$sudo rdmsr -a 0x774
80002704
80002704
80002704
80002704
80002704
80002704
80002704
80002704


After your patches

$sudo rdmsr -a 0x774
2704
2704
2704
2704
2704
2704
2704
2704
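
(For reference, IA32_HWP_REQUEST is laid out as bits 7:0 minimum
performance, 15:8 maximum performance, 23:16 desired performance and
31:24 EPP, so 0x80002704 carries the default EPP of 0x80 while 0x2704
has EPP = 0, i.e. a full performance bias.)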

I added some prints; basically you change the EPP at startup, before the
regular HWP request update path, and then update on top of it, so the
boot-up EPP is overwritten.


[    5.867476] intel_pstate_reset_vlp hwp_req cached:0
[    5.872426] intel_pstate_reset_vlp hwp_req:404
[    5.881645] intel_pstate_reset_vlp hwp_req cached:0
[    5.886634] intel_pstate_reset_vlp hwp_req:404
[    5.895819] intel_pstate_reset_vlp hwp_req cached:0
[    5.900958] intel_pstate_reset_vlp hwp_req:404
[    5.910321] intel_pstate_reset_vlp hwp_req cached:0
[    5.915406] intel_pstate_reset_vlp hwp_req:404
[    5.924623] intel_pstate_reset_vlp hwp_req cached:0
[    5.929564] intel_pstate_reset_vlp hwp_req:404
[    5.944039] intel_pstate_reset_vlp hwp_req cached:0
[    5.951672] intel_pstate_reset_vlp hwp_req:404
[    5.966157] intel_pstate_reset_vlp hwp_req cached:0
[    5.973808] intel_pstate_reset_vlp hwp_req:404
[    5.988223] intel_pstate_reset_vlp hwp_req cached:0
[    5.995823] intel_pstate_reset_vlp hwp_req:404
[    6.010062] intel_pstate: HWP enabled

Thanks,
Srinivas



> One of the main differences relative to my previous version is that
> the trade-off between energy efficiency and frequency ramp-up latency
> is now exposed to device drivers through a new PM QoS class [It would
> make sense to expose it to userspace too eventually but that's beyond
> the purpose of this series].  The new PM QoS class provides a latency
> target to CPUFREQ governors which gives them permission to filter out
> CPU frequency oscillations with a period significantly shorter than
> the specified target, whenever doing so leads to improved energy
> efficiency.
> 
> This series takes advantage of the new PM QoS class from the i915
> driver whenever the driver determines that the GPU has become a
> bottleneck for an extended period of time.  At that point it places a
> PM QoS ramp-up latency target which causes CPUFREQ to limit the CPU
> to
> a reasonably energy-efficient frequency able to at least achieve the
> required amount of work in a time window approximately equal to the
> ramp-up latency target (since any longer-term energy efficiency
> optimization would potentially violate the latency target).  This
> seems more effective than clamping the CPU frequency to a fixed value
> directly from various subsystems, since the CPU is a shared resource,
> so the frequency bound needs to consider the load and latency
> requirements of all independent workloads running on the same CPU
> core
> in order to avoid performance degradation in a multitasking, possibly
> virtualized environment.
> 
> The main limitation of this PM QoS approach is that whenever multiple
> clients request different ramp-up latency targets, only the strictest
> (lowest latency) one will apply system-wide, potentially leading to
> suboptimal energy efficiency for the less latency-sensitive clients,
> (though it won't artificially limit the CPU throughput of the most
> latency-sensitive clients as a result of the PM QoS requests placed
> by
> less latency-sensitive ones).  In order to address this limitation
> I'm
> working on a more complicated solution which integrates with the task
> scheduler in order to provide response latency control with process
> granularity (pretty much in the spirit of PELT).  One of the
> alternatives Rafael and I were discussing was to expose that through
> a
> third cgroup clamp on top of the MIN and MAX utilization clamps, but
> I'm open to any other possibilities regarding what the interface
> should look like.  Either way the current (scheduling-unaware) PM
> QoS-based interface should provide most of the benefit except in
> heavily multitasking environments.
> 
> A branch with this series in testable form can be found here [2],
> based on linux-next from a few days ago.  Another important
> difference
> with respect to my previous revision is that the present one targets
> HWP systems (though for the moment it's only enabled by default on
> ICL, even though that can be overridden through the kernel command
> line).  I have WIP code that uses the same governor in order to
> provide a similar benefit on non-HWP systems (like my previous
> revision), which can be found in this branch for reference [3] -- I'm
> planning to finish that up and send it as follow-up to this series
> assuming people are happy with the overall approach.
> 
> Thanks in advance for any review feed-back and test reports.
> 
> [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS
> limit.
> [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU
> load.
> [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters
> via debugfs.
> [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from
> pstate_funcs"
> [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller
> statistics and status calculation.
> [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target
> P-state range estimation.
> [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP
> parts.
> [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on
> ACPI FADT profile and CPUID.
> [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP
> controller status.
> [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller
> parameters via debugfs.
> 
> [1] https://marc.info/?l=linux-pm&m=152221943320908&w=2
> [2] 
> https://github.com/curro/linux/commits/intel_pstate-vlp-v2-hwp-only
> [3] https://github.com/curro/linux/commits/intel_pstate-vlp-v2
> [4] 
> http://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL.log
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
  2020-03-23 23:29 ` [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
@ 2020-03-24  0:23   ` Francisco Jerez
  2020-03-24 19:16     ` Francisco Jerez
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-24  0:23 UTC (permalink / raw)
  To: Pandruvada, Srinivas, Brown, Len, linux-pm, intel-gfx; +Cc: peterz, rjw



"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:

> Hi Francisco,
>
> On Tue, 2020-03-10 at 14:41 -0700, Francisco Jerez wrote:
>> This is my second take on improving the energy efficiency of the
>> intel_pstate driver under IO-bound conditions.  The problem and
>> approach to solve it are roughly the same as in my previous series
>> [1]
>> at a high level:
>> 
>> In IO-bound scenarios (by definition) the throughput of the system
>> doesn't improve with increasing CPU frequency beyond the threshold
>> value at which the IO device becomes the bottleneck, however with the
>> current governors (whether HWP is in use or not) the CPU frequency
>> tends to oscillate with the load, often with an amplitude far into
>> the
>> turbo range, leading to severely reduced energy efficiency, which is
>> particularly problematic when a limited TDP budget is shared among a
>> number of cores running some multithreaded workload, or among a CPU
>> core and an integrated GPU.
>> 
>> Improving the energy efficiency of the CPU improves the throughput of
>> the system in such TDP-limited conditions.  See [4] for some
>> preliminary benchmark results from a Razer Blade Stealth 13 Late
>> 2019/LY320 laptop with an Intel ICL processor and integrated
>> graphics,
>> including throughput results that range up to a ~15% improvement and
>> performance-per-watt results up to a ~43% improvement (estimated via
>> RAPL).  Particularly the throughput results may vary substantially
>> from one platform to another depending on the TDP budget and the
>> balance of load between CPU and GPU.
>> 
>
> You changed the EPP to 0 intentionally or unintentionally. We know that
> all energy optimization will be disabled with this change. 
> This test was done on an ICL system.
>

Hmm, that's bad, and fully unintentional.  It's probably a side effect
of intel_pstate_reset_vlp() running before intel_pstate_hwp_set(), which
could cause it to use an uninitialized value of hwp_req_cached (zero?).
I'll fix it in v3.  Thanks a lot for pointing this out.
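
For reference, this is the hunk from intel_pstate_reset_vlp() in patch
07/10 with the suspected failure mode spelled out in comments (sketch
only):

	/* hwp_req_cached is only initialized by intel_pstate_hwp_set(),
	 * so if intel_pstate_reset_vlp() runs before it this reads 0: */
	const uint64_t hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
				  ~(HWP_MAX_PERF(~0L) |
				    HWP_MIN_PERF(~0L) |
				    HWP_DESIRED_PERF(~0L))) |
				 HWP_MIN_PERF(p0) | HWP_MAX_PERF(p1);

	/* ...so every field not rewritten above -- in particular the
	 * EPP field in bits 31:24 -- gets written back as 0
	 * ("performance") instead of the 0x80 default: */
	wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
	cpu->hwp_req_cached = hwp_req;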

>
> Basically without your patches on top of linux-next: EPP = 0x80
> $sudo rdmsr -a 0x774
> 80002704
> 80002704
> 80002704
> 80002704
> 80002704
> 80002704
> 80002704
> 80002704
>
>
> After your patches
>
> $sudo rdmsr -a 0x774
> 2704
> 2704
> 2704
> 2704
> 2704
> 2704
> 2704
> 2704
>
> I added some prints, basically you change the EPP at startup before
> regular HWP request update path and update on top. So boot up EPP is
> overwritten.
>
>
> [    5.867476] intel_pstate_reset_vlp hwp_req cached:0
> [    5.872426] intel_pstate_reset_vlp hwp_req:404
> [    5.881645] intel_pstate_reset_vlp hwp_req cached:0
> [    5.886634] intel_pstate_reset_vlp hwp_req:404
> [    5.895819] intel_pstate_reset_vlp hwp_req cached:0
> [    5.900958] intel_pstate_reset_vlp hwp_req:404
> [    5.910321] intel_pstate_reset_vlp hwp_req cached:0
> [    5.915406] intel_pstate_reset_vlp hwp_req:404
> [    5.924623] intel_pstate_reset_vlp hwp_req cached:0
> [    5.929564] intel_pstate_reset_vlp hwp_req:404
> [    5.944039] intel_pstate_reset_vlp hwp_req cached:0
> [    5.951672] intel_pstate_reset_vlp hwp_req:404
> [    5.966157] intel_pstate_reset_vlp hwp_req cached:0
> [    5.973808] intel_pstate_reset_vlp hwp_req:404
> [    5.988223] intel_pstate_reset_vlp hwp_req cached:0
> [    5.995823] intel_pstate_reset_vlp hwp_req:404
> [    6.010062] intel_pstate: HWP enabled
>
> Thanks,
> Srinivas
>
>
>
>> One of the main differences relative to my previous version is that
>> the trade-off between energy efficiency and frequency ramp-up latency
>> is now exposed to device drivers through a new PM QoS class [It would
>> make sense to expose it to userspace too eventually but that's beyond
>> the purpose of this series].  The new PM QoS class provides a latency
>> target to CPUFREQ governors which gives them permission to filter out
>> CPU frequency oscillations with a period significantly shorter than
>> the specified target, whenever doing so leads to improved energy
>> efficiency.
>> 
>> This series takes advantage of the new PM QoS class from the i915
>> driver whenever the driver determines that the GPU has become a
>> bottleneck for an extended period of time.  At that point it places a
>> PM QoS ramp-up latency target which causes CPUFREQ to limit the CPU
>> to
>> a reasonably energy-efficient frequency able to at least achieve the
>> required amount of work in a time window approximately equal to the
>> ramp-up latency target (since any longer-term energy efficiency
>> optimization would potentially violate the latency target).  This
>> seems more effective than clamping the CPU frequency to a fixed value
>> directly from various subsystems, since the CPU is a shared resource,
>> so the frequency bound needs to consider the load and latency
>> requirements of all independent workloads running on the same CPU
>> core
>> in order to avoid performance degradation in a multitasking, possibly
>> virtualized environment.
>> 
>> The main limitation of this PM QoS approach is that whenever multiple
>> clients request different ramp-up latency targets, only the strictest
>> (lowest latency) one will apply system-wide, potentially leading to
>> suboptimal energy efficiency for the less latency-sensitive clients,
>> (though it won't artificially limit the CPU throughput of the most
>> latency-sensitive clients as a result of the PM QoS requests placed
>> by
>> less latency-sensitive ones).  In order to address this limitation
>> I'm
>> working on a more complicated solution which integrates with the task
>> scheduler in order to provide response latency control with process
>> granularity (pretty much in the spirit of PELT).  One of the
>> alternatives Rafael and I were discussing was to expose that through
>> a
>> third cgroup clamp on top of the MIN and MAX utilization clamps, but
>> I'm open to any other possibilities regarding what the interface
>> should look like.  Either way the current (scheduling-unaware) PM
>> QoS-based interface should provide most of the benefit except in
>> heavily multitasking environments.
>> 
>> A branch with this series in testable form can be found here [2],
>> based on linux-next from a few days ago.  Another important
>> difference
>> with respect to my previous revision is that the present one targets
>> HWP systems (though for the moment it's only enabled by default on
>> ICL, even though that can be overridden through the kernel command
>> line).  I have WIP code that uses the same governor in order to
>> provide a similar benefit on non-HWP systems (like my previous
>> revision), which can be found in this branch for reference [3] -- I'm
>> planning to finish that up and send it as follow-up to this series
>> assuming people are happy with the overall approach.
>> 
>> Thanks in advance for any review feed-back and test reports.
>> 
>> [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS
>> limit.
>> [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU
>> load.
>> [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters
>> via debugfs.
>> [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from
>> pstate_funcs"
>> [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller
>> statistics and status calculation.
>> [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target
>> P-state range estimation.
>> [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP
>> parts.
>> [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on
>> ACPI FADT profile and CPUID.
>> [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP
>> controller status.
>> [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller
>> parameters via debugfs.
>> 
>> [1] https://marc.info/?l=linux-pm&m=152221943320908&w=2
>> [2] 
>> https://github.com/curro/linux/commits/intel_pstate-vlp-v2-hwp-only
>> [3] https://github.com/curro/linux/commits/intel_pstate-vlp-v2
>> [4] 
>> http://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL.log
>> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
  2020-03-24  0:23   ` Francisco Jerez
@ 2020-03-24 19:16     ` Francisco Jerez
  2020-03-24 20:03       ` Pandruvada, Srinivas
  0 siblings, 1 reply; 44+ messages in thread
From: Francisco Jerez @ 2020-03-24 19:16 UTC (permalink / raw)
  To: Pandruvada, Srinivas, Brown, Len, linux-pm, intel-gfx; +Cc: peterz, rjw



Francisco Jerez <currojerez@riseup.net> writes:

> "Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:
>
>> Hi Francisco,
>>
>> On Tue, 2020-03-10 at 14:41 -0700, Francisco Jerez wrote:
>>> This is my second take on improving the energy efficiency of the
>>> intel_pstate driver under IO-bound conditions.  The problem and
>>> approach to solve it are roughly the same as in my previous series
>>> [1]
>>> at a high level:
>>> 
>>> In IO-bound scenarios (by definition) the throughput of the system
>>> doesn't improve with increasing CPU frequency beyond the threshold
>>> value at which the IO device becomes the bottleneck, however with the
>>> current governors (whether HWP is in use or not) the CPU frequency
>>> tends to oscillate with the load, often with an amplitude far into
>>> the
>>> turbo range, leading to severely reduced energy efficiency, which is
>>> particularly problematic when a limited TDP budget is shared among a
>>> number of cores running some multithreaded workload, or among a CPU
>>> core and an integrated GPU.
>>> 
>>> Improving the energy efficiency of the CPU improves the throughput of
>>> the system in such TDP-limited conditions.  See [4] for some
>>> preliminary benchmark results from a Razer Blade Stealth 13 Late
>>> 2019/LY320 laptop with an Intel ICL processor and integrated
>>> graphics,
>>> including throughput results that range up to a ~15% improvement and
>>> performance-per-watt results up to a ~43% improvement (estimated via
>>> RAPL).  Particularly the throughput results may vary substantially
>>> from one platform to another depending on the TDP budget and the
>>> balance of load between CPU and GPU.
>>> 
>>
>> You changed the EPP to 0 intentionally or unintentionally. We know that
>> all energy optimization will be disabled with this change. 
>> This test was done on an ICL system.
>>
>
> Hmm, that's bad, and fully unintentional.  It's probably a side effect
> of intel_pstate_reset_vlp() running before intel_pstate_hwp_set(), which
> could cause it to use an uninitialized value of hwp_req_cached (zero?).
> I'll fix it in v3.  Thanks a lot for pointing this out.
>

Sigh.  That means that the performance results I got were inadvertently
obtained while using an EPP setting of "performance" (!).  That's
unlikely to be the case in most systems but still kind of meaningful.
Need to get updated performance numbers with EPP=0x80 -- the larger
energy-efficiency improvements of up to ~40% still seem to be visible
regardless, but the throughput benefit is likely to be lower than with
EPP=0.

>>
>> Basically without your patches on top of linux-next: EPP = 0x80
>> $sudo rdmsr -a 0x774
>> 80002704
>> 80002704
>> 80002704
>> 80002704
>> 80002704
>> 80002704
>> 80002704
>> 80002704
>>
>>
>> After your patches
>>
>> $sudo rdmsr -a 0x774
>> 2704
>> 2704
>> 2704
>> 2704
>> 2704
>> 2704
>> 2704
>> 2704
>>
>> I added some prints, basically you change the EPP at startup before
>> regular HWP request update path and update on top. So boot up EPP is
>> overwritten.
>>
>>
>> [    5.867476] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.872426] intel_pstate_reset_vlp hwp_req:404
>> [    5.881645] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.886634] intel_pstate_reset_vlp hwp_req:404
>> [    5.895819] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.900958] intel_pstate_reset_vlp hwp_req:404
>> [    5.910321] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.915406] intel_pstate_reset_vlp hwp_req:404
>> [    5.924623] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.929564] intel_pstate_reset_vlp hwp_req:404
>> [    5.944039] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.951672] intel_pstate_reset_vlp hwp_req:404
>> [    5.966157] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.973808] intel_pstate_reset_vlp hwp_req:404
>> [    5.988223] intel_pstate_reset_vlp hwp_req cached:0
>> [    5.995823] intel_pstate_reset_vlp hwp_req:404
>> [    6.010062] intel_pstate: HWP enabled
>>
>> Thanks,
>> Srinivas
>>
>>
>>
>>> One of the main differences relative to my previous version is that
>>> the trade-off between energy efficiency and frequency ramp-up latency
>>> is now exposed to device drivers through a new PM QoS class [It would
>>> make sense to expose it to userspace too eventually but that's beyond
>>> the purpose of this series].  The new PM QoS class provides a latency
>>> target to CPUFREQ governors which gives them permission to filter out
>>> CPU frequency oscillations with a period significantly shorter than
>>> the specified target, whenever doing so leads to improved energy
>>> efficiency.
>>> 
>>> This series takes advantage of the new PM QoS class from the i915
>>> driver whenever the driver determines that the GPU has become a
>>> bottleneck for an extended period of time.  At that point it places a
>>> PM QoS ramp-up latency target which causes CPUFREQ to limit the CPU
>>> to
>>> a reasonably energy-efficient frequency able to at least achieve the
>>> required amount of work in a time window approximately equal to the
>>> ramp-up latency target (since any longer-term energy efficiency
>>> optimization would potentially violate the latency target).  This
>>> seems more effective than clamping the CPU frequency to a fixed value
>>> directly from various subsystems, since the CPU is a shared resource,
>>> so the frequency bound needs to consider the load and latency
>>> requirements of all independent workloads running on the same CPU
>>> core
>>> in order to avoid performance degradation in a multitasking, possibly
>>> virtualized environment.
>>> 
>>> The main limitation of this PM QoS approach is that whenever multiple
>>> clients request different ramp-up latency targets, only the strictest
>>> (lowest latency) one will apply system-wide, potentially leading to
>>> suboptimal energy efficiency for the less latency-sensitive clients,
>>> (though it won't artificially limit the CPU throughput of the most
>>> latency-sensitive clients as a result of the PM QoS requests placed
>>> by
>>> less latency-sensitive ones).  In order to address this limitation
>>> I'm
>>> working on a more complicated solution which integrates with the task
>>> scheduler in order to provide response latency control with process
>>> granularity (pretty much in the spirit of PELT).  One of the
>>> alternatives Rafael and I were discussing was to expose that through
>>> a
>>> third cgroup clamp on top of the MIN and MAX utilization clamps, but
>>> I'm open to any other possibilities regarding what the interface
>>> should look like.  Either way the current (scheduling-unaware) PM
>>> QoS-based interface should provide most of the benefit except in
>>> heavily multitasking environments.
>>> 
>>> A branch with this series in testable form can be found here [2],
>>> based on linux-next from a few days ago.  Another important
>>> difference
>>> with respect to my previous revision is that the present one targets
>>> HWP systems (though for the moment it's only enabled by default on
>>> ICL, even though that can be overridden through the kernel command
>>> line).  I have WIP code that uses the same governor in order to
>>> provide a similar benefit on non-HWP systems (like my previous
>>> revision), which can be found in this branch for reference [3] -- I'm
>>> planning to finish that up and send it as follow-up to this series
>>> assuming people are happy with the overall approach.
>>> 
>>> Thanks in advance for any review feed-back and test reports.
>>> 
>>> [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS
>>> limit.
>>> [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU
>>> load.
>>> [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters
>>> via debugfs.
>>> [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from
>>> pstate_funcs"
>>> [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller
>>> statistics and status calculation.
>>> [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target
>>> P-state range estimation.
>>> [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP
>>> parts.
>>> [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on
>>> ACPI FADT profile and CPUID.
>>> [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP
>>> controller status.
>>> [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller
>>> parameters via debugfs.
>>> 
>>> [1] https://marc.info/?l=linux-pm&m=152221943320908&w=2
>>> [2] 
>>> https://github.com/curro/linux/commits/intel_pstate-vlp-v2-hwp-only
>>> [3] https://github.com/curro/linux/commits/intel_pstate-vlp-v2
>>> [4] 
>>> http://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL.log
>>> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).
  2020-03-24 19:16     ` Francisco Jerez
@ 2020-03-24 20:03       ` Pandruvada, Srinivas
  0 siblings, 0 replies; 44+ messages in thread
From: Pandruvada, Srinivas @ 2020-03-24 20:03 UTC (permalink / raw)
  To: Brown, Len, linux-pm, currojerez, intel-gfx; +Cc: peterz, rjw

On Tue, 2020-03-24 at 12:16 -0700, Francisco Jerez wrote:
> Francisco Jerez <currojerez@riseup.net> writes:
> 
> > "Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:
> > 
> > > Hi Francisco,
> > > 
> > > On Tue, 2020-03-10 at 14:41 -0700, Francisco Jerez wrote:
> > > > This is my second take on improving the energy efficiency of
> > > > the
> > > > intel_pstate driver under IO-bound conditions.  The problem and
> > > > approach to solve it are roughly the same as in my previous
> > > > series
> > > > [1]
> > > > at a high level:
> > > > 
> > > > In IO-bound scenarios (by definition) the throughput of the
> > > > system
> > > > doesn't improve with increasing CPU frequency beyond the
> > > > threshold
> > > > value at which the IO device becomes the bottleneck, however
> > > > with the
> > > > current governors (whether HWP is in use or not) the CPU
> > > > frequency
> > > > tends to oscillate with the load, often with an amplitude far
> > > > into
> > > > the
> > > > turbo range, leading to severely reduced energy efficiency,
> > > > which is
> > > > particularly problematic when a limited TDP budget is shared
> > > > among a
> > > > number of cores running some multithreaded workload, or among a
> > > > CPU
> > > > core and an integrated GPU.
> > > > 
> > > > Improving the energy efficiency of the CPU improves the
> > > > throughput of
> > > > the system in such TDP-limited conditions.  See [4] for some
> > > > preliminary benchmark results from a Razer Blade Stealth 13
> > > > Late
> > > > 2019/LY320 laptop with an Intel ICL processor and integrated
> > > > graphics,
> > > > including throughput results that range up to a ~15%
> > > > improvement and
> > > > performance-per-watt results up to a ~43% improvement
> > > > (estimated via
> > > > RAPL).  Particularly the throughput results may vary
> > > > substantially
> > > > from one platform to another depending on the TDP budget and
> > > > the
> > > > balance of load between CPU and GPU.
> > > > 
> > > 
> > > You changed the EPP to 0 intentionally or unintentionally. We
> > > know that
> > > all energy optimization will be disabled with this change. 
> > > This test was done on an ICL system.
> > > 
> > 
> > Hmm, that's bad, and fully unintentional.  It's probably a side
> > effect
> > of intel_pstate_reset_vlp() running before intel_pstate_hwp_set(),
> > which
> > could cause it to use an uninitialized value of hwp_req_cached
> > (zero?).
> > I'll fix it in v3.  Thanks a lot for pointing this out.
> > 
> 
> Sigh.  That means that the performance results I got were
> inadvertently
> obtained while using an EPP setting of "performance" (!).  That's
> unlikely to be the case in most systems but still kind of meaningful.
We know that "performance" mode is not great for workloads which
depend on some power sharing.

Thanks,
Srinivas 

> Need to get updated performance numbers with EPP=0x80 -- The larger
> up
> to ~40% energy efficiency improvements still seem to be visible
> regardless, but the throughput benefit is likely to be lower than
> with
> EPP=0.
> 
> > > Basically without your patches on top of linux-next: EPP = 0x80
> > > $sudo rdmsr -a 0x774
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 
> > > 
> > > After your patches
> > > 
> > > $sudo rdmsr -a 0x774
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 
> > > I added some prints, basically you change the EPP at startup
> > > before
> > > regular HWP request update path and update on top. So boot up EPP
> > > is
> > > overwritten.
> > > 
> > > 
> > > [    5.867476] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.872426] intel_pstate_reset_vlp hwp_req:404
> > > [    5.881645] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.886634] intel_pstate_reset_vlp hwp_req:404
> > > [    5.895819] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.900958] intel_pstate_reset_vlp hwp_req:404
> > > [    5.910321] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.915406] intel_pstate_reset_vlp hwp_req:404
> > > [    5.924623] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.929564] intel_pstate_reset_vlp hwp_req:404
> > > [    5.944039] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.951672] intel_pstate_reset_vlp hwp_req:404
> > > [    5.966157] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.973808] intel_pstate_reset_vlp hwp_req:404
> > > [    5.988223] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.995823] intel_pstate_reset_vlp hwp_req:404
> > > [    6.010062] intel_pstate: HWP enabled
> > > 
> > > Thanks,
> > > Srinivas
> > > 
> > > 
> > > 
> > > > One of the main differences relative to my previous version is
> > > > that
> > > > the trade-off between energy efficiency and frequency ramp-up
> > > > latency
> > > > is now exposed to device drivers through a new PM QoS class [It
> > > > would
> > > > make sense to expose it to userspace too eventually but that's
> > > > beyond
> > > > the purpose of this series].  The new PM QoS class provides a
> > > > latency
> > > > target to CPUFREQ governors which gives them permission to
> > > > filter out
> > > > CPU frequency oscillations with a period significantly shorter
> > > > than
> > > > the specified target, whenever doing so leads to improved
> > > > energy
> > > > efficiency.
> > > > 
> > > > This series takes advantage of the new PM QoS class from the
> > > > i915
> > > > driver whenever the driver determines that the GPU has become a
> > > > bottleneck for an extended period of time.  At that point it
> > > > places a
> > > > PM QoS ramp-up latency target which causes CPUFREQ to limit the
> > > > CPU
> > > > to
> > > > a reasonably energy-efficient frequency able to at least
> > > > achieve the
> > > > required amount of work in a time window approximately equal to
> > > > the
> > > > ramp-up latency target (since any longer-term energy efficiency
> > > > optimization would potentially violate the latency
> > > > target).  This
> > > > seems more effective than clamping the CPU frequency to a fixed
> > > > value
> > > > directly from various subsystems, since the CPU is a shared
> > > > resource,
> > > > so the frequency bound needs to consider the load and latency
> > > > requirements of all independent workloads running on the same
> > > > CPU
> > > > core
> > > > in order to avoid performance degradation in a multitasking,
> > > > possibly
> > > > virtualized environment.
> > > > 
> > > > The main limitation of this PM QoS approach is that whenever
> > > > multiple
> > > > clients request different ramp-up latency targets, only the
> > > > strictest
> > > > (lowest latency) one will apply system-wide, potentially
> > > > leading to
> > > > suboptimal energy efficiency for the less latency-sensitive
> > > > clients,
> > > > (though it won't artificially limit the CPU throughput of the
> > > > most
> > > > latency-sensitive clients as a result of the PM QoS requests
> > > > placed
> > > > by
> > > > less latency-sensitive ones).  In order to address this
> > > > limitation
> > > > I'm
> > > > working on a more complicated solution which integrates with
> > > > the task
> > > > scheduler in order to provide response latency control with
> > > > process
> > > > granularity (pretty much in the spirit of PELT).  One of the
> > > > alternatives Rafael and I were discussing was to expose that
> > > > through
> > > > a
> > > > third cgroup clamp on top of the MIN and MAX utilization
> > > > clamps, but
> > > > I'm open to any other possibilities regarding what the
> > > > interface
> > > > should look like.  Either way the current (scheduling-unaware)
> > > > PM
> > > > QoS-based interface should provide most of the benefit except
> > > > in
> > > > heavily multitasking environments.
> > > > 
> > > > A branch with this series in testable form can be found here
> > > > [2],
> > > > based on linux-next from a few days ago.  Another important
> > > > difference
> > > > with respect to my previous revision is that the present one
> > > > targets
> > > > HWP systems (though for the moment it's only enabled by default
> > > > on
> > > > ICL, even though that can be overridden through the kernel
> > > > command
> > > > line).  I have WIP code that uses the same governor in order to
> > > > provide a similar benefit on non-HWP systems (like my previous
> > > > revision), which can be found in this branch for reference [3]
> > > > -- I'm
> > > > planning to finish that up and send it as follow-up to this
> > > > series
> > > > assuming people are happy with the overall approach.
> > > > 
> > > > Thanks in advance for any review feed-back and test reports.
> > > > 
> > > > [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS
> > > > limit.
> > > > [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based
> > > > on GPU
> > > > load.
> > > > [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control
> > > > parameters
> > > > via debugfs.
> > > > [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util
> > > > from
> > > > pstate_funcs"
> > > > [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller
> > > > statistics and status calculation.
> > > > [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller
> > > > target
> > > > P-state range estimation.
> > > > [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller
> > > > for HWP
> > > > parts.
> > > > [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller
> > > > based on
> > > > ACPI FADT profile and CPUID.
> > > > [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of
> > > > VLP
> > > > controller status.
> > > > [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP
> > > > controller
> > > > parameters via debugfs.
> > > > 
> > > > [1] https://marc.info/?l=linux-pm&m=152221943320908&w=2
> > > > [2] 
> > > > https://github.com/curro/linux/commits/intel_pstate-vlp-v2-hwp-only
> > > > [3] https://github.com/curro/linux/commits/intel_pstate-vlp-v2
> > > > [4] 
> > > > http://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL.log
> > > > 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


end of thread, other threads:[~2020-03-24 20:03 UTC | newest]

Thread overview: 44+ messages
2020-03-10 21:41 [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
2020-03-10 21:41 ` [Intel-gfx] [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit Francisco Jerez
2020-03-11 12:42   ` Peter Zijlstra
2020-03-11 19:23     ` Francisco Jerez
2020-03-11 19:23       ` [Intel-gfx] [PATCHv2 " Francisco Jerez
2020-03-19 10:25         ` Rafael J. Wysocki
2020-03-10 21:41 ` [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load Francisco Jerez
2020-03-10 22:26   ` Chris Wilson
2020-03-11  0:34     ` Francisco Jerez
2020-03-18 19:42       ` Francisco Jerez
2020-03-20  2:46         ` Francisco Jerez
2020-03-20 10:06           ` Chris Wilson
2020-03-11 10:00     ` Tvrtko Ursulin
2020-03-11 10:21       ` Chris Wilson
2020-03-11 19:54       ` Francisco Jerez
2020-03-12 11:52         ` Tvrtko Ursulin
2020-03-13  7:39           ` Francisco Jerez
2020-03-16 20:54             ` Francisco Jerez
2020-03-10 21:41 ` [Intel-gfx] [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs Francisco Jerez
2020-03-10 21:41 ` [Intel-gfx] [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
2020-03-19 10:45   ` Rafael J. Wysocki
2020-03-10 21:41 ` [Intel-gfx] [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation Francisco Jerez
2020-03-19 11:06   ` Rafael J. Wysocki
2020-03-10 21:41 ` [Intel-gfx] [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation Francisco Jerez
2020-03-19 11:12   ` Rafael J. Wysocki
2020-03-10 21:42 ` [Intel-gfx] [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts Francisco Jerez
2020-03-17 23:59   ` Pandruvada, Srinivas
2020-03-18 19:51     ` Francisco Jerez
2020-03-18 20:10       ` Pandruvada, Srinivas
2020-03-18 20:22         ` Francisco Jerez
2020-03-23 20:13           ` Pandruvada, Srinivas
2020-03-10 21:42 ` [Intel-gfx] [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID Francisco Jerez
2020-03-19 11:20   ` Rafael J. Wysocki
2020-03-10 21:42 ` [Intel-gfx] [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status Francisco Jerez
2020-03-10 21:42 ` [Intel-gfx] [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs Francisco Jerez
2020-03-11  2:35 ` [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
2020-03-11  3:55   ` Francisco Jerez
2020-03-11  4:25 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for " Patchwork
2020-03-12  2:31 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for GPU-bound energy efficiency improvements for the intel_pstate driver (v2). (rev2) Patchwork
2020-03-12  2:32 ` Patchwork
2020-03-23 23:29 ` [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
2020-03-24  0:23   ` Francisco Jerez
2020-03-24 19:16     ` Francisco Jerez
2020-03-24 20:03       ` Pandruvada, Srinivas
