All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-04-28  3:22 ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

This addresses the technical concerns people brought up about my
previous v2 revision of this series.  Other than a few bug fixes, the
only major change relative to v2 is that the controller is now exposed
as a new CPUFREQ generic governor as requested by Rafael (named
"adaptive" in this RFC though other naming suggestions are welcome).
Main reason for calling this v2.99 rather than v3 is that I haven't
yet addressed all the documentation requests from the v2 thread --
Will spend some time doing that as soon as I have an ACK (ideally from
Rafael) that things are moving in the right direction.

You can also find this series along with the WIP code for non-HWP
platforms in this branch:

https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99

Thanks!

[PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
[PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
[PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
[PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
[PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
[PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
[PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
[PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
[PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
[PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
[PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-04-28  3:22 ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

This addresses the technical concerns people brought up about my
previous v2 revision of this series.  Other than a few bug fixes, the
only major change relative to v2 is that the controller is now exposed
as a new CPUFREQ generic governor as requested by Rafael (named
"adaptive" in this RFC though other naming suggestions are welcome).
Main reason for calling this v2.99 rather than v3 is that I haven't
yet addressed all the documentation requests from the v2 thread --
Will spend some time doing that as soon as I have an ACK (ideally from
Rafael) that things are moving in the right direction.

You can also find this series along with the WIP code for non-HWP
platforms in this branch:

https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99

Thanks!

[PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
[PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
[PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
[PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
[PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
[PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
[PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
[PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
[PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
[PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
[PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

The purpose of this PM QoS limit is to give device drivers additional
control over the latency/energy efficiency trade-off made by the PM
subsystem (particularly the CPUFREQ governor).  It allows device
drivers to set a lower bound on the response latency of PM (defined as
the time it takes from wake-up to the CPU reaching a certain
steady-state level of performance [e.g. the nominal frequency] in
response to a step-function load).  It reports to PM the minimum
ramp-up latency considered of use to the application, and explicitly
requests PM to filter out oscillations faster than the specified
frequency.  It is somewhat complementary to the current
CPU_DMA_LATENCY PM QoS class which can be understood as specifying an
upper latency bound on the CPU wake-up time, instead of a lower bound
on the CPU frequency ramp-up time.

Note that even though this provides a latency constraint it's
represented as its reciprocal in Hz units for computational efficiency
(since it would take a 64-bit division to compute the number of cycles
elapsed from a time increment in nanoseconds and a time bound, while a
frequency can simply be multiplied with the time increment).

This implements a MAX constraint so that the strictest (highest
response frequency) request is honored.  This means that PM won't
provide any guarantee that frequencies greater than the specified
bound will be filtered, since that might be incompatible with the
constraints specified by another more latency-sensitive application (A
more fine-grained result could be achieved with a scheduling-based
interface).  The default value needs to be equal to zero (best effort)
for it to behave as identity of the MAX operation.

v2: Drop wake_up_all_idle_cpus() call from
    cpu_response_frequency_qos_apply() (Peter).
v3: Rename CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 include/linux/pm_qos.h       |   9 +++
 include/trace/events/power.h |  33 +++++----
 kernel/power/qos.c           | 131 ++++++++++++++++++++++++++++++++++-
 3 files changed, 155 insertions(+), 18 deletions(-)

diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 4a69d4af3ff8..9c1561ebbdc7 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -28,6 +28,7 @@ enum pm_qos_flags_status {
 #define PM_QOS_LATENCY_ANY_NS	((s64)PM_QOS_LATENCY_ANY * NSEC_PER_USEC)
 
 #define PM_QOS_CPU_LATENCY_DEFAULT_VALUE	(2000 * USEC_PER_SEC)
+#define PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE 0
 #define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS	PM_QOS_LATENCY_ANY_NS
@@ -162,6 +163,14 @@ static inline void cpu_latency_qos_update_request(struct pm_qos_request *req,
 static inline void cpu_latency_qos_remove_request(struct pm_qos_request *req) {}
 #endif
 
+s32 cpu_scaling_response_qos_limit(void);
+bool cpu_scaling_response_qos_request_active(struct pm_qos_request *req);
+void cpu_scaling_response_qos_add_request(struct pm_qos_request *req,
+					  s32 value);
+void cpu_scaling_response_qos_update_request(struct pm_qos_request *req,
+					     s32 new_value);
+void cpu_scaling_response_qos_remove_request(struct pm_qos_request *req);
+
 #ifdef CONFIG_PM
 enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask);
 enum pm_qos_flags_status dev_pm_qos_flags(struct device *dev, s32 mask);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index af5018aa9517..9065a3b084c4 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -359,45 +359,48 @@ DEFINE_EVENT(power_domain, power_domain_target,
 );
 
 /*
- * CPU latency QoS events used for global CPU latency QoS list updates
+ * CPU latency/scaling response frequency QoS events used for global
+ * CPU PM QoS list updates.
  */
-DECLARE_EVENT_CLASS(cpu_latency_qos_request,
+DECLARE_EVENT_CLASS(pm_qos_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value),
+	TP_ARGS(name, value),
 
 	TP_STRUCT__entry(
+		__string(name,			 name		)
 		__field( s32,                    value          )
 	),
 
 	TP_fast_assign(
+		__assign_str(name, name);
 		__entry->value = value;
 	),
 
-	TP_printk("CPU_DMA_LATENCY value=%d",
-		  __entry->value)
+	TP_printk("pm_qos_class=%s value=%d",
+		  __get_str(name), __entry->value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_add_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_add_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_update_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_update_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_remove_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_remove_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
 /*
diff --git a/kernel/power/qos.c b/kernel/power/qos.c
index db0bed2cae26..d080e71f4cd6 100644
--- a/kernel/power/qos.c
+++ b/kernel/power/qos.c
@@ -271,7 +271,7 @@ void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value)
 		return;
 	}
 
-	trace_pm_qos_add_request(value);
+	trace_pm_qos_add_request("CPU_DMA_LATENCY", value);
 
 	req->qos = &cpu_latency_constraints;
 	cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value);
@@ -297,7 +297,7 @@ void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value)
 		return;
 	}
 
-	trace_pm_qos_update_request(new_value);
+	trace_pm_qos_update_request("CPU_DMA_LATENCY", new_value);
 
 	if (new_value == req->node.prio)
 		return;
@@ -323,7 +323,7 @@ void cpu_latency_qos_remove_request(struct pm_qos_request *req)
 		return;
 	}
 
-	trace_pm_qos_remove_request(PM_QOS_DEFAULT_VALUE);
+	trace_pm_qos_remove_request("CPU_DMA_LATENCY", PM_QOS_DEFAULT_VALUE);
 
 	cpu_latency_qos_apply(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
 	memset(req, 0, sizeof(*req));
@@ -424,6 +424,131 @@ static int __init cpu_latency_qos_init(void)
 late_initcall(cpu_latency_qos_init);
 #endif /* CONFIG_CPU_IDLE */
 
+/* Definitions related to the CPU scaling response frequency QoS. */
+
+static struct pm_qos_constraints cpu_scaling_response_constraints = {
+	.list = PLIST_HEAD_INIT(cpu_scaling_response_constraints.list),
+	.target_value = PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE,
+	.default_value = PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE,
+	.no_constraint_value = PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE,
+	.type = PM_QOS_MAX,
+};
+
+/**
+ * cpu_scaling_response_qos_limit - Return current system-wide CPU
+ *				    scaling response frequency QoS limit.
+ */
+s32 cpu_scaling_response_qos_limit(void)
+{
+	return pm_qos_read_value(&cpu_scaling_response_constraints);
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_limit);
+
+/**
+ * cpu_scaling_response_qos_request_active - Check the given PM QoS request.
+ * @req: PM QoS request to check.
+ *
+ * Return: 'true' if @req has been added to the CPU response frequency
+ * QoS list, 'false' otherwise.
+ */
+bool cpu_scaling_response_qos_request_active(struct pm_qos_request *req)
+{
+	return req->qos == &cpu_scaling_response_constraints;
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_request_active);
+
+/**
+ * cpu_scaling_response_qos_add_request - Add new CPU scaling response
+ *					  frequency QoS request.
+ * @req: Pointer to a preallocated handle.
+ * @value: Requested constraint value.
+ *
+ * Use @value to initialize the request handle pointed to by @req,
+ * insert it as a new entry to the CPU scaling response frequency QoS
+ * list and recompute the effective QoS constraint for that list.
+ *
+ * Callers need to save the handle for later use in updates and removal of the
+ * QoS request represented by it.
+ */
+void cpu_scaling_response_qos_add_request(struct pm_qos_request *req,
+					  s32 value)
+{
+	if (!req)
+		return;
+
+	if (cpu_scaling_response_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for already added request\n",
+		     __func__);
+		return;
+	}
+
+	trace_pm_qos_add_request("CPU_SCALING_RESPONSE", value);
+
+	req->qos = &cpu_scaling_response_constraints;
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_ADD_REQ, value);
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_add_request);
+
+/**
+ * cpu_scaling_response_qos_update_request - Modify existing CPU scaling
+ *					     response frequency QoS request.
+ * @req : QoS request to update.
+ * @new_value: New requested constraint value.
+ *
+ * Use @new_value to update the QoS request represented by @req in the
+ * CPU scaling response frequency QoS list along with updating the
+ * effective constraint value for that list.
+ */
+void cpu_scaling_response_qos_update_request(struct pm_qos_request *req,
+					     s32 new_value)
+{
+	if (!req)
+		return;
+
+	if (!cpu_scaling_response_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_update_request("CPU_SCALING_RESPONSE", new_value);
+
+	if (new_value == req->node.prio)
+		return;
+
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_UPDATE_REQ,
+			     new_value);
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_update_request);
+
+/**
+ * cpu_scaling_response_qos_remove_request - Remove existing CPU
+ *					     scaling response
+ *					     frequency QoS request.
+ * @req: QoS request to remove.
+ *
+ * Remove the CPU scaling response frequency QoS request represented
+ * by @req from the CPU scaling response frequency QoS list along with
+ * updating the effective constraint value for that list.
+ */
+void cpu_scaling_response_qos_remove_request(struct pm_qos_request *req)
+{
+	if (!req)
+		return;
+
+	if (!cpu_scaling_response_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_remove_request("CPU_SCALING_RESPONSE",
+				    PM_QOS_DEFAULT_VALUE);
+
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_REMOVE_REQ,
+			     PM_QOS_DEFAULT_VALUE);
+	memset(req, 0, sizeof(*req));
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_remove_request);
+
 /* Definitions related to the frequency QoS below. */
 
 /**
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

The purpose of this PM QoS limit is to give device drivers additional
control over the latency/energy efficiency trade-off made by the PM
subsystem (particularly the CPUFREQ governor).  It allows device
drivers to set a lower bound on the response latency of PM (defined as
the time it takes from wake-up to the CPU reaching a certain
steady-state level of performance [e.g. the nominal frequency] in
response to a step-function load).  It reports to PM the minimum
ramp-up latency considered of use to the application, and explicitly
requests PM to filter out oscillations faster than the specified
frequency.  It is somewhat complementary to the current
CPU_DMA_LATENCY PM QoS class which can be understood as specifying an
upper latency bound on the CPU wake-up time, instead of a lower bound
on the CPU frequency ramp-up time.

Note that even though this provides a latency constraint it's
represented as its reciprocal in Hz units for computational efficiency
(since it would take a 64-bit division to compute the number of cycles
elapsed from a time increment in nanoseconds and a time bound, while a
frequency can simply be multiplied with the time increment).

This implements a MAX constraint so that the strictest (highest
response frequency) request is honored.  This means that PM won't
provide any guarantee that frequencies greater than the specified
bound will be filtered, since that might be incompatible with the
constraints specified by another more latency-sensitive application (A
more fine-grained result could be achieved with a scheduling-based
interface).  The default value needs to be equal to zero (best effort)
for it to behave as identity of the MAX operation.

v2: Drop wake_up_all_idle_cpus() call from
    cpu_response_frequency_qos_apply() (Peter).
v3: Rename CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 include/linux/pm_qos.h       |   9 +++
 include/trace/events/power.h |  33 +++++----
 kernel/power/qos.c           | 131 ++++++++++++++++++++++++++++++++++-
 3 files changed, 155 insertions(+), 18 deletions(-)

diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 4a69d4af3ff8..9c1561ebbdc7 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -28,6 +28,7 @@ enum pm_qos_flags_status {
 #define PM_QOS_LATENCY_ANY_NS	((s64)PM_QOS_LATENCY_ANY * NSEC_PER_USEC)
 
 #define PM_QOS_CPU_LATENCY_DEFAULT_VALUE	(2000 * USEC_PER_SEC)
+#define PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE 0
 #define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT	PM_QOS_LATENCY_ANY
 #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS	PM_QOS_LATENCY_ANY_NS
@@ -162,6 +163,14 @@ static inline void cpu_latency_qos_update_request(struct pm_qos_request *req,
 static inline void cpu_latency_qos_remove_request(struct pm_qos_request *req) {}
 #endif
 
+s32 cpu_scaling_response_qos_limit(void);
+bool cpu_scaling_response_qos_request_active(struct pm_qos_request *req);
+void cpu_scaling_response_qos_add_request(struct pm_qos_request *req,
+					  s32 value);
+void cpu_scaling_response_qos_update_request(struct pm_qos_request *req,
+					     s32 new_value);
+void cpu_scaling_response_qos_remove_request(struct pm_qos_request *req);
+
 #ifdef CONFIG_PM
 enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask);
 enum pm_qos_flags_status dev_pm_qos_flags(struct device *dev, s32 mask);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index af5018aa9517..9065a3b084c4 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -359,45 +359,48 @@ DEFINE_EVENT(power_domain, power_domain_target,
 );
 
 /*
- * CPU latency QoS events used for global CPU latency QoS list updates
+ * CPU latency/scaling response frequency QoS events used for global
+ * CPU PM QoS list updates.
  */
-DECLARE_EVENT_CLASS(cpu_latency_qos_request,
+DECLARE_EVENT_CLASS(pm_qos_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value),
+	TP_ARGS(name, value),
 
 	TP_STRUCT__entry(
+		__string(name,			 name		)
 		__field( s32,                    value          )
 	),
 
 	TP_fast_assign(
+		__assign_str(name, name);
 		__entry->value = value;
 	),
 
-	TP_printk("CPU_DMA_LATENCY value=%d",
-		  __entry->value)
+	TP_printk("pm_qos_class=%s value=%d",
+		  __get_str(name), __entry->value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_add_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_add_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_update_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_update_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
-DEFINE_EVENT(cpu_latency_qos_request, pm_qos_remove_request,
+DEFINE_EVENT(pm_qos_request, pm_qos_remove_request,
 
-	TP_PROTO(s32 value),
+	TP_PROTO(const char *name, s32 value),
 
-	TP_ARGS(value)
+	TP_ARGS(name, value)
 );
 
 /*
diff --git a/kernel/power/qos.c b/kernel/power/qos.c
index db0bed2cae26..d080e71f4cd6 100644
--- a/kernel/power/qos.c
+++ b/kernel/power/qos.c
@@ -271,7 +271,7 @@ void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value)
 		return;
 	}
 
-	trace_pm_qos_add_request(value);
+	trace_pm_qos_add_request("CPU_DMA_LATENCY", value);
 
 	req->qos = &cpu_latency_constraints;
 	cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value);
@@ -297,7 +297,7 @@ void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value)
 		return;
 	}
 
-	trace_pm_qos_update_request(new_value);
+	trace_pm_qos_update_request("CPU_DMA_LATENCY", new_value);
 
 	if (new_value == req->node.prio)
 		return;
@@ -323,7 +323,7 @@ void cpu_latency_qos_remove_request(struct pm_qos_request *req)
 		return;
 	}
 
-	trace_pm_qos_remove_request(PM_QOS_DEFAULT_VALUE);
+	trace_pm_qos_remove_request("CPU_DMA_LATENCY", PM_QOS_DEFAULT_VALUE);
 
 	cpu_latency_qos_apply(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
 	memset(req, 0, sizeof(*req));
@@ -424,6 +424,131 @@ static int __init cpu_latency_qos_init(void)
 late_initcall(cpu_latency_qos_init);
 #endif /* CONFIG_CPU_IDLE */
 
+/* Definitions related to the CPU scaling response frequency QoS. */
+
+static struct pm_qos_constraints cpu_scaling_response_constraints = {
+	.list = PLIST_HEAD_INIT(cpu_scaling_response_constraints.list),
+	.target_value = PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE,
+	.default_value = PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE,
+	.no_constraint_value = PM_QOS_CPU_SCALING_RESPONSE_DEFAULT_VALUE,
+	.type = PM_QOS_MAX,
+};
+
+/**
+ * cpu_scaling_response_qos_limit - Return current system-wide CPU
+ *				    scaling response frequency QoS limit.
+ */
+s32 cpu_scaling_response_qos_limit(void)
+{
+	return pm_qos_read_value(&cpu_scaling_response_constraints);
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_limit);
+
+/**
+ * cpu_scaling_response_qos_request_active - Check the given PM QoS request.
+ * @req: PM QoS request to check.
+ *
+ * Return: 'true' if @req has been added to the CPU response frequency
+ * QoS list, 'false' otherwise.
+ */
+bool cpu_scaling_response_qos_request_active(struct pm_qos_request *req)
+{
+	return req->qos == &cpu_scaling_response_constraints;
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_request_active);
+
+/**
+ * cpu_scaling_response_qos_add_request - Add new CPU scaling response
+ *					  frequency QoS request.
+ * @req: Pointer to a preallocated handle.
+ * @value: Requested constraint value.
+ *
+ * Use @value to initialize the request handle pointed to by @req,
+ * insert it as a new entry to the CPU scaling response frequency QoS
+ * list and recompute the effective QoS constraint for that list.
+ *
+ * Callers need to save the handle for later use in updates and removal of the
+ * QoS request represented by it.
+ */
+void cpu_scaling_response_qos_add_request(struct pm_qos_request *req,
+					  s32 value)
+{
+	if (!req)
+		return;
+
+	if (cpu_scaling_response_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for already added request\n",
+		     __func__);
+		return;
+	}
+
+	trace_pm_qos_add_request("CPU_SCALING_RESPONSE", value);
+
+	req->qos = &cpu_scaling_response_constraints;
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_ADD_REQ, value);
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_add_request);
+
+/**
+ * cpu_scaling_response_qos_update_request - Modify existing CPU scaling
+ *					     response frequency QoS request.
+ * @req : QoS request to update.
+ * @new_value: New requested constraint value.
+ *
+ * Use @new_value to update the QoS request represented by @req in the
+ * CPU scaling response frequency QoS list along with updating the
+ * effective constraint value for that list.
+ */
+void cpu_scaling_response_qos_update_request(struct pm_qos_request *req,
+					     s32 new_value)
+{
+	if (!req)
+		return;
+
+	if (!cpu_scaling_response_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_update_request("CPU_SCALING_RESPONSE", new_value);
+
+	if (new_value == req->node.prio)
+		return;
+
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_UPDATE_REQ,
+			     new_value);
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_update_request);
+
+/**
+ * cpu_scaling_response_qos_remove_request - Remove existing CPU
+ *					     scaling response
+ *					     frequency QoS request.
+ * @req: QoS request to remove.
+ *
+ * Remove the CPU scaling response frequency QoS request represented
+ * by @req from the CPU scaling response frequency QoS list along with
+ * updating the effective constraint value for that list.
+ */
+void cpu_scaling_response_qos_remove_request(struct pm_qos_request *req)
+{
+	if (!req)
+		return;
+
+	if (!cpu_scaling_response_qos_request_active(req)) {
+		WARN(1, KERN_ERR "%s called for unknown object\n", __func__);
+		return;
+	}
+
+	trace_pm_qos_remove_request("CPU_SCALING_RESPONSE",
+				    PM_QOS_DEFAULT_VALUE);
+
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_REMOVE_REQ,
+			     PM_QOS_DEFAULT_VALUE);
+	memset(req, 0, sizeof(*req));
+}
+EXPORT_SYMBOL_GPL(cpu_scaling_response_qos_remove_request);
+
 /* Definitions related to the frequency QoS below. */
 
 /**
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

This allows CPUFREQ governors to realize when the system becomes
non-CPU-bound due to GPU rendering activity, and cause them to respond
more conservatively to the workload by limiting their response
frequency: CPU energy usage will be reduced when there isn't a good
chance for system performance to scale with CPU frequency due to the
GPU bottleneck.  This leaves additional TDP budget available for the
GPU to reach higher frequencies, which may be translated into an
improvement in graphics performance to the extent that the workload
remains TDP-limited.  If the workload isn't (anymore) TDP-limited
performance should stay roughly constant, but energy usage will be
divided by a similar factor.

The metric used by this patch in order to determine whether the GPU is
unlikely to be a bottleneck may not be particularly obvious: We only
specify a reduced PM QoS response frequency target whenever both
execlists ELSP ports are simultaneously active, since in that case we
know that the completion of one context will lead to the immediate
execution of another, which means that the GPU can be kept busy
without the execlists submission code rushing to submit a new request,
so CPU latency shouldn't be a concern.

This might miss some workloads that could theoretically benefit from
this optimization, since some applications are unable to keep both
ELSP ports active for a significant fraction of time even though they
are GPU-bound, however using the single-ELSP utilization as metric
would neglect the CPU latency-sensitivity of the execlists submission
code, which would lead to large regressions in x11perf and
jxrendermark.  For that reason this patch takes the rather
conservative approach of restricting the optimization to workloads
that effectively utilize both ELSP ports, which indicates that command
submission latency is unlikely to be an issue.

Note that this is currently only enabled for execlists submission.  It
might be beneficial to do the same thing in combination with GuC
submission, but the metric would be slightly different since we
wouldn't need to care about multiple ELSP ports being in use.

In order to further prevent regressions the optimization is enabled
with a delay in order to avoid performance degradation of applications
that quickly switch back and forth between being GPU-bound and
CPU-bound.  A reduced PM QoS scaling response frequency target will
only be specified if the GPU has been continuously utilized for a long
enough period of time.

v3: Assorted clean-ups (Chris).  Improved documentation (Tvrtko).  Fix
    interaction with single-ELSP preemption (Chris).  Move overload
    signalling to process_csb() to reduce bias due to interrupt
    latency (Francisco).  Rename CPU_RESPONSE_FREQUENCY to
    CPU_SCALING_RESPONSE (Rafael).  Adjust heuristic parameters to
    avoid regressions from other v3 governor changes (Francisco).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c    |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  11 ++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c        | 107 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_pm.h        |   3 +
 drivers/gpu/drm/i915/gt/intel_gt_types.h     |  49 +++++++++
 drivers/gpu/drm/i915/gt/intel_lrc.c          |  17 +++
 6 files changed, 188 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index b1f8527f02c8..6b08a9ad2de1 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -517,6 +517,7 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine)
 
 	execlists->queue_priority_hint = INT_MIN;
 	execlists->queue = RB_ROOT_CACHED;
+	atomic_set(&execlists->overload, 0);
 }
 
 static void cleanup_status_page(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index bf395227c99f..9bdb3958dbb7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -275,6 +275,17 @@ struct intel_engine_execlists {
 	 */
 	u8 csb_head;
 
+	/**
+	 * @overload: whether at least two execlist ports are
+	 * currently submitted to the hardware, indicating that CPU
+	 * latency isn't critical in order to maintain the GPU busy.
+	 * We use that to trigger a more energy-efficient response
+	 * mode of CPU power management, since performance degradation
+	 * is unlikely under those conditions, and GPU throughput may
+	 * benefit from the increased TDP budget.
+	 */
+	atomic_t overload;
+
 	I915_SELFTEST_DECLARE(struct st_preempt_hang preempt_hang;)
 };
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 6bdb74892a1e..0d44ef3a07ad 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -107,6 +107,102 @@ void intel_gt_pm_init_early(struct intel_gt *gt)
 	intel_wakeref_init(&gt->wakeref, gt->uncore->rpm, &wf_ops);
 }
 
+/**
+ * Time increment until the most immediate PM QoS scaling response
+ * frequency update.
+ *
+ * May be in the future (return value > 0) if the GPU is currently
+ * active but we haven't updated the PM QoS request to reflect a
+ * bottleneck yet.  May be in the past (return value < 0) if the GPU
+ * isn't fully utilized and we've already reset the PM QoS request to
+ * the default value.  May be zero if a PM QoS request update is due.
+ *
+ * The time increment returned by this function decreases linearly
+ * with time until it reaches either zero or a configurable limit.
+ */
+static int32_t time_to_sf_qos_update_ns(struct intel_gt *gt)
+{
+	const uint64_t t1 = ktime_get_ns();
+	const uint64_t dt1 = gt->sf_qos.delay_max_ns;
+
+	if (atomic_read_acquire(&gt->sf_qos.active_count)) {
+		const uint64_t t0 = atomic64_read(&gt->sf_qos.time_set_ns);
+
+		return min(dt1, t0 <= t1 ? 0 : t0 - t1);
+	} else {
+		const uint64_t t0 = atomic64_read(&gt->sf_qos.time_clear_ns);
+		const unsigned int shift = gt->sf_qos.delay_slope_shift;
+
+		return -(int32_t)(t1 <= t0 ? 1 :
+				  min(dt1, (t1 - t0) << shift));
+	}
+}
+
+/**
+ * Perform a delayed PM QoS scaling response frequency update.
+ */
+static void intel_gt_sf_qos_update(struct intel_gt *gt)
+{
+	const uint32_t dt = max(0, time_to_sf_qos_update_ns(gt));
+
+	timer_reduce(&gt->sf_qos.timer, jiffies + nsecs_to_jiffies(dt));
+}
+
+/**
+ * Timer that fires once the delay used to switch the PM QoS scaling
+ * response frequency request has elapsed.
+ */
+static void intel_gt_sf_qos_timeout(struct timer_list *timer)
+{
+	struct intel_gt *gt = container_of(timer, struct intel_gt,
+					   sf_qos.timer);
+	const int32_t dt = time_to_sf_qos_update_ns(gt);
+
+	if (dt == 0)
+		cpu_scaling_response_qos_update_request(
+			&gt->sf_qos.req, gt->sf_qos.target_hz);
+	else
+		cpu_scaling_response_qos_update_request(
+			&gt->sf_qos.req, PM_QOS_DEFAULT_VALUE);
+
+	if (dt > 0)
+		intel_gt_sf_qos_update(gt);
+}
+
+/**
+ * Report the beginning of a period of GPU utilization to PM.
+ *
+ * May trigger a more energy-efficient response mode in CPU PM, but
+ * only after a certain delay has elapsed so we don't have a negative
+ * impact on the CPU ramp-up latency except after the GPU has been
+ * continuously utilized for a long enough period of time.
+ */
+void intel_gt_pm_active_begin(struct intel_gt *gt)
+{
+	const uint32_t dt = abs(time_to_sf_qos_update_ns(gt));
+
+	atomic64_set(&gt->sf_qos.time_set_ns, ktime_get_ns() + dt);
+
+	if (!atomic_fetch_inc_release(&gt->sf_qos.active_count))
+		intel_gt_sf_qos_update(gt);
+}
+
+/**
+ * Report the end of a period of GPU utilization to PM.
+ *
+ * Must be called once after each call to intel_gt_pm_active_begin().
+ */
+void intel_gt_pm_active_end(struct intel_gt *gt)
+{
+	const uint32_t dt = abs(time_to_sf_qos_update_ns(gt));
+	const unsigned int shift = gt->sf_qos.delay_slope_shift;
+
+	atomic64_set(&gt->sf_qos.time_clear_ns, ktime_get_ns() - (dt >> shift));
+
+	if (!atomic_dec_return_release(&gt->sf_qos.active_count))
+		intel_gt_sf_qos_update(gt);
+}
+
 void intel_gt_pm_init(struct intel_gt *gt)
 {
 	/*
@@ -116,6 +212,14 @@ void intel_gt_pm_init(struct intel_gt *gt)
 	 */
 	intel_rc6_init(&gt->rc6);
 	intel_rps_init(&gt->rps);
+
+	cpu_scaling_response_qos_add_request(&gt->sf_qos.req,
+					     PM_QOS_DEFAULT_VALUE);
+
+	gt->sf_qos.delay_max_ns = 10000000;
+	gt->sf_qos.delay_slope_shift = 1;
+	gt->sf_qos.target_hz = 2;
+	timer_setup(&gt->sf_qos.timer, intel_gt_sf_qos_timeout, 0);
 }
 
 static bool reset_engines(struct intel_gt *gt)
@@ -174,6 +278,9 @@ static void gt_sanitize(struct intel_gt *gt, bool force)
 
 void intel_gt_pm_fini(struct intel_gt *gt)
 {
+	del_timer_sync(&gt->sf_qos.timer);
+	cpu_scaling_response_qos_remove_request(&gt->sf_qos.req);
+
 	intel_rc6_fini(&gt->rc6);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 60f0e2fbe55c..43f1d45fb0db 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -58,6 +58,9 @@ int intel_gt_resume(struct intel_gt *gt);
 void intel_gt_runtime_suspend(struct intel_gt *gt);
 int intel_gt_runtime_resume(struct intel_gt *gt);
 
+void intel_gt_pm_active_begin(struct intel_gt *gt);
+void intel_gt_pm_active_end(struct intel_gt *gt);
+
 static inline bool is_mock_gt(const struct intel_gt *gt)
 {
 	return I915_SELFTEST_ONLY(gt->awake == -ENODEV);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 96890dd12b5f..8aaeb2450d05 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -10,6 +10,7 @@
 #include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
+#include <linux/pm_qos.h>
 #include <linux/spinlock.h>
 #include <linux/types.h>
 
@@ -97,6 +98,54 @@ struct intel_gt {
 	 * Reserved for exclusive use by the kernel.
 	 */
 	struct i915_address_space *vm;
+
+	/**
+	 * CPU response frequency QoS tracking.
+	 */
+	struct {
+		/** PM QoS request of this device. */
+		struct pm_qos_request req;
+
+		/** Timer used for delayed update of the PM QoS request. */
+		struct timer_list timer;
+
+		/** Response frequency target to use in GPU-bound conditions. */
+		uint32_t target_hz;
+
+		/**
+		 * Maximum delay before the PM QoS request is updated
+		 * after we become GPU-bound.
+		 */
+		uint32_t delay_max_ns;
+
+		/**
+		 * Exponent of delay slope used when the workload
+		 * becomes non-GPU-bound, used to provide greater
+		 * sensitivity to periods of GPU inactivity which may
+		 * indicate that the workload is latency-bound.
+		 */
+		uint32_t delay_slope_shift;
+
+		/**
+		 * Last time intel_gt_pm_active_begin() was called to
+		 * indicate that the GPU is a bottleneck.
+		 */
+		atomic64_t time_set_ns;
+
+		/**
+		 * Last time intel_gt_pm_active_end() was called to
+		 * indicate that the GPU is no longer a bottleneck.
+		 */
+		atomic64_t time_clear_ns;
+
+		/**
+		 * Number of times intel_gt_pm_active_begin() was
+		 * called without a matching intel_gt_pm_active_end().
+		 * Will be greater than zero if the GPU is currently
+		 * considered to be a bottleneck.
+		 */
+		atomic_t active_count;
+	} sf_qos;
 };
 
 enum intel_gt_scratch_field {
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index be5d6b71b6b0..767fa88f4d20 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2365,6 +2365,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
 
 	smp_wmb(); /* complete the seqlock for execlists_active() */
 	WRITE_ONCE(execlists->active, execlists->inflight);
+
+	if (atomic_xchg(&execlists->overload, 0)) {
+		struct intel_engine_cs *engine =
+			container_of(execlists, typeof(*engine), execlists);
+		intel_gt_pm_active_end(engine->gt);
+	}
 }
 
 static inline void
@@ -2533,12 +2539,23 @@ static void process_csb(struct intel_engine_cs *engine)
 			WRITE_ONCE(execlists->active, execlists->inflight);
 
 			WRITE_ONCE(execlists->pending[0], NULL);
+
+                        if (execlists->inflight[1]) {
+                                if (!atomic_xchg(&execlists->overload, 1))
+                                        intel_gt_pm_active_begin(engine->gt);
+                        } else {
+                                if (atomic_xchg(&execlists->overload, 0))
+                                        intel_gt_pm_active_end(engine->gt);
+                        }
 		} else {
 			GEM_BUG_ON(!*execlists->active);
 
 			/* port0 completed, advanced to port1 */
 			trace_ports(execlists, "completed", execlists->active);
 
+                        if (atomic_xchg(&execlists->overload, 0))
+                                intel_gt_pm_active_end(engine->gt);
+
 			/*
 			 * We rely on the hardware being strongly
 			 * ordered, that the breadcrumb write is
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

This allows CPUFREQ governors to realize when the system becomes
non-CPU-bound due to GPU rendering activity, and cause them to respond
more conservatively to the workload by limiting their response
frequency: CPU energy usage will be reduced when there isn't a good
chance for system performance to scale with CPU frequency due to the
GPU bottleneck.  This leaves additional TDP budget available for the
GPU to reach higher frequencies, which may be translated into an
improvement in graphics performance to the extent that the workload
remains TDP-limited.  If the workload isn't (anymore) TDP-limited
performance should stay roughly constant, but energy usage will be
divided by a similar factor.

The metric used by this patch in order to determine whether the GPU is
unlikely to be a bottleneck may not be particularly obvious: We only
specify a reduced PM QoS response frequency target whenever both
execlists ELSP ports are simultaneously active, since in that case we
know that the completion of one context will lead to the immediate
execution of another, which means that the GPU can be kept busy
without the execlists submission code rushing to submit a new request,
so CPU latency shouldn't be a concern.

This might miss some workloads that could theoretically benefit from
this optimization, since some applications are unable to keep both
ELSP ports active for a significant fraction of time even though they
are GPU-bound, however using the single-ELSP utilization as metric
would neglect the CPU latency-sensitivity of the execlists submission
code, which would lead to large regressions in x11perf and
jxrendermark.  For that reason this patch takes the rather
conservative approach of restricting the optimization to workloads
that effectively utilize both ELSP ports, which indicates that command
submission latency is unlikely to be an issue.

Note that this is currently only enabled for execlists submission.  It
might be beneficial to do the same thing in combination with GuC
submission, but the metric would be slightly different since we
wouldn't need to care about multiple ELSP ports being in use.

In order to further prevent regressions the optimization is enabled
with a delay in order to avoid performance degradation of applications
that quickly switch back and forth between being GPU-bound and
CPU-bound.  A reduced PM QoS scaling response frequency target will
only be specified if the GPU has been continuously utilized for a long
enough period of time.

v3: Assorted clean-ups (Chris).  Improved documentation (Tvrtko).  Fix
    interaction with single-ELSP preemption (Chris).  Move overload
    signalling to process_csb() to reduce bias due to interrupt
    latency (Francisco).  Rename CPU_RESPONSE_FREQUENCY to
    CPU_SCALING_RESPONSE (Rafael).  Adjust heuristic parameters to
    avoid regressions from other v3 governor changes (Francisco).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c    |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  11 ++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c        | 107 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_pm.h        |   3 +
 drivers/gpu/drm/i915/gt/intel_gt_types.h     |  49 +++++++++
 drivers/gpu/drm/i915/gt/intel_lrc.c          |  17 +++
 6 files changed, 188 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index b1f8527f02c8..6b08a9ad2de1 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -517,6 +517,7 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine)
 
 	execlists->queue_priority_hint = INT_MIN;
 	execlists->queue = RB_ROOT_CACHED;
+	atomic_set(&execlists->overload, 0);
 }
 
 static void cleanup_status_page(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index bf395227c99f..9bdb3958dbb7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -275,6 +275,17 @@ struct intel_engine_execlists {
 	 */
 	u8 csb_head;
 
+	/**
+	 * @overload: whether at least two execlist ports are
+	 * currently submitted to the hardware, indicating that CPU
+	 * latency isn't critical in order to maintain the GPU busy.
+	 * We use that to trigger a more energy-efficient response
+	 * mode of CPU power management, since performance degradation
+	 * is unlikely under those conditions, and GPU throughput may
+	 * benefit from the increased TDP budget.
+	 */
+	atomic_t overload;
+
 	I915_SELFTEST_DECLARE(struct st_preempt_hang preempt_hang;)
 };
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 6bdb74892a1e..0d44ef3a07ad 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -107,6 +107,102 @@ void intel_gt_pm_init_early(struct intel_gt *gt)
 	intel_wakeref_init(&gt->wakeref, gt->uncore->rpm, &wf_ops);
 }
 
+/**
+ * Time increment until the most immediate PM QoS scaling response
+ * frequency update.
+ *
+ * May be in the future (return value > 0) if the GPU is currently
+ * active but we haven't updated the PM QoS request to reflect a
+ * bottleneck yet.  May be in the past (return value < 0) if the GPU
+ * isn't fully utilized and we've already reset the PM QoS request to
+ * the default value.  May be zero if a PM QoS request update is due.
+ *
+ * The time increment returned by this function decreases linearly
+ * with time until it reaches either zero or a configurable limit.
+ */
+static int32_t time_to_sf_qos_update_ns(struct intel_gt *gt)
+{
+	const uint64_t t1 = ktime_get_ns();
+	const uint64_t dt1 = gt->sf_qos.delay_max_ns;
+
+	if (atomic_read_acquire(&gt->sf_qos.active_count)) {
+		const uint64_t t0 = atomic64_read(&gt->sf_qos.time_set_ns);
+
+		return min(dt1, t0 <= t1 ? 0 : t0 - t1);
+	} else {
+		const uint64_t t0 = atomic64_read(&gt->sf_qos.time_clear_ns);
+		const unsigned int shift = gt->sf_qos.delay_slope_shift;
+
+		return -(int32_t)(t1 <= t0 ? 1 :
+				  min(dt1, (t1 - t0) << shift));
+	}
+}
+
+/**
+ * Perform a delayed PM QoS scaling response frequency update.
+ */
+static void intel_gt_sf_qos_update(struct intel_gt *gt)
+{
+	const uint32_t dt = max(0, time_to_sf_qos_update_ns(gt));
+
+	timer_reduce(&gt->sf_qos.timer, jiffies + nsecs_to_jiffies(dt));
+}
+
+/**
+ * Timer that fires once the delay used to switch the PM QoS scaling
+ * response frequency request has elapsed.
+ */
+static void intel_gt_sf_qos_timeout(struct timer_list *timer)
+{
+	struct intel_gt *gt = container_of(timer, struct intel_gt,
+					   sf_qos.timer);
+	const int32_t dt = time_to_sf_qos_update_ns(gt);
+
+	if (dt == 0)
+		cpu_scaling_response_qos_update_request(
+			&gt->sf_qos.req, gt->sf_qos.target_hz);
+	else
+		cpu_scaling_response_qos_update_request(
+			&gt->sf_qos.req, PM_QOS_DEFAULT_VALUE);
+
+	if (dt > 0)
+		intel_gt_sf_qos_update(gt);
+}
+
+/**
+ * Report the beginning of a period of GPU utilization to PM.
+ *
+ * May trigger a more energy-efficient response mode in CPU PM, but
+ * only after a certain delay has elapsed so we don't have a negative
+ * impact on the CPU ramp-up latency except after the GPU has been
+ * continuously utilized for a long enough period of time.
+ */
+void intel_gt_pm_active_begin(struct intel_gt *gt)
+{
+	const uint32_t dt = abs(time_to_sf_qos_update_ns(gt));
+
+	atomic64_set(&gt->sf_qos.time_set_ns, ktime_get_ns() + dt);
+
+	if (!atomic_fetch_inc_release(&gt->sf_qos.active_count))
+		intel_gt_sf_qos_update(gt);
+}
+
+/**
+ * Report the end of a period of GPU utilization to PM.
+ *
+ * Must be called once after each call to intel_gt_pm_active_begin().
+ */
+void intel_gt_pm_active_end(struct intel_gt *gt)
+{
+	const uint32_t dt = abs(time_to_sf_qos_update_ns(gt));
+	const unsigned int shift = gt->sf_qos.delay_slope_shift;
+
+	atomic64_set(&gt->sf_qos.time_clear_ns, ktime_get_ns() - (dt >> shift));
+
+	if (!atomic_dec_return_release(&gt->sf_qos.active_count))
+		intel_gt_sf_qos_update(gt);
+}
+
 void intel_gt_pm_init(struct intel_gt *gt)
 {
 	/*
@@ -116,6 +212,14 @@ void intel_gt_pm_init(struct intel_gt *gt)
 	 */
 	intel_rc6_init(&gt->rc6);
 	intel_rps_init(&gt->rps);
+
+	cpu_scaling_response_qos_add_request(&gt->sf_qos.req,
+					     PM_QOS_DEFAULT_VALUE);
+
+	gt->sf_qos.delay_max_ns = 10000000;
+	gt->sf_qos.delay_slope_shift = 1;
+	gt->sf_qos.target_hz = 2;
+	timer_setup(&gt->sf_qos.timer, intel_gt_sf_qos_timeout, 0);
 }
 
 static bool reset_engines(struct intel_gt *gt)
@@ -174,6 +278,9 @@ static void gt_sanitize(struct intel_gt *gt, bool force)
 
 void intel_gt_pm_fini(struct intel_gt *gt)
 {
+	del_timer_sync(&gt->sf_qos.timer);
+	cpu_scaling_response_qos_remove_request(&gt->sf_qos.req);
+
 	intel_rc6_fini(&gt->rc6);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 60f0e2fbe55c..43f1d45fb0db 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -58,6 +58,9 @@ int intel_gt_resume(struct intel_gt *gt);
 void intel_gt_runtime_suspend(struct intel_gt *gt);
 int intel_gt_runtime_resume(struct intel_gt *gt);
 
+void intel_gt_pm_active_begin(struct intel_gt *gt);
+void intel_gt_pm_active_end(struct intel_gt *gt);
+
 static inline bool is_mock_gt(const struct intel_gt *gt)
 {
 	return I915_SELFTEST_ONLY(gt->awake == -ENODEV);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 96890dd12b5f..8aaeb2450d05 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -10,6 +10,7 @@
 #include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
+#include <linux/pm_qos.h>
 #include <linux/spinlock.h>
 #include <linux/types.h>
 
@@ -97,6 +98,54 @@ struct intel_gt {
 	 * Reserved for exclusive use by the kernel.
 	 */
 	struct i915_address_space *vm;
+
+	/**
+	 * CPU response frequency QoS tracking.
+	 */
+	struct {
+		/** PM QoS request of this device. */
+		struct pm_qos_request req;
+
+		/** Timer used for delayed update of the PM QoS request. */
+		struct timer_list timer;
+
+		/** Response frequency target to use in GPU-bound conditions. */
+		uint32_t target_hz;
+
+		/**
+		 * Maximum delay before the PM QoS request is updated
+		 * after we become GPU-bound.
+		 */
+		uint32_t delay_max_ns;
+
+		/**
+		 * Exponent of delay slope used when the workload
+		 * becomes non-GPU-bound, used to provide greater
+		 * sensitivity to periods of GPU inactivity which may
+		 * indicate that the workload is latency-bound.
+		 */
+		uint32_t delay_slope_shift;
+
+		/**
+		 * Last time intel_gt_pm_active_begin() was called to
+		 * indicate that the GPU is a bottleneck.
+		 */
+		atomic64_t time_set_ns;
+
+		/**
+		 * Last time intel_gt_pm_active_end() was called to
+		 * indicate that the GPU is no longer a bottleneck.
+		 */
+		atomic64_t time_clear_ns;
+
+		/**
+		 * Number of times intel_gt_pm_active_begin() was
+		 * called without a matching intel_gt_pm_active_end().
+		 * Will be greater than zero if the GPU is currently
+		 * considered to be a bottleneck.
+		 */
+		atomic_t active_count;
+	} sf_qos;
 };
 
 enum intel_gt_scratch_field {
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index be5d6b71b6b0..767fa88f4d20 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2365,6 +2365,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
 
 	smp_wmb(); /* complete the seqlock for execlists_active() */
 	WRITE_ONCE(execlists->active, execlists->inflight);
+
+	if (atomic_xchg(&execlists->overload, 0)) {
+		struct intel_engine_cs *engine =
+			container_of(execlists, typeof(*engine), execlists);
+		intel_gt_pm_active_end(engine->gt);
+	}
 }
 
 static inline void
@@ -2533,12 +2539,23 @@ static void process_csb(struct intel_engine_cs *engine)
 			WRITE_ONCE(execlists->active, execlists->inflight);
 
 			WRITE_ONCE(execlists->pending[0], NULL);
+
+                        if (execlists->inflight[1]) {
+                                if (!atomic_xchg(&execlists->overload, 1))
+                                        intel_gt_pm_active_begin(engine->gt);
+                        } else {
+                                if (atomic_xchg(&execlists->overload, 0))
+                                        intel_gt_pm_active_end(engine->gt);
+                        }
 		} else {
 			GEM_BUG_ON(!*execlists->active);
 
 			/* port0 completed, advanced to port1 */
 			trace_ports(execlists, "completed", execlists->active);
 
+                        if (atomic_xchg(&execlists->overload, 0))
+                                intel_gt_pm_active_end(engine->gt);
+
 			/*
 			 * We rely on the hardware being strongly
 			 * ordered, that the breadcrumb write is
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

v3: Rename CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 69 +++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index aa35a59f1c7d..16a45fd2c376 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1247,6 +1247,72 @@ static int i915_llc(struct seq_file *m, void *data)
 	return 0;
 }
 
+static int
+i915_sf_qos_delay_max_ns_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.sf_qos.delay_max_ns, val);
+	return 0;
+}
+
+static int
+i915_sf_qos_delay_max_ns_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.sf_qos.delay_max_ns);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_sf_qos_delay_max_ns_fops,
+			i915_sf_qos_delay_max_ns_get,
+			i915_sf_qos_delay_max_ns_set, "%llu\n");
+
+static int
+i915_sf_qos_delay_slope_shift_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.sf_qos.delay_slope_shift, val);
+	return 0;
+}
+
+static int
+i915_sf_qos_delay_slope_shift_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.sf_qos.delay_slope_shift);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_sf_qos_delay_slope_shift_fops,
+			i915_sf_qos_delay_slope_shift_get,
+			i915_sf_qos_delay_slope_shift_set, "%llu\n");
+
+static int
+i915_sf_qos_target_hz_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.sf_qos.target_hz, val);
+	return 0;
+}
+
+static int
+i915_sf_qos_target_hz_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.sf_qos.target_hz);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_sf_qos_target_hz_fops,
+			i915_sf_qos_target_hz_get,
+			i915_sf_qos_target_hz_set, "%llu\n");
+
 static int i915_runtime_pm_status(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -1882,6 +1948,9 @@ static const struct i915_debugfs_files {
 	{"i915_error_state", &i915_error_state_fops},
 	{"i915_gpu_info", &i915_gpu_info_fops},
 #endif
+	{"i915_sf_qos_delay_max_ns", &i915_sf_qos_delay_max_ns_fops},
+	{"i915_sf_qos_delay_slope_shift", &i915_sf_qos_delay_slope_shift_fops},
+	{"i915_sf_qos_target_hz", &i915_sf_qos_target_hz_fops}
 };
 
 void i915_debugfs_register(struct drm_i915_private *dev_priv)
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

v3: Rename CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 69 +++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index aa35a59f1c7d..16a45fd2c376 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1247,6 +1247,72 @@ static int i915_llc(struct seq_file *m, void *data)
 	return 0;
 }
 
+static int
+i915_sf_qos_delay_max_ns_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.sf_qos.delay_max_ns, val);
+	return 0;
+}
+
+static int
+i915_sf_qos_delay_max_ns_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.sf_qos.delay_max_ns);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_sf_qos_delay_max_ns_fops,
+			i915_sf_qos_delay_max_ns_get,
+			i915_sf_qos_delay_max_ns_set, "%llu\n");
+
+static int
+i915_sf_qos_delay_slope_shift_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.sf_qos.delay_slope_shift, val);
+	return 0;
+}
+
+static int
+i915_sf_qos_delay_slope_shift_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.sf_qos.delay_slope_shift);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_sf_qos_delay_slope_shift_fops,
+			i915_sf_qos_delay_slope_shift_get,
+			i915_sf_qos_delay_slope_shift_set, "%llu\n");
+
+static int
+i915_sf_qos_target_hz_set(void *data, u64 val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	WRITE_ONCE(dev_priv->gt.sf_qos.target_hz, val);
+	return 0;
+}
+
+static int
+i915_sf_qos_target_hz_get(void *data, u64 *val)
+{
+	struct drm_i915_private *dev_priv = data;
+
+	*val = READ_ONCE(dev_priv->gt.sf_qos.target_hz);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_sf_qos_target_hz_fops,
+			i915_sf_qos_target_hz_get,
+			i915_sf_qos_target_hz_set, "%llu\n");
+
 static int i915_runtime_pm_status(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -1882,6 +1948,9 @@ static const struct i915_debugfs_files {
 	{"i915_error_state", &i915_error_state_fops},
 	{"i915_gpu_info", &i915_gpu_info_fops},
 #endif
+	{"i915_sf_qos_delay_max_ns", &i915_sf_qos_delay_max_ns_fops},
+	{"i915_sf_qos_delay_slope_shift", &i915_sf_qos_delay_slope_shift_fops},
+	{"i915_sf_qos_target_hz", &i915_sf_qos_target_hz_fops}
 };
 
 void i915_debugfs_register(struct drm_i915_private *dev_priv)
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

This defines a generic policy in addition to the existing PERFORMANCE
and POWERSAVE policies.  The ADAPTIVE policy is expected to provide a
variable trade-off between performance and energy efficiency based on
the dynamic behavior of the workload -- E.g. whether the system has a
bottleneck on the CPU or another IO device.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/cpufreq.c | 10 ++++++++--
 drivers/cpufreq/longrun.c |  1 +
 include/linux/cpufreq.h   |  1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 045f9fe157ce..e87285b6294c 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -629,6 +629,9 @@ static unsigned int cpufreq_parse_policy(char *str_governor)
 	if (!strncasecmp(str_governor, "powersave", CPUFREQ_NAME_LEN))
 		return CPUFREQ_POLICY_POWERSAVE;
 
+	if (!strncasecmp(str_governor, "adaptive", CPUFREQ_NAME_LEN))
+		return CPUFREQ_POLICY_ADAPTIVE;
+
 	return CPUFREQ_POLICY_UNKNOWN;
 }
 
@@ -750,6 +753,8 @@ static ssize_t show_scaling_governor(struct cpufreq_policy *policy, char *buf)
 		return sprintf(buf, "powersave\n");
 	else if (policy->policy == CPUFREQ_POLICY_PERFORMANCE)
 		return sprintf(buf, "performance\n");
+	else if (policy->policy == CPUFREQ_POLICY_ADAPTIVE)
+		return sprintf(buf, "adaptive\n");
 	else if (policy->governor)
 		return scnprintf(buf, CPUFREQ_NAME_PLEN, "%s\n",
 				policy->governor->name);
@@ -811,7 +816,7 @@ static ssize_t show_scaling_available_governors(struct cpufreq_policy *policy,
 	struct cpufreq_governor *t;
 
 	if (!has_target()) {
-		i += sprintf(buf, "performance powersave");
+		i += sprintf(buf, "performance powersave adaptive");
 		goto out;
 	}
 
@@ -1085,7 +1090,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
 				pol = policy->policy;
 		}
 		if (pol != CPUFREQ_POLICY_PERFORMANCE &&
-		    pol != CPUFREQ_POLICY_POWERSAVE)
+		    pol != CPUFREQ_POLICY_POWERSAVE &&
+		    pol != CPUFREQ_POLICY_ADAPTIVE)
 			return -ENODATA;
 	}
 
diff --git a/drivers/cpufreq/longrun.c b/drivers/cpufreq/longrun.c
index 1caaec7c280b..cb70f0b7ff7a 100644
--- a/drivers/cpufreq/longrun.c
+++ b/drivers/cpufreq/longrun.c
@@ -99,6 +99,7 @@ static int longrun_set_policy(struct cpufreq_policy *policy)
 		msr_lo |= 0x00000001;
 		break;
 	case CPUFREQ_POLICY_POWERSAVE:
+	case CPUFREQ_POLICY_ADAPTIVE:
 		break;
 	}
 	wrmsr(MSR_TMTA_LONGRUN_FLAGS, msr_lo, msr_hi);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index f7240251a949..fa63df914f9f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -536,6 +536,7 @@ static inline unsigned long cpufreq_scale(unsigned long old, u_int div,
  */
 #define CPUFREQ_POLICY_POWERSAVE	(1)
 #define CPUFREQ_POLICY_PERFORMANCE	(2)
+#define CPUFREQ_POLICY_ADAPTIVE 	(3)
 
 /*
  * The polling frequency depends on the capability of the processor. Default
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

This defines a generic policy in addition to the existing PERFORMANCE
and POWERSAVE policies.  The ADAPTIVE policy is expected to provide a
variable trade-off between performance and energy efficiency based on
the dynamic behavior of the workload -- E.g. whether the system has a
bottleneck on the CPU or another IO device.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/cpufreq.c | 10 ++++++++--
 drivers/cpufreq/longrun.c |  1 +
 include/linux/cpufreq.h   |  1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 045f9fe157ce..e87285b6294c 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -629,6 +629,9 @@ static unsigned int cpufreq_parse_policy(char *str_governor)
 	if (!strncasecmp(str_governor, "powersave", CPUFREQ_NAME_LEN))
 		return CPUFREQ_POLICY_POWERSAVE;
 
+	if (!strncasecmp(str_governor, "adaptive", CPUFREQ_NAME_LEN))
+		return CPUFREQ_POLICY_ADAPTIVE;
+
 	return CPUFREQ_POLICY_UNKNOWN;
 }
 
@@ -750,6 +753,8 @@ static ssize_t show_scaling_governor(struct cpufreq_policy *policy, char *buf)
 		return sprintf(buf, "powersave\n");
 	else if (policy->policy == CPUFREQ_POLICY_PERFORMANCE)
 		return sprintf(buf, "performance\n");
+	else if (policy->policy == CPUFREQ_POLICY_ADAPTIVE)
+		return sprintf(buf, "adaptive\n");
 	else if (policy->governor)
 		return scnprintf(buf, CPUFREQ_NAME_PLEN, "%s\n",
 				policy->governor->name);
@@ -811,7 +816,7 @@ static ssize_t show_scaling_available_governors(struct cpufreq_policy *policy,
 	struct cpufreq_governor *t;
 
 	if (!has_target()) {
-		i += sprintf(buf, "performance powersave");
+		i += sprintf(buf, "performance powersave adaptive");
 		goto out;
 	}
 
@@ -1085,7 +1090,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
 				pol = policy->policy;
 		}
 		if (pol != CPUFREQ_POLICY_PERFORMANCE &&
-		    pol != CPUFREQ_POLICY_POWERSAVE)
+		    pol != CPUFREQ_POLICY_POWERSAVE &&
+		    pol != CPUFREQ_POLICY_ADAPTIVE)
 			return -ENODATA;
 	}
 
diff --git a/drivers/cpufreq/longrun.c b/drivers/cpufreq/longrun.c
index 1caaec7c280b..cb70f0b7ff7a 100644
--- a/drivers/cpufreq/longrun.c
+++ b/drivers/cpufreq/longrun.c
@@ -99,6 +99,7 @@ static int longrun_set_policy(struct cpufreq_policy *policy)
 		msr_lo |= 0x00000001;
 		break;
 	case CPUFREQ_POLICY_POWERSAVE:
+	case CPUFREQ_POLICY_ADAPTIVE:
 		break;
 	}
 	wrmsr(MSR_TMTA_LONGRUN_FLAGS, msr_lo, msr_hi);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index f7240251a949..fa63df914f9f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -536,6 +536,7 @@ static inline unsigned long cpufreq_scale(unsigned long old, u_int div,
  */
 #define CPUFREQ_POLICY_POWERSAVE	(1)
 #define CPUFREQ_POLICY_PERFORMANCE	(2)
+#define CPUFREQ_POLICY_ADAPTIVE 	(3)
 
 /*
  * The polling frequency depends on the capability of the processor. Default
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

Trivial.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 66ab6523c3eb..49401cfe9858 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2000,6 +2000,18 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
 	return 0;
 }
 
+static void intel_pstate_clear_update_util_hook(unsigned int cpu)
+{
+	struct cpudata *cpu_data = all_cpu_data[cpu];
+
+	if (!cpu_data->update_util_set)
+		return;
+
+	cpufreq_remove_update_util_hook(cpu);
+	cpu_data->update_util_set = false;
+	synchronize_rcu();
+}
+
 static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
 	struct cpudata *cpu = all_cpu_data[cpu_num];
@@ -2019,18 +2031,6 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 	cpu->update_util_set = true;
 }
 
-static void intel_pstate_clear_update_util_hook(unsigned int cpu)
-{
-	struct cpudata *cpu_data = all_cpu_data[cpu];
-
-	if (!cpu_data->update_util_set)
-		return;
-
-	cpufreq_remove_update_util_hook(cpu);
-	cpu_data->update_util_set = false;
-	synchronize_rcu();
-}
-
 static int intel_pstate_get_max_freq(struct cpudata *cpu)
 {
 	return global.turbo_disabled || global.no_turbo ?
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

Trivial.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 66ab6523c3eb..49401cfe9858 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2000,6 +2000,18 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
 	return 0;
 }
 
+static void intel_pstate_clear_update_util_hook(unsigned int cpu)
+{
+	struct cpudata *cpu_data = all_cpu_data[cpu];
+
+	if (!cpu_data->update_util_set)
+		return;
+
+	cpufreq_remove_update_util_hook(cpu);
+	cpu_data->update_util_set = false;
+	synchronize_rcu();
+}
+
 static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
 	struct cpudata *cpu = all_cpu_data[cpu_num];
@@ -2019,18 +2031,6 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 	cpu->update_util_set = true;
 }
 
-static void intel_pstate_clear_update_util_hook(unsigned int cpu)
-{
-	struct cpudata *cpu_data = all_cpu_data[cpu];
-
-	if (!cpu_data->update_util_set)
-		return;
-
-	cpufreq_remove_update_util_hook(cpu);
-	cpu_data->update_util_set = false;
-	synchronize_rcu();
-}
-
 static int intel_pstate_get_max_freq(struct cpudata *cpu)
 {
 	return global.turbo_disabled || global.no_turbo ?
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

And let it figure out whether an update_util hook is needed, and what
the appropriate function pointer is based on the CPUFREQ policy of the
current CPU.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 49401cfe9858..fd7eee57c05c 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2016,10 +2016,11 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
 	struct cpudata *cpu = all_cpu_data[cpu_num];
 
-	if (hwp_active && !hwp_boost)
-		return;
-
 	if (cpu->update_util_set)
+		intel_pstate_clear_update_util_hook(cpu_num);
+
+	if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE ||
+	    (hwp_active && !hwp_boost))
 		return;
 
 	/* Prevent intel_pstate_update_util() from using stale data. */
@@ -2117,27 +2118,18 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
 
 	intel_pstate_update_perf_limits(cpu, policy->min, policy->max);
 
+	intel_pstate_set_update_util_hook(policy->cpu);
+
 	if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE) {
 		/*
 		 * NOHZ_FULL CPUs need this as the governor callback may not
 		 * be invoked on them.
 		 */
-		intel_pstate_clear_update_util_hook(policy->cpu);
 		intel_pstate_max_within_limits(cpu);
-	} else {
-		intel_pstate_set_update_util_hook(policy->cpu);
 	}
 
-	if (hwp_active) {
-		/*
-		 * When hwp_boost was active before and dynamically it
-		 * was turned off, in that case we need to clear the
-		 * update util hook.
-		 */
-		if (!hwp_boost)
-			intel_pstate_clear_update_util_hook(policy->cpu);
+	if (hwp_active)
 		intel_pstate_hwp_set(policy->cpu);
-	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

And let it figure out whether an update_util hook is needed, and what
the appropriate function pointer is based on the CPUFREQ policy of the
current CPU.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 49401cfe9858..fd7eee57c05c 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2016,10 +2016,11 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
 	struct cpudata *cpu = all_cpu_data[cpu_num];
 
-	if (hwp_active && !hwp_boost)
-		return;
-
 	if (cpu->update_util_set)
+		intel_pstate_clear_update_util_hook(cpu_num);
+
+	if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE ||
+	    (hwp_active && !hwp_boost))
 		return;
 
 	/* Prevent intel_pstate_update_util() from using stale data. */
@@ -2117,27 +2118,18 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
 
 	intel_pstate_update_perf_limits(cpu, policy->min, policy->max);
 
+	intel_pstate_set_update_util_hook(policy->cpu);
+
 	if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE) {
 		/*
 		 * NOHZ_FULL CPUs need this as the governor callback may not
 		 * be invoked on them.
 		 */
-		intel_pstate_clear_update_util_hook(policy->cpu);
 		intel_pstate_max_within_limits(cpu);
-	} else {
-		intel_pstate_set_update_util_hook(policy->cpu);
 	}
 
-	if (hwp_active) {
-		/*
-		 * When hwp_boost was active before and dynamically it
-		 * was turned off, in that case we need to clear the
-		 * update util hook.
-		 */
-		if (!hwp_boost)
-			intel_pstate_clear_update_util_hook(policy->cpu);
+	if (hwp_active)
 		intel_pstate_hwp_set(policy->cpu);
-	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

The goal of the statistics calculation code introduced here is to
compute two informational data structures: struct vlp_input_stats
aggregating various scheduling and PM statistics gathered in every
call of the update_util() hook, and struct vlp_status_sample which
contains status information derived from the former indicating whether
the system is likely to have an IO or CPU bottleneck.  This will be
used as main heuristic input by the new variably low-pass filtering
controller (AKA VLP) that will assist the HWP at finding a reasonably
energy-efficient P-state given the additional information available to
the kernel about I/O utilization and scheduling behavior.

Then a function is introduced (get_vlp_target_range()) which
calculates a P-state range derived from those statistics, which will
be used to drive the HWP P-state range or (if HWP is not available) as
basis for some additional kernel-side frequency selection mechanism
which will choose a single P-state from the range.  This is meant to
provide a variably low-pass filtering effect that will damp
oscillations below a frequency threshold that can be specified by
device drivers via PM QoS in order to achieve energy-efficient
behavior in cases where the system has an IO bottleneck.

v3: Rename CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).
    Merge patch with statistics calculations with patch which computes
    a derived P-state target based on their result (Rafael).  Define
    generic CPUFREQ policy to control VLP governor (Rafael).  Adjust
    realtime_gain_pml to avoid fluctuation of minimum P-state on HWP
    parts (Srinivas).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 398 +++++++++++++++++++++++++++++++++
 1 file changed, 398 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index fd7eee57c05c..d13eb95a6271 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/debugfs.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -33,6 +34,8 @@
 #include <asm/cpufeature.h>
 #include <asm/intel-family.h>
 
+#include "../../kernel/sched/sched.h"
+
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
@@ -59,6 +62,14 @@ static inline int32_t mul_fp(int32_t x, int32_t y)
 	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
 }
 
+/**
+ * Round a fixed-point fraction to the closest integer.
+ */
+static inline int rnd_fp(int32_t x)
+{
+	return (x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
+}
+
 static inline int32_t div_fp(s64 x, s64 y)
 {
 	return div64_s64((int64_t)x << FRAC_BITS, y);
@@ -169,6 +180,66 @@ struct vid_data {
 	int32_t ratio;
 };
 
+/**
+ * Scheduling and PM statistics gathered by update_vlp_sample() at
+ * every call of the VLP update_state() hook, used as heuristic
+ * inputs.
+ */
+struct vlp_input_stats {
+	int32_t realtime_count;
+	int32_t io_wait_count;
+	uint32_t max_scaling_response_hz;
+	uint32_t last_scaling_response_hz;
+};
+
+enum vlp_status {
+	VLP_BOTTLENECK_IO = 1 << 0,
+	/*
+	 * XXX - Add other status bits here indicating a CPU or TDP
+	 * bottleneck.
+	 */
+};
+
+/**
+ * Heuristic status information calculated by get_vlp_status_sample()
+ * from struct vlp_input_stats above, indicating whether the system
+ * has a potential IO or latency bottleneck.
+ */
+struct vlp_status_sample {
+	enum vlp_status value;
+	int32_t realtime_avg;
+};
+
+/**
+ * VLP controller state used for the estimation of the target P-state
+ * range, computed by get_vlp_target_range() from the heuristic status
+ * information defined above in struct vlp_status_sample.
+ */
+struct vlp_target_range {
+	unsigned int value[2];
+	int32_t p_base;
+};
+
+/**
+ * struct vlp_data - VLP controller parameters and state.
+ * @sample_interval_ns:	 Update interval in ns.
+ * @sample_frequency_hz: Reciprocal of the update interval in Hz.
+ * @gain*:		 Response factor of the controller relative to each
+ *			 one of its linear input variables as fixed-point
+ *			 fraction.
+ */
+struct vlp_data {
+	s64 sample_interval_ns;
+	int32_t sample_frequency_hz;
+	int32_t gain_aggr;
+	int32_t gain_rt;
+	int32_t gain;
+
+	struct vlp_input_stats stats;
+	struct vlp_status_sample status;
+	struct vlp_target_range target;
+};
+
 /**
  * struct global_params - Global parameters, mostly tunable via sysfs.
  * @no_turbo:		Whether or not to use turbo P-states.
@@ -239,6 +310,7 @@ struct cpudata {
 
 	struct pstate_data pstate;
 	struct vid_data vid;
+	struct vlp_data vlp;
 
 	u64	last_update;
 	u64	last_sample_time;
@@ -268,6 +340,24 @@ struct cpudata {
 
 static struct cpudata **all_cpu_data;
 
+/**
+ * struct vlp_params - VLP controller static configuration
+ * @sample_interval_ms:	     Update interval in ms.
+ * @setpoint_*_pml:	     Target CPU utilization at which the controller is
+ *			     expected to leave the current P-state untouched,
+ *			     as an integer per mille.
+ * @avg*_hz:		     Exponential averaging frequencies of the various
+ *			     low-pass filters as an integer in Hz.
+ */
+struct vlp_params {
+	int sample_interval_ms;
+	int setpoint_0_pml;
+	int setpoint_aggr_pml;
+	int avg_hz;
+	int realtime_gain_pml;
+	int debug;
+};
+
 /**
  * struct pstate_funcs - Per CPU model specific callbacks
  * @get_max:		Callback to get maximum non turbo effective P state
@@ -293,6 +383,14 @@ struct pstate_funcs {
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
+static struct vlp_params vlp_params __read_mostly = {
+	.sample_interval_ms = 10,
+	.setpoint_0_pml = 900,
+	.setpoint_aggr_pml = 1500,
+	.avg_hz = 2,
+	.realtime_gain_pml = 0,
+	.debug = 0,
+};
 
 static int hwp_active __read_mostly;
 static int hwp_mode_bdw __read_mostly;
@@ -1790,6 +1888,306 @@ static inline int32_t get_target_pstate(struct cpudata *cpu)
 	return target;
 }
 
+/**
+ * Initialize the struct vlp_data of the specified CPU to the defaults
+ * calculated from @vlp_params.
+ */
+static void intel_pstate_reset_vlp(struct cpudata *cpu)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+
+	if (cpu->policy != CPUFREQ_POLICY_ADAPTIVE)
+		return;
+
+	vlp->sample_interval_ns = vlp_params.sample_interval_ms * NSEC_PER_MSEC;
+
+	if (vlp_params.sample_interval_ms)
+		vlp->sample_frequency_hz = max(1u, (uint32_t)MSEC_PER_SEC /
+						   vlp_params.sample_interval_ms);
+	else
+		vlp->sample_frequency_hz = 0;
+
+	vlp->gain_rt = div_fp(cpu->pstate.max_pstate *
+			      vlp_params.realtime_gain_pml, 1000);
+	vlp->gain_aggr = max(1, div_fp(1000, vlp_params.setpoint_aggr_pml));
+	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
+	vlp->target.p_base = 0;
+	vlp->stats.last_scaling_response_hz = vlp_params.avg_hz;
+}
+
+/**
+ * Fixed point representation with twice the usual number of
+ * fractional bits.
+ */
+#define DFRAC_BITS 16
+#define DFRAC_ONE (1 << DFRAC_BITS)
+#define DFRAC_MAX_INT (0u - (uint32_t)DFRAC_ONE)
+
+/**
+ * Fast but rather inaccurate piecewise-linear approximation of a
+ * fixed-point inverse exponential:
+ *
+ *  exp2n(p) = int_tofp(1) * 2 ^ (-p / DFRAC_ONE) + O(1)
+ *
+ * The error term should be lower in magnitude than 0.044.
+ */
+static int32_t exp2n(uint32_t p)
+{
+	if (p < 32 * DFRAC_ONE) {
+		/* Interpolate between 2^-floor(p) and 2^-ceil(p). */
+		const uint32_t floor_p = p >> DFRAC_BITS;
+		const uint32_t ceil_p = (p + DFRAC_ONE - 1) >> DFRAC_BITS;
+		const uint64_t frac_p = p - (floor_p << DFRAC_BITS);
+
+		return ((int_tofp(1) >> floor_p) * (DFRAC_ONE - frac_p) +
+			(ceil_p >= 32 ? 0 : int_tofp(1) >> ceil_p) * frac_p) >>
+			DFRAC_BITS;
+	}
+
+	/* Short-circuit to avoid overflow. */
+	return 0;
+}
+
+/**
+ * Calculate the exponential averaging weight for a new sample based
+ * on the requested averaging frequency @hz and the delay since the
+ * last update.
+ */
+static int32_t get_last_sample_avg_weight(struct cpudata *cpu, unsigned int hz)
+{
+	/*
+	 * Approximate, but saves several 64-bit integer divisions
+	 * below and should be fully evaluated at compile-time.
+	 * Causes the exponential averaging to have an effective base
+	 * of 1.90702343749, which has little functional implications
+	 * as long as the hz parameter is scaled accordingly.
+	 */
+	const uint32_t ns_per_s_shift = order_base_2(NSEC_PER_SEC);
+	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
+
+	return exp2n(min((uint64_t)DFRAC_MAX_INT,
+			 (hz * delta_ns) >> (ns_per_s_shift - DFRAC_BITS)));
+}
+
+/**
+ * Calculate some status information heuristically based on the struct
+ * vlp_input_stats statistics gathered by the update_state() hook.
+ */
+static const struct vlp_status_sample *get_vlp_status_sample(
+	struct cpudata *cpu, const int32_t po)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+	struct vlp_input_stats *stats = &vlp->stats;
+	struct vlp_status_sample *last_status = &vlp->status;
+
+	/*
+	 * Calculate the VLP_BOTTLENECK_IO state bit, which indicates
+	 * whether some IO device driver has requested a PM response
+	 * frequency bound, typically due to the device being under
+	 * close to full utilization, which should cause the
+	 * controller to make a more conservative trade-off between
+	 * latency and energy usage, since performance isn't
+	 * guaranteed to scale further with increasing CPU frequency
+	 * whenever the system is close to IO-bound.
+	 *
+	 * Note that the maximum achievable response frequency is
+	 * limited by the sampling frequency of the controller,
+	 * response frequency requests greater than that will be
+	 * promoted to infinity (i.e. no low-pass filtering) in order
+	 * to avoid violating the response frequency constraint
+	 * provided via PM QoS.
+	 */
+	const bool bottleneck_io = stats->max_scaling_response_hz <
+				   vlp->sample_frequency_hz;
+
+	/*
+	 * Calculate the realtime statistic that tracks the
+	 * exponentially-averaged rate of occurrence of
+	 * latency-sensitive events (like wake-ups from IO wait).
+	 */
+	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
+	const int32_t realtime_sample =
+		div_fp((uint64_t)(stats->realtime_count +
+				  (bottleneck_io ? 0 : stats->io_wait_count)) *
+		       NSEC_PER_SEC,
+		       100 * delta_ns);
+	const int32_t alpha = get_last_sample_avg_weight(cpu,
+							 vlp_params.avg_hz);
+	const int32_t realtime_avg = realtime_sample +
+		mul_fp(alpha, last_status->realtime_avg - realtime_sample);
+
+	/* Consume the input statistics. */
+	stats->io_wait_count = 0;
+	stats->realtime_count = 0;
+	if (bottleneck_io)
+		stats->last_scaling_response_hz =
+			stats->max_scaling_response_hz;
+	stats->max_scaling_response_hz = 0;
+
+	/* Update the state of the controller. */
+	last_status->realtime_avg = realtime_avg;
+	last_status->value = (bottleneck_io ? VLP_BOTTLENECK_IO : 0);
+
+	/* Update state used for tracing. */
+	cpu->sample.busy_scaled = int_tofp(stats->max_scaling_response_hz);
+	cpu->iowait_boost = realtime_avg;
+
+	return last_status;
+}
+
+/**
+ * Calculate the target P-state range for the next update period.
+ * Uses a variably low-pass-filtering controller intended to improve
+ * energy efficiency when a CPU response frequency target is specified
+ * via PM QoS (e.g. under IO-bound conditions).
+ */
+static const struct vlp_target_range *get_vlp_target_range(struct cpudata *cpu)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+	struct vlp_target_range *last_target = &vlp->target;
+
+	/*
+	 * P-state limits in fixed-point as allowed by the policy.
+	 */
+	const int32_t p0 = int_tofp(max(cpu->pstate.min_pstate,
+					cpu->min_perf_ratio));
+	const int32_t p1 = int_tofp(cpu->max_perf_ratio);
+
+	/*
+	 * Observed average P-state during the sampling period.	 The
+	 * conservative path (po_cons) uses the TSC increment as
+	 * denominator which will give the minimum (arguably most
+	 * energy-efficient) P-state able to accomplish the observed
+	 * amount of work during the sampling period.
+	 *
+	 * The downside of that somewhat optimistic estimate is that
+	 * it can give a biased result for intermittent
+	 * latency-sensitive workloads, which may have to be completed
+	 * in a short window of time for the system to achieve maximum
+	 * performance, even if the average CPU utilization is low.
+	 * For that reason the aggressive path (po_aggr) uses the
+	 * MPERF increment as denominator, which is approximately
+	 * optimal under the pessimistic assumption that the CPU work
+	 * cannot be parallelized with any other dependent IO work
+	 * that subsequently keeps the CPU idle (partly in C1+
+	 * states).
+	 */
+	const int32_t po_cons =
+		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
+		       * cpu->pstate.max_pstate_physical,
+		       cpu->sample.tsc);
+	const int32_t po_aggr =
+		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
+		       * cpu->pstate.max_pstate_physical,
+		       (cpu->sample.mperf << cpu->aperf_mperf_shift));
+
+	const struct vlp_status_sample *status =
+		get_vlp_status_sample(cpu, po_cons);
+
+	/* Calculate the target P-state. */
+	const int32_t p_tgt_cons = mul_fp(vlp->gain, po_cons);
+	const int32_t p_tgt_aggr = mul_fp(vlp->gain_aggr, po_aggr);
+	const int32_t p_tgt = max(p0, min(p1, max(p_tgt_cons, p_tgt_aggr)));
+
+	/* Calculate the realtime P-state target lower bound. */
+	const int32_t pm = int_tofp(cpu->pstate.max_pstate);
+	const int32_t p_tgt_rt = min(pm, mul_fp(vlp->gain_rt,
+						status->realtime_avg));
+
+	/*
+	 * Low-pass filter the P-state estimate above by exponential
+	 * averaging.  For an oscillating workload (e.g. submitting
+	 * work repeatedly to a device like a soundcard or GPU) this
+	 * will approximate the minimum P-state that would be able to
+	 * accomplish the observed amount of work during the averaging
+	 * period, which is also the optimally energy-efficient one,
+	 * under the assumptions that:
+	 *
+	 *  - The power curve of the system is convex throughout the
+	 *    range of P-states allowed by the policy. I.e. energy
+	 *    efficiency is steadily decreasing with frequency past p0
+	 *    (which is typically close to the maximum-efficiency
+	 *    ratio).  In practice for the lower range of P-states
+	 *    this may only be approximately true due to the
+	 *    interaction between different components of the system.
+	 *
+	 *  - Parallelism constraints of the workload don't prevent it
+	 *    from achieving the same throughput at the lower P-state.
+	 *    This will happen in cases where the application is
+	 *    designed in a way that doesn't allow for dependent CPU
+	 *    and IO jobs to be pipelined, leading to alternating full
+	 *    and zero utilization of the CPU and IO device.  This
+	 *    will give an average IO device utilization lower than
+	 *    100% regardless of the CPU frequency, which should
+	 *    prevent the device driver from requesting a response
+	 *    frequency bound, so the filtered P-state calculated
+	 *    below won't have an influence on the controller
+	 *    response.
+	 *
+	 *  - The period of the oscillating workload is significantly
+	 *    shorter than the time constant of the exponential
+	 *    average (1s / last_scaling_response_hz).  Otherwise for
+	 *    more slowly oscillating workloads the controller
+	 *    response will roughly follow the oscillation, leading to
+	 *    decreased energy efficiency.
+	 *
+	 *  - The behavior of the workload doesn't change
+	 *    qualitatively during the next update interval.  This is
+	 *    only true in the steady state, and could possibly lead
+	 *    to a transitory period in which the controller response
+	 *    deviates from the most energy-efficient ratio until the
+	 *    workload reaches a steady state again.
+	 */
+	const int32_t alpha = get_last_sample_avg_weight(
+		cpu, vlp->stats.last_scaling_response_hz);
+
+	last_target->p_base = p_tgt + mul_fp(alpha,
+					     last_target->p_base - p_tgt);
+
+	/*
+	 * Use the low-pass-filtered controller response for better
+	 * energy efficiency unless we have reasons to believe that
+	 * some of the optimality assumptions discussed above may not
+	 * hold.
+	 */
+	if ((status->value & VLP_BOTTLENECK_IO)) {
+		last_target->value[0] = rnd_fp(p0);
+		last_target->value[1] = rnd_fp(last_target->p_base);
+	} else {
+		last_target->value[0] = rnd_fp(p_tgt_rt);
+		last_target->value[1] = rnd_fp(p1);
+	}
+
+	return last_target;
+}
+
+/**
+ * Collect some scheduling and PM statistics in response to an
+ * update_state() call.
+ */
+static bool update_vlp_sample(struct cpudata *cpu, u64 time, unsigned int flags)
+{
+	struct vlp_input_stats *stats = &cpu->vlp.stats;
+
+	/* Update PM QoS scaling response frequency request. */
+	const uint32_t resp_hz = cpu_scaling_response_qos_limit();
+
+	stats->max_scaling_response_hz = !resp_hz ? UINT_MAX :
+		max(stats->max_scaling_response_hz, resp_hz);
+
+	/* Update scheduling statistics. */
+	if ((flags & SCHED_CPUFREQ_IOWAIT))
+		stats->io_wait_count++;
+
+	if (cpu_rq(cpu->cpu)->rt.rt_nr_running)
+		stats->realtime_count++;
+
+	/* Return whether a P-state update is due. */
+	return smp_processor_id() == cpu->cpu &&
+		time - cpu->sample.time >= cpu->vlp.sample_interval_ns &&
+		intel_pstate_sample(cpu, time);
+}
+
 static int intel_pstate_prepare_request(struct cpudata *cpu, int pstate)
 {
 	int min_pstate = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

The goal of the statistics calculation code introduced here is to
compute two informational data structures: struct vlp_input_stats
aggregating various scheduling and PM statistics gathered in every
call of the update_util() hook, and struct vlp_status_sample which
contains status information derived from the former indicating whether
the system is likely to have an IO or CPU bottleneck.  This will be
used as main heuristic input by the new variably low-pass filtering
controller (AKA VLP) that will assist the HWP at finding a reasonably
energy-efficient P-state given the additional information available to
the kernel about I/O utilization and scheduling behavior.

Then a function is introduced (get_vlp_target_range()) which
calculates a P-state range derived from those statistics, which will
be used to drive the HWP P-state range or (if HWP is not available) as
basis for some additional kernel-side frequency selection mechanism
which will choose a single P-state from the range.  This is meant to
provide a variably low-pass filtering effect that will damp
oscillations below a frequency threshold that can be specified by
device drivers via PM QoS in order to achieve energy-efficient
behavior in cases where the system has an IO bottleneck.

v3: Rename CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).
    Merge patch with statistics calculations with patch which computes
    a derived P-state target based on their result (Rafael).  Define
    generic CPUFREQ policy to control VLP governor (Rafael).  Adjust
    realtime_gain_pml to avoid fluctuation of minimum P-state on HWP
    parts (Srinivas).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 398 +++++++++++++++++++++++++++++++++
 1 file changed, 398 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index fd7eee57c05c..d13eb95a6271 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/debugfs.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -33,6 +34,8 @@
 #include <asm/cpufeature.h>
 #include <asm/intel-family.h>
 
+#include "../../kernel/sched/sched.h"
+
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
@@ -59,6 +62,14 @@ static inline int32_t mul_fp(int32_t x, int32_t y)
 	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
 }
 
+/**
+ * Round a fixed-point fraction to the closest integer.
+ */
+static inline int rnd_fp(int32_t x)
+{
+	return (x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
+}
+
 static inline int32_t div_fp(s64 x, s64 y)
 {
 	return div64_s64((int64_t)x << FRAC_BITS, y);
@@ -169,6 +180,66 @@ struct vid_data {
 	int32_t ratio;
 };
 
+/**
+ * Scheduling and PM statistics gathered by update_vlp_sample() at
+ * every call of the VLP update_state() hook, used as heuristic
+ * inputs.
+ */
+struct vlp_input_stats {
+	int32_t realtime_count;
+	int32_t io_wait_count;
+	uint32_t max_scaling_response_hz;
+	uint32_t last_scaling_response_hz;
+};
+
+enum vlp_status {
+	VLP_BOTTLENECK_IO = 1 << 0,
+	/*
+	 * XXX - Add other status bits here indicating a CPU or TDP
+	 * bottleneck.
+	 */
+};
+
+/**
+ * Heuristic status information calculated by get_vlp_status_sample()
+ * from struct vlp_input_stats above, indicating whether the system
+ * has a potential IO or latency bottleneck.
+ */
+struct vlp_status_sample {
+	enum vlp_status value;
+	int32_t realtime_avg;
+};
+
+/**
+ * VLP controller state used for the estimation of the target P-state
+ * range, computed by get_vlp_target_range() from the heuristic status
+ * information defined above in struct vlp_status_sample.
+ */
+struct vlp_target_range {
+	unsigned int value[2];
+	int32_t p_base;
+};
+
+/**
+ * struct vlp_data - VLP controller parameters and state.
+ * @sample_interval_ns:	 Update interval in ns.
+ * @sample_frequency_hz: Reciprocal of the update interval in Hz.
+ * @gain*:		 Response factor of the controller relative to each
+ *			 one of its linear input variables as fixed-point
+ *			 fraction.
+ */
+struct vlp_data {
+	s64 sample_interval_ns;
+	int32_t sample_frequency_hz;
+	int32_t gain_aggr;
+	int32_t gain_rt;
+	int32_t gain;
+
+	struct vlp_input_stats stats;
+	struct vlp_status_sample status;
+	struct vlp_target_range target;
+};
+
 /**
  * struct global_params - Global parameters, mostly tunable via sysfs.
  * @no_turbo:		Whether or not to use turbo P-states.
@@ -239,6 +310,7 @@ struct cpudata {
 
 	struct pstate_data pstate;
 	struct vid_data vid;
+	struct vlp_data vlp;
 
 	u64	last_update;
 	u64	last_sample_time;
@@ -268,6 +340,24 @@ struct cpudata {
 
 static struct cpudata **all_cpu_data;
 
+/**
+ * struct vlp_params - VLP controller static configuration
+ * @sample_interval_ms:	     Update interval in ms.
+ * @setpoint_*_pml:	     Target CPU utilization at which the controller is
+ *			     expected to leave the current P-state untouched,
+ *			     as an integer per mille.
+ * @avg*_hz:		     Exponential averaging frequencies of the various
+ *			     low-pass filters as an integer in Hz.
+ */
+struct vlp_params {
+	int sample_interval_ms;
+	int setpoint_0_pml;
+	int setpoint_aggr_pml;
+	int avg_hz;
+	int realtime_gain_pml;
+	int debug;
+};
+
 /**
  * struct pstate_funcs - Per CPU model specific callbacks
  * @get_max:		Callback to get maximum non turbo effective P state
@@ -293,6 +383,14 @@ struct pstate_funcs {
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
+static struct vlp_params vlp_params __read_mostly = {
+	.sample_interval_ms = 10,
+	.setpoint_0_pml = 900,
+	.setpoint_aggr_pml = 1500,
+	.avg_hz = 2,
+	.realtime_gain_pml = 0,
+	.debug = 0,
+};
 
 static int hwp_active __read_mostly;
 static int hwp_mode_bdw __read_mostly;
@@ -1790,6 +1888,306 @@ static inline int32_t get_target_pstate(struct cpudata *cpu)
 	return target;
 }
 
+/**
+ * Initialize the struct vlp_data of the specified CPU to the defaults
+ * calculated from @vlp_params.
+ */
+static void intel_pstate_reset_vlp(struct cpudata *cpu)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+
+	if (cpu->policy != CPUFREQ_POLICY_ADAPTIVE)
+		return;
+
+	vlp->sample_interval_ns = vlp_params.sample_interval_ms * NSEC_PER_MSEC;
+
+	if (vlp_params.sample_interval_ms)
+		vlp->sample_frequency_hz = max(1u, (uint32_t)MSEC_PER_SEC /
+						   vlp_params.sample_interval_ms);
+	else
+		vlp->sample_frequency_hz = 0;
+
+	vlp->gain_rt = div_fp(cpu->pstate.max_pstate *
+			      vlp_params.realtime_gain_pml, 1000);
+	vlp->gain_aggr = max(1, div_fp(1000, vlp_params.setpoint_aggr_pml));
+	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
+	vlp->target.p_base = 0;
+	vlp->stats.last_scaling_response_hz = vlp_params.avg_hz;
+}
+
+/**
+ * Fixed point representation with twice the usual number of
+ * fractional bits.
+ */
+#define DFRAC_BITS 16
+#define DFRAC_ONE (1 << DFRAC_BITS)
+#define DFRAC_MAX_INT (0u - (uint32_t)DFRAC_ONE)
+
+/**
+ * Fast but rather inaccurate piecewise-linear approximation of a
+ * fixed-point inverse exponential:
+ *
+ *  exp2n(p) = int_tofp(1) * 2 ^ (-p / DFRAC_ONE) + O(1)
+ *
+ * The error term should be lower in magnitude than 0.044.
+ */
+static int32_t exp2n(uint32_t p)
+{
+	if (p < 32 * DFRAC_ONE) {
+		/* Interpolate between 2^-floor(p) and 2^-ceil(p). */
+		const uint32_t floor_p = p >> DFRAC_BITS;
+		const uint32_t ceil_p = (p + DFRAC_ONE - 1) >> DFRAC_BITS;
+		const uint64_t frac_p = p - (floor_p << DFRAC_BITS);
+
+		return ((int_tofp(1) >> floor_p) * (DFRAC_ONE - frac_p) +
+			(ceil_p >= 32 ? 0 : int_tofp(1) >> ceil_p) * frac_p) >>
+			DFRAC_BITS;
+	}
+
+	/* Short-circuit to avoid overflow. */
+	return 0;
+}
+
+/**
+ * Calculate the exponential averaging weight for a new sample based
+ * on the requested averaging frequency @hz and the delay since the
+ * last update.
+ */
+static int32_t get_last_sample_avg_weight(struct cpudata *cpu, unsigned int hz)
+{
+	/*
+	 * Approximate, but saves several 64-bit integer divisions
+	 * below and should be fully evaluated at compile-time.
+	 * Causes the exponential averaging to have an effective base
+	 * of 1.90702343749, which has little functional implications
+	 * as long as the hz parameter is scaled accordingly.
+	 */
+	const uint32_t ns_per_s_shift = order_base_2(NSEC_PER_SEC);
+	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
+
+	return exp2n(min((uint64_t)DFRAC_MAX_INT,
+			 (hz * delta_ns) >> (ns_per_s_shift - DFRAC_BITS)));
+}
+
+/**
+ * Calculate some status information heuristically based on the struct
+ * vlp_input_stats statistics gathered by the update_state() hook.
+ */
+static const struct vlp_status_sample *get_vlp_status_sample(
+	struct cpudata *cpu, const int32_t po)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+	struct vlp_input_stats *stats = &vlp->stats;
+	struct vlp_status_sample *last_status = &vlp->status;
+
+	/*
+	 * Calculate the VLP_BOTTLENECK_IO state bit, which indicates
+	 * whether some IO device driver has requested a PM response
+	 * frequency bound, typically due to the device being under
+	 * close to full utilization, which should cause the
+	 * controller to make a more conservative trade-off between
+	 * latency and energy usage, since performance isn't
+	 * guaranteed to scale further with increasing CPU frequency
+	 * whenever the system is close to IO-bound.
+	 *
+	 * Note that the maximum achievable response frequency is
+	 * limited by the sampling frequency of the controller,
+	 * response frequency requests greater than that will be
+	 * promoted to infinity (i.e. no low-pass filtering) in order
+	 * to avoid violating the response frequency constraint
+	 * provided via PM QoS.
+	 */
+	const bool bottleneck_io = stats->max_scaling_response_hz <
+				   vlp->sample_frequency_hz;
+
+	/*
+	 * Calculate the realtime statistic that tracks the
+	 * exponentially-averaged rate of occurrence of
+	 * latency-sensitive events (like wake-ups from IO wait).
+	 */
+	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
+	const int32_t realtime_sample =
+		div_fp((uint64_t)(stats->realtime_count +
+				  (bottleneck_io ? 0 : stats->io_wait_count)) *
+		       NSEC_PER_SEC,
+		       100 * delta_ns);
+	const int32_t alpha = get_last_sample_avg_weight(cpu,
+							 vlp_params.avg_hz);
+	const int32_t realtime_avg = realtime_sample +
+		mul_fp(alpha, last_status->realtime_avg - realtime_sample);
+
+	/* Consume the input statistics. */
+	stats->io_wait_count = 0;
+	stats->realtime_count = 0;
+	if (bottleneck_io)
+		stats->last_scaling_response_hz =
+			stats->max_scaling_response_hz;
+	stats->max_scaling_response_hz = 0;
+
+	/* Update the state of the controller. */
+	last_status->realtime_avg = realtime_avg;
+	last_status->value = (bottleneck_io ? VLP_BOTTLENECK_IO : 0);
+
+	/* Update state used for tracing. */
+	cpu->sample.busy_scaled = int_tofp(stats->max_scaling_response_hz);
+	cpu->iowait_boost = realtime_avg;
+
+	return last_status;
+}
+
+/**
+ * Calculate the target P-state range for the next update period.
+ * Uses a variably low-pass-filtering controller intended to improve
+ * energy efficiency when a CPU response frequency target is specified
+ * via PM QoS (e.g. under IO-bound conditions).
+ */
+static const struct vlp_target_range *get_vlp_target_range(struct cpudata *cpu)
+{
+	struct vlp_data *vlp = &cpu->vlp;
+	struct vlp_target_range *last_target = &vlp->target;
+
+	/*
+	 * P-state limits in fixed-point as allowed by the policy.
+	 */
+	const int32_t p0 = int_tofp(max(cpu->pstate.min_pstate,
+					cpu->min_perf_ratio));
+	const int32_t p1 = int_tofp(cpu->max_perf_ratio);
+
+	/*
+	 * Observed average P-state during the sampling period.	 The
+	 * conservative path (po_cons) uses the TSC increment as
+	 * denominator which will give the minimum (arguably most
+	 * energy-efficient) P-state able to accomplish the observed
+	 * amount of work during the sampling period.
+	 *
+	 * The downside of that somewhat optimistic estimate is that
+	 * it can give a biased result for intermittent
+	 * latency-sensitive workloads, which may have to be completed
+	 * in a short window of time for the system to achieve maximum
+	 * performance, even if the average CPU utilization is low.
+	 * For that reason the aggressive path (po_aggr) uses the
+	 * MPERF increment as denominator, which is approximately
+	 * optimal under the pessimistic assumption that the CPU work
+	 * cannot be parallelized with any other dependent IO work
+	 * that subsequently keeps the CPU idle (partly in C1+
+	 * states).
+	 */
+	const int32_t po_cons =
+		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
+		       * cpu->pstate.max_pstate_physical,
+		       cpu->sample.tsc);
+	const int32_t po_aggr =
+		div_fp((cpu->sample.aperf << cpu->aperf_mperf_shift)
+		       * cpu->pstate.max_pstate_physical,
+		       (cpu->sample.mperf << cpu->aperf_mperf_shift));
+
+	const struct vlp_status_sample *status =
+		get_vlp_status_sample(cpu, po_cons);
+
+	/* Calculate the target P-state. */
+	const int32_t p_tgt_cons = mul_fp(vlp->gain, po_cons);
+	const int32_t p_tgt_aggr = mul_fp(vlp->gain_aggr, po_aggr);
+	const int32_t p_tgt = max(p0, min(p1, max(p_tgt_cons, p_tgt_aggr)));
+
+	/* Calculate the realtime P-state target lower bound. */
+	const int32_t pm = int_tofp(cpu->pstate.max_pstate);
+	const int32_t p_tgt_rt = min(pm, mul_fp(vlp->gain_rt,
+						status->realtime_avg));
+
+	/*
+	 * Low-pass filter the P-state estimate above by exponential
+	 * averaging.  For an oscillating workload (e.g. submitting
+	 * work repeatedly to a device like a soundcard or GPU) this
+	 * will approximate the minimum P-state that would be able to
+	 * accomplish the observed amount of work during the averaging
+	 * period, which is also the optimally energy-efficient one,
+	 * under the assumptions that:
+	 *
+	 *  - The power curve of the system is convex throughout the
+	 *    range of P-states allowed by the policy. I.e. energy
+	 *    efficiency is steadily decreasing with frequency past p0
+	 *    (which is typically close to the maximum-efficiency
+	 *    ratio).  In practice for the lower range of P-states
+	 *    this may only be approximately true due to the
+	 *    interaction between different components of the system.
+	 *
+	 *  - Parallelism constraints of the workload don't prevent it
+	 *    from achieving the same throughput at the lower P-state.
+	 *    This will happen in cases where the application is
+	 *    designed in a way that doesn't allow for dependent CPU
+	 *    and IO jobs to be pipelined, leading to alternating full
+	 *    and zero utilization of the CPU and IO device.  This
+	 *    will give an average IO device utilization lower than
+	 *    100% regardless of the CPU frequency, which should
+	 *    prevent the device driver from requesting a response
+	 *    frequency bound, so the filtered P-state calculated
+	 *    below won't have an influence on the controller
+	 *    response.
+	 *
+	 *  - The period of the oscillating workload is significantly
+	 *    shorter than the time constant of the exponential
+	 *    average (1s / last_scaling_response_hz).  Otherwise for
+	 *    more slowly oscillating workloads the controller
+	 *    response will roughly follow the oscillation, leading to
+	 *    decreased energy efficiency.
+	 *
+	 *  - The behavior of the workload doesn't change
+	 *    qualitatively during the next update interval.  This is
+	 *    only true in the steady state, and could possibly lead
+	 *    to a transitory period in which the controller response
+	 *    deviates from the most energy-efficient ratio until the
+	 *    workload reaches a steady state again.
+	 */
+	const int32_t alpha = get_last_sample_avg_weight(
+		cpu, vlp->stats.last_scaling_response_hz);
+
+	last_target->p_base = p_tgt + mul_fp(alpha,
+					     last_target->p_base - p_tgt);
+
+	/*
+	 * Use the low-pass-filtered controller response for better
+	 * energy efficiency unless we have reasons to believe that
+	 * some of the optimality assumptions discussed above may not
+	 * hold.
+	 */
+	if ((status->value & VLP_BOTTLENECK_IO)) {
+		last_target->value[0] = rnd_fp(p0);
+		last_target->value[1] = rnd_fp(last_target->p_base);
+	} else {
+		last_target->value[0] = rnd_fp(p_tgt_rt);
+		last_target->value[1] = rnd_fp(p1);
+	}
+
+	return last_target;
+}
+
+/**
+ * Collect some scheduling and PM statistics in response to an
+ * update_state() call.
+ */
+static bool update_vlp_sample(struct cpudata *cpu, u64 time, unsigned int flags)
+{
+	struct vlp_input_stats *stats = &cpu->vlp.stats;
+
+	/* Update PM QoS scaling response frequency request. */
+	const uint32_t resp_hz = cpu_scaling_response_qos_limit();
+
+	stats->max_scaling_response_hz = !resp_hz ? UINT_MAX :
+		max(stats->max_scaling_response_hz, resp_hz);
+
+	/* Update scheduling statistics. */
+	if ((flags & SCHED_CPUFREQ_IOWAIT))
+		stats->io_wait_count++;
+
+	if (cpu_rq(cpu->cpu)->rt.rt_nr_running)
+		stats->realtime_count++;
+
+	/* Return whether a P-state update is due. */
+	return smp_processor_id() == cpu->cpu &&
+		time - cpu->sample.time >= cpu->vlp.sample_interval_ns &&
+		intel_pstate_sample(cpu, time);
+}
+
 static int intel_pstate_prepare_request(struct cpudata *cpu, int pstate)
 {
 	int min_pstate = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

This implements a simple variably low-pass-filtering governor in
control of the HWP MIN/MAX PERF range based on the previously
introduced get_vlp_target_range().  See "cpufreq: intel_pstate:
Implement VLP controller statistics and target range calculation." for
the rationale.

v3: Fix accidental overwrite of boot EPP setting in
    intel_pstate_reset_vlp() (Srinivas).  Rename
    CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).  Define
    generic CPUFREQ policy to control VLP governor (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 92 ++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d13eb95a6271..0a315f681c43 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1913,6 +1913,20 @@ static void intel_pstate_reset_vlp(struct cpudata *cpu)
 	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
 	vlp->target.p_base = 0;
 	vlp->stats.last_scaling_response_hz = vlp_params.avg_hz;
+
+	if (hwp_active) {
+		const uint32_t p0 = max(cpu->pstate.min_pstate,
+					cpu->min_perf_ratio);
+		const uint32_t p1 = max_t(uint32_t, p0, cpu->max_perf_ratio);
+		const uint64_t hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
+					  ~(HWP_MAX_PERF(~0L) |
+					    HWP_MIN_PERF(~0L) |
+					    HWP_DESIRED_PERF(~0L))) |
+					 HWP_MIN_PERF(p0) | HWP_MAX_PERF(p1);
+
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
+		WRITE_ONCE(cpu->hwp_req_cached, hwp_req);
+	}
 }
 
 /**
@@ -2230,6 +2244,46 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
+					     const unsigned int range[])
+{
+	const uint64_t hwp_req_cached = READ_ONCE(cpu->hwp_req_cached);
+	unsigned int p0, p1, p_min, p_max;
+	struct sample *sample;
+	uint64_t hwp_req;
+
+	update_turbo_state();
+
+	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
+	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
+	p_min = clamp_t(unsigned int, range[0], p0, p1);
+	p_max = clamp_t(unsigned int, range[1], p0, p1);
+
+	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu->cpu);
+
+	hwp_req = (hwp_req_cached &
+		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
+		     HWP_DESIRED_PERF(~0L))) |
+		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 : p_min) |
+		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 : p_max);
+
+	if (hwp_req != hwp_req_cached) {
+		wrmsrl(MSR_HWP_REQUEST, hwp_req);
+		WRITE_ONCE(cpu->hwp_req_cached, hwp_req);
+	}
+
+	sample = &cpu->sample;
+	trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
+			    fp_toint(sample->busy_scaled),
+			    hwp_req_cached,
+			    hwp_req,
+			    sample->mperf,
+			    sample->aperf,
+			    sample->tsc,
+			    get_avg_frequency(cpu),
+			    fp_toint(cpu->iowait_boost * 100));
+}
+
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 				     unsigned int flags)
 {
@@ -2268,6 +2322,22 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 		intel_pstate_adjust_pstate(cpu);
 }
 
+/**
+ * Implementation of the cpufreq update_util hook based on the VLP
+ * controller (see get_vlp_target_range()).
+ */
+static void intel_pstate_update_util_hwp_vlp(struct update_util_data *data,
+					     u64 time, unsigned int flags)
+{
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+
+	if (update_vlp_sample(cpu, time, flags)) {
+		const struct vlp_target_range *target =
+			get_vlp_target_range(cpu);
+		intel_pstate_adjust_pstate_range(cpu, target->value);
+	}
+}
+
 static struct pstate_funcs core_funcs = {
 	.get_max = core_get_max_pstate,
 	.get_max_physical = core_get_max_pstate_physical,
@@ -2418,15 +2488,25 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 		intel_pstate_clear_update_util_hook(cpu_num);
 
 	if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE ||
-	    (hwp_active && !hwp_boost))
+	    (cpu->policy != CPUFREQ_POLICY_ADAPTIVE &&
+	     hwp_active && !hwp_boost))
 		return;
 
 	/* Prevent intel_pstate_update_util() from using stale data. */
 	cpu->sample.time = 0;
-	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
-				     (hwp_active ?
-				      intel_pstate_update_util_hwp :
-				      intel_pstate_update_util));
+
+	if (cpu->policy == CPUFREQ_POLICY_ADAPTIVE) {
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     (hwp_active ?
+					      intel_pstate_update_util_hwp_vlp :
+					      intel_pstate_update_util));
+	} else {
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     (hwp_active ?
+					      intel_pstate_update_util_hwp :
+					      intel_pstate_update_util));
+	}
+
 	cpu->update_util_set = true;
 }
 
@@ -2529,6 +2609,8 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
 	if (hwp_active)
 		intel_pstate_hwp_set(policy->cpu);
 
+	intel_pstate_reset_vlp(cpu);
+
 	mutex_unlock(&intel_pstate_limits_lock);
 
 	return 0;
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

This implements a simple variably low-pass-filtering governor in
control of the HWP MIN/MAX PERF range based on the previously
introduced get_vlp_target_range().  See "cpufreq: intel_pstate:
Implement VLP controller statistics and target range calculation." for
the rationale.

v3: Fix accidental overwrite of boot EPP setting in
    intel_pstate_reset_vlp() (Srinivas).  Rename
    CPU_RESPONSE_FREQUENCY to CPU_SCALING_RESPONSE (Rafael).  Define
    generic CPUFREQ policy to control VLP governor (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 92 ++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d13eb95a6271..0a315f681c43 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1913,6 +1913,20 @@ static void intel_pstate_reset_vlp(struct cpudata *cpu)
 	vlp->gain = max(1, div_fp(1000, vlp_params.setpoint_0_pml));
 	vlp->target.p_base = 0;
 	vlp->stats.last_scaling_response_hz = vlp_params.avg_hz;
+
+	if (hwp_active) {
+		const uint32_t p0 = max(cpu->pstate.min_pstate,
+					cpu->min_perf_ratio);
+		const uint32_t p1 = max_t(uint32_t, p0, cpu->max_perf_ratio);
+		const uint64_t hwp_req = (READ_ONCE(cpu->hwp_req_cached) &
+					  ~(HWP_MAX_PERF(~0L) |
+					    HWP_MIN_PERF(~0L) |
+					    HWP_DESIRED_PERF(~0L))) |
+					 HWP_MIN_PERF(p0) | HWP_MAX_PERF(p1);
+
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, hwp_req);
+		WRITE_ONCE(cpu->hwp_req_cached, hwp_req);
+	}
 }
 
 /**
@@ -2230,6 +2244,46 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
+					     const unsigned int range[])
+{
+	const uint64_t hwp_req_cached = READ_ONCE(cpu->hwp_req_cached);
+	unsigned int p0, p1, p_min, p_max;
+	struct sample *sample;
+	uint64_t hwp_req;
+
+	update_turbo_state();
+
+	p0 = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
+	p1 = max_t(unsigned int, p0, cpu->max_perf_ratio);
+	p_min = clamp_t(unsigned int, range[0], p0, p1);
+	p_max = clamp_t(unsigned int, range[1], p0, p1);
+
+	trace_cpu_frequency(p_max * cpu->pstate.scaling, cpu->cpu);
+
+	hwp_req = (hwp_req_cached &
+		   ~(HWP_MAX_PERF(~0L) | HWP_MIN_PERF(~0L) |
+		     HWP_DESIRED_PERF(~0L))) |
+		  HWP_MIN_PERF(vlp_params.debug & 2 ? p0 : p_min) |
+		  HWP_MAX_PERF(vlp_params.debug & 4 ? p1 : p_max);
+
+	if (hwp_req != hwp_req_cached) {
+		wrmsrl(MSR_HWP_REQUEST, hwp_req);
+		WRITE_ONCE(cpu->hwp_req_cached, hwp_req);
+	}
+
+	sample = &cpu->sample;
+	trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
+			    fp_toint(sample->busy_scaled),
+			    hwp_req_cached,
+			    hwp_req,
+			    sample->mperf,
+			    sample->aperf,
+			    sample->tsc,
+			    get_avg_frequency(cpu),
+			    fp_toint(cpu->iowait_boost * 100));
+}
+
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 				     unsigned int flags)
 {
@@ -2268,6 +2322,22 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 		intel_pstate_adjust_pstate(cpu);
 }
 
+/**
+ * Implementation of the cpufreq update_util hook based on the VLP
+ * controller (see get_vlp_target_range()).
+ */
+static void intel_pstate_update_util_hwp_vlp(struct update_util_data *data,
+					     u64 time, unsigned int flags)
+{
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+
+	if (update_vlp_sample(cpu, time, flags)) {
+		const struct vlp_target_range *target =
+			get_vlp_target_range(cpu);
+		intel_pstate_adjust_pstate_range(cpu, target->value);
+	}
+}
+
 static struct pstate_funcs core_funcs = {
 	.get_max = core_get_max_pstate,
 	.get_max_physical = core_get_max_pstate_physical,
@@ -2418,15 +2488,25 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 		intel_pstate_clear_update_util_hook(cpu_num);
 
 	if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE ||
-	    (hwp_active && !hwp_boost))
+	    (cpu->policy != CPUFREQ_POLICY_ADAPTIVE &&
+	     hwp_active && !hwp_boost))
 		return;
 
 	/* Prevent intel_pstate_update_util() from using stale data. */
 	cpu->sample.time = 0;
-	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
-				     (hwp_active ?
-				      intel_pstate_update_util_hwp :
-				      intel_pstate_update_util));
+
+	if (cpu->policy == CPUFREQ_POLICY_ADAPTIVE) {
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     (hwp_active ?
+					      intel_pstate_update_util_hwp_vlp :
+					      intel_pstate_update_util));
+	} else {
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     (hwp_active ?
+					      intel_pstate_update_util_hwp :
+					      intel_pstate_update_util));
+	}
+
 	cpu->update_util_set = true;
 }
 
@@ -2529,6 +2609,8 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
 	if (hwp_active)
 		intel_pstate_hwp_set(policy->cpu);
 
+	intel_pstate_reset_vlp(cpu);
+
 	mutex_unlock(&intel_pstate_limits_lock);
 
 	return 0;
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

For the moment the VLP controller is only enabled on ICL platforms
other than server FADT profiles in order to reduce the validation
effort of the initial submission.  It should work on any other
processors that support HWP though (and soon enough on non-HWP too):
In order to override the default behavior (e.g. to test on other
platforms) the VLP controller can be forcefully enabled or disabled by
selecting the "adaptive" or "powersave" CPUFREQ governors respectively
via sysfs.

v2: Handle HWP VLP controller.
v3: Define generic CPUFREQ policy to control VLP governor (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 0a315f681c43..2458a821195f 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -396,6 +396,7 @@ static int hwp_active __read_mostly;
 static int hwp_mode_bdw __read_mostly;
 static bool per_cpu_limits __read_mostly;
 static bool hwp_boost __read_mostly;
+static bool vlp __read_mostly;
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;
 
@@ -2724,7 +2725,8 @@ static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
 	 * Set the policy to powersave to provide a valid fallback value in case
 	 * the default cpufreq governor is neither powersave nor performance.
 	 */
-	policy->policy = CPUFREQ_POLICY_POWERSAVE;
+	policy->policy = (vlp ? CPUFREQ_POLICY_ADAPTIVE :
+			  CPUFREQ_POLICY_POWERSAVE);
 
 	return 0;
 }
@@ -3209,6 +3211,16 @@ static const struct x86_cpu_id hwp_support_ids[] __initconst = {
 	{}
 };
 
+#define X86_MATCH_VLP(model)                                            \
+	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, INTEL_FAM6_##model, \
+					   X86_FEATURE_APERFMPERF, 0)
+
+static const struct x86_cpu_id vlp_default_ids[] __initconst = {
+	X86_MATCH_VLP(ICELAKE),
+	X86_MATCH_VLP(ICELAKE_L),
+	{}
+};
+
 static int __init intel_pstate_init(void)
 {
 	const struct x86_cpu_id *id;
@@ -3247,6 +3259,10 @@ static int __init intel_pstate_init(void)
 	default_driver = &intel_cpufreq;
 
 hwp_cpu_matched:
+	/* Enable VLP controller by default. */
+	vlp = !intel_pstate_acpi_pm_profile_server() &&
+	      x86_match_cpu(vlp_default_ids) && hwp_active;
+
 	/*
 	 * The Intel pstate driver will be ignored if the platform
 	 * firmware has its own power management modes.
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

For the moment the VLP controller is only enabled on ICL platforms
other than server FADT profiles in order to reduce the validation
effort of the initial submission.  It should work on any other
processors that support HWP though (and soon enough on non-HWP too):
In order to override the default behavior (e.g. to test on other
platforms) the VLP controller can be forcefully enabled or disabled by
selecting the "adaptive" or "powersave" CPUFREQ governors respectively
via sysfs.

v2: Handle HWP VLP controller.
v3: Define generic CPUFREQ policy to control VLP governor (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 0a315f681c43..2458a821195f 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -396,6 +396,7 @@ static int hwp_active __read_mostly;
 static int hwp_mode_bdw __read_mostly;
 static bool per_cpu_limits __read_mostly;
 static bool hwp_boost __read_mostly;
+static bool vlp __read_mostly;
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;
 
@@ -2724,7 +2725,8 @@ static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
 	 * Set the policy to powersave to provide a valid fallback value in case
 	 * the default cpufreq governor is neither powersave nor performance.
 	 */
-	policy->policy = CPUFREQ_POLICY_POWERSAVE;
+	policy->policy = (vlp ? CPUFREQ_POLICY_ADAPTIVE :
+			  CPUFREQ_POLICY_POWERSAVE);
 
 	return 0;
 }
@@ -3209,6 +3211,16 @@ static const struct x86_cpu_id hwp_support_ids[] __initconst = {
 	{}
 };
 
+#define X86_MATCH_VLP(model)                                            \
+	X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, INTEL_FAM6_##model, \
+					   X86_FEATURE_APERFMPERF, 0)
+
+static const struct x86_cpu_id vlp_default_ids[] __initconst = {
+	X86_MATCH_VLP(ICELAKE),
+	X86_MATCH_VLP(ICELAKE_L),
+	{}
+};
+
 static int __init intel_pstate_init(void)
 {
 	const struct x86_cpu_id *id;
@@ -3247,6 +3259,10 @@ static int __init intel_pstate_init(void)
 	default_driver = &intel_cpufreq;
 
 hwp_cpu_matched:
+	/* Enable VLP controller by default. */
+	vlp = !intel_pstate_acpi_pm_profile_server() &&
+	      x86_match_cpu(vlp_default_ids) && hwp_active;
+
 	/*
 	 * The Intel pstate driver will be ignored if the platform
 	 * firmware has its own power management modes.
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, Peter Zijlstra

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c |  9 ++++++---
 include/trace/events/power.h   | 13 +++++++++----
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 2458a821195f..dd86505d7855 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2242,7 +2242,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		fp_toint(cpu->iowait_boost * 100),
+		cpu->vlp.status.value);
 }
 
 static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
@@ -2282,7 +2283,8 @@ static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
 			    sample->aperf,
 			    sample->tsc,
 			    get_avg_frequency(cpu),
-			    fp_toint(cpu->iowait_boost * 100));
+			    fp_toint(cpu->iowait_boost * 100),
+			    cpu->vlp.status.value);
 }
 
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
@@ -2789,7 +2791,8 @@ static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, in
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		fp_toint(cpu->iowait_boost * 100),
+		0);
 }
 
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 9065a3b084c4..57153d82547c 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -72,7 +72,8 @@ TRACE_EVENT(pstate_sample,
 		u64 aperf,
 		u64 tsc,
 		u32 freq,
-		u32 io_boost
+		u32 io_boost,
+		u32 vlp_status
 		),
 
 	TP_ARGS(core_busy,
@@ -83,7 +84,8 @@ TRACE_EVENT(pstate_sample,
 		aperf,
 		tsc,
 		freq,
-		io_boost
+		io_boost,
+		vlp_status
 		),
 
 	TP_STRUCT__entry(
@@ -96,6 +98,7 @@ TRACE_EVENT(pstate_sample,
 		__field(u64, tsc)
 		__field(u32, freq)
 		__field(u32, io_boost)
+		__field(u32, vlp_status)
 		),
 
 	TP_fast_assign(
@@ -108,9 +111,10 @@ TRACE_EVENT(pstate_sample,
 		__entry->tsc = tsc;
 		__entry->freq = freq;
 		__entry->io_boost = io_boost;
+		__entry->vlp_status = vlp_status;
 		),
 
-	TP_printk("core_busy=%lu scaled=%lu from=%lu to=%lu mperf=%llu aperf=%llu tsc=%llu freq=%lu io_boost=%lu",
+	TP_printk("core_busy=%lu scaled=%lu from=%lu to=%lu mperf=%llu aperf=%llu tsc=%llu freq=%lu io_boost=%lu vlp=%lu",
 		(unsigned long)__entry->core_busy,
 		(unsigned long)__entry->scaled_busy,
 		(unsigned long)__entry->from,
@@ -119,7 +123,8 @@ TRACE_EVENT(pstate_sample,
 		(unsigned long long)__entry->aperf,
 		(unsigned long long)__entry->tsc,
 		(unsigned long)__entry->freq,
-		(unsigned long)__entry->io_boost
+		(unsigned long)__entry->io_boost,
+		(unsigned long)__entry->vlp_status
 		)
 
 );
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: Peter Zijlstra, intel-gfx, chris.p.wilson, linux-pm

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c |  9 ++++++---
 include/trace/events/power.h   | 13 +++++++++----
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 2458a821195f..dd86505d7855 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2242,7 +2242,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		fp_toint(cpu->iowait_boost * 100),
+		cpu->vlp.status.value);
 }
 
 static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
@@ -2282,7 +2283,8 @@ static void intel_pstate_adjust_pstate_range(struct cpudata *cpu,
 			    sample->aperf,
 			    sample->tsc,
 			    get_avg_frequency(cpu),
-			    fp_toint(cpu->iowait_boost * 100));
+			    fp_toint(cpu->iowait_boost * 100),
+			    cpu->vlp.status.value);
 }
 
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
@@ -2789,7 +2791,8 @@ static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, in
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		fp_toint(cpu->iowait_boost * 100),
+		0);
 }
 
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 9065a3b084c4..57153d82547c 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -72,7 +72,8 @@ TRACE_EVENT(pstate_sample,
 		u64 aperf,
 		u64 tsc,
 		u32 freq,
-		u32 io_boost
+		u32 io_boost,
+		u32 vlp_status
 		),
 
 	TP_ARGS(core_busy,
@@ -83,7 +84,8 @@ TRACE_EVENT(pstate_sample,
 		aperf,
 		tsc,
 		freq,
-		io_boost
+		io_boost,
+		vlp_status
 		),
 
 	TP_STRUCT__entry(
@@ -96,6 +98,7 @@ TRACE_EVENT(pstate_sample,
 		__field(u64, tsc)
 		__field(u32, freq)
 		__field(u32, io_boost)
+		__field(u32, vlp_status)
 		),
 
 	TP_fast_assign(
@@ -108,9 +111,10 @@ TRACE_EVENT(pstate_sample,
 		__entry->tsc = tsc;
 		__entry->freq = freq;
 		__entry->io_boost = io_boost;
+		__entry->vlp_status = vlp_status;
 		),
 
-	TP_printk("core_busy=%lu scaled=%lu from=%lu to=%lu mperf=%llu aperf=%llu tsc=%llu freq=%lu io_boost=%lu",
+	TP_printk("core_busy=%lu scaled=%lu from=%lu to=%lu mperf=%llu aperf=%llu tsc=%llu freq=%lu io_boost=%lu vlp=%lu",
 		(unsigned long)__entry->core_busy,
 		(unsigned long)__entry->scaled_busy,
 		(unsigned long)__entry->from,
@@ -119,7 +123,8 @@ TRACE_EVENT(pstate_sample,
 		(unsigned long long)__entry->aperf,
 		(unsigned long long)__entry->tsc,
 		(unsigned long)__entry->freq,
-		(unsigned long)__entry->io_boost
+		(unsigned long)__entry->io_boost,
+		(unsigned long)__entry->vlp_status
 		)
 
 );
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-04-28  3:22   ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo,
	Peter Zijlstra, Fengguang Wu, Julia Lawall

This is not required for the controller to work but has proven very
useful for debugging and testing of alternative heuristic parameters,
which may offer a better trade-off between energy efficiency and
latency.  A warning is printed out which should taint the kernel for
the non-standard calibration of the heuristic to be obvious in bug
reports.

v2: Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
    for debugfs files (Julia).  Add realtime statistic threshold and
    averaging frequency parameters.
v3: Define generic CPUFREQ policy to control VLP governor (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
---
 drivers/cpufreq/intel_pstate.c | 86 ++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index dd86505d7855..ab0334a99039 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1031,6 +1031,88 @@ static void intel_pstate_update_limits(unsigned int cpu)
 	mutex_unlock(&intel_pstate_driver_lock);
 }
 
+/************************** debugfs begin ************************/
+static void intel_pstate_reset_vlp(struct cpudata *cpu);
+
+static int vlp_param_set(void *data, u64 val)
+{
+	unsigned int cpu;
+
+	*(u32 *)data = val;
+	for_each_possible_cpu(cpu) {
+		if (all_cpu_data[cpu])
+			intel_pstate_reset_vlp(all_cpu_data[cpu]);
+	}
+
+	WARN_ONCE(1, "Unsupported P-state VLP parameter update via debugging interface");
+
+	return 0;
+}
+
+static int vlp_param_get(void *data, u64 *val)
+{
+	*val = *(u32 *)data;
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(fops_vlp_param, vlp_param_get, vlp_param_set,
+			 "%llu\n");
+
+static struct dentry *debugfs_parent;
+
+struct vlp_param {
+	char *name;
+	void *value;
+	struct dentry *dentry;
+};
+
+static struct vlp_param vlp_files[] = {
+	{"vlp_sample_interval_ms", &vlp_params.sample_interval_ms, },
+	{"vlp_setpoint_0_pml", &vlp_params.setpoint_0_pml, },
+	{"vlp_setpoint_aggr_pml", &vlp_params.setpoint_aggr_pml, },
+	{"vlp_avg_hz", &vlp_params.avg_hz, },
+	{"vlp_realtime_gain_pml", &vlp_params.realtime_gain_pml, },
+	{"vlp_debug", &vlp_params.debug, },
+	{NULL, NULL, }
+};
+
+static void intel_pstate_debug_expose_params(void)
+{
+	int i;
+
+	debugfs_parent = debugfs_create_dir("pstate_snb", NULL);
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; vlp_files[i].name; i++) {
+		struct dentry *dentry;
+
+		dentry = debugfs_create_file_unsafe(vlp_files[i].name, 0660,
+						    debugfs_parent,
+						    vlp_files[i].value,
+						    &fops_vlp_param);
+		if (!IS_ERR(dentry))
+			vlp_files[i].dentry = dentry;
+	}
+}
+
+static void intel_pstate_debug_hide_params(void)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; vlp_files[i].name; i++) {
+		debugfs_remove(vlp_files[i].dentry);
+		vlp_files[i].dentry = NULL;
+	}
+
+	debugfs_remove(debugfs_parent);
+	debugfs_parent = NULL;
+}
+
+/************************** debugfs end ************************/
+
 /************************** sysfs begin ************************/
 #define show_one(file_name, object)					\
 	static ssize_t show_##file_name					\
@@ -2977,6 +3059,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	intel_pstate_debug_expose_params();
+
 	return 0;
 }
 
@@ -2985,6 +3069,8 @@ static int intel_pstate_unregister_driver(void)
 	if (hwp_active)
 		return -EBUSY;
 
+	intel_pstate_debug_hide_params();
+
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
-- 
2.22.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.
@ 2020-04-28  3:22   ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-04-28  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pandruvada, Srinivas
  Cc: linux-pm, Peter Zijlstra, intel-gfx, chris.p.wilson,
	Julia Lawall, Fengguang Wu

This is not required for the controller to work but has proven very
useful for debugging and testing of alternative heuristic parameters,
which may offer a better trade-off between energy efficiency and
latency.  A warning is printed out which should taint the kernel for
the non-standard calibration of the heuristic to be obvious in bug
reports.

v2: Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
    for debugfs files (Julia).  Add realtime statistic threshold and
    averaging frequency parameters.
v3: Define generic CPUFREQ policy to control VLP governor (Rafael).

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
---
 drivers/cpufreq/intel_pstate.c | 86 ++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index dd86505d7855..ab0334a99039 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1031,6 +1031,88 @@ static void intel_pstate_update_limits(unsigned int cpu)
 	mutex_unlock(&intel_pstate_driver_lock);
 }
 
+/************************** debugfs begin ************************/
+static void intel_pstate_reset_vlp(struct cpudata *cpu);
+
+static int vlp_param_set(void *data, u64 val)
+{
+	unsigned int cpu;
+
+	*(u32 *)data = val;
+	for_each_possible_cpu(cpu) {
+		if (all_cpu_data[cpu])
+			intel_pstate_reset_vlp(all_cpu_data[cpu]);
+	}
+
+	WARN_ONCE(1, "Unsupported P-state VLP parameter update via debugging interface");
+
+	return 0;
+}
+
+static int vlp_param_get(void *data, u64 *val)
+{
+	*val = *(u32 *)data;
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(fops_vlp_param, vlp_param_get, vlp_param_set,
+			 "%llu\n");
+
+static struct dentry *debugfs_parent;
+
+struct vlp_param {
+	char *name;
+	void *value;
+	struct dentry *dentry;
+};
+
+static struct vlp_param vlp_files[] = {
+	{"vlp_sample_interval_ms", &vlp_params.sample_interval_ms, },
+	{"vlp_setpoint_0_pml", &vlp_params.setpoint_0_pml, },
+	{"vlp_setpoint_aggr_pml", &vlp_params.setpoint_aggr_pml, },
+	{"vlp_avg_hz", &vlp_params.avg_hz, },
+	{"vlp_realtime_gain_pml", &vlp_params.realtime_gain_pml, },
+	{"vlp_debug", &vlp_params.debug, },
+	{NULL, NULL, }
+};
+
+static void intel_pstate_debug_expose_params(void)
+{
+	int i;
+
+	debugfs_parent = debugfs_create_dir("pstate_snb", NULL);
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; vlp_files[i].name; i++) {
+		struct dentry *dentry;
+
+		dentry = debugfs_create_file_unsafe(vlp_files[i].name, 0660,
+						    debugfs_parent,
+						    vlp_files[i].value,
+						    &fops_vlp_param);
+		if (!IS_ERR(dentry))
+			vlp_files[i].dentry = dentry;
+	}
+}
+
+static void intel_pstate_debug_hide_params(void)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; vlp_files[i].name; i++) {
+		debugfs_remove(vlp_files[i].dentry);
+		vlp_files[i].dentry = NULL;
+	}
+
+	debugfs_remove(debugfs_parent);
+	debugfs_parent = NULL;
+}
+
+/************************** debugfs end ************************/
+
 /************************** sysfs begin ************************/
 #define show_one(file_name, object)					\
 	static ssize_t show_##file_name					\
@@ -2977,6 +3059,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	intel_pstate_debug_expose_params();
+
 	return 0;
 }
 
@@ -2985,6 +3069,8 @@ static int intel_pstate_unregister_driver(void)
 	if (hwp_active)
 		return -EBUSY;
 
+	intel_pstate_debug_hide_params();
+
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
-- 
2.22.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [PATCHv2.99,01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
                   ` (11 preceding siblings ...)
  (?)
@ 2020-04-28  3:32 ` Patchwork
  -1 siblings, 0 replies; 41+ messages in thread
From: Patchwork @ 2020-04-28  3:32 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: intel-gfx

== Series Details ==

Series: series starting with [PATCHv2.99,01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
URL   : https://patchwork.freedesktop.org/series/76576/
State : failure

== Summary ==

Applying: PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
Applying: drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/gt/intel_lrc.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
@ 2020-05-11 10:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2020-05-11 10:57 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: Rafael J. Wysocki, Pandruvada, Srinivas, linux-pm, intel-gfx,
	chris.p.wilson, Vivi, Rodrigo, rui.zhang, daniel.lezcano,
	amit.kucheria

On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
> This addresses the technical concerns people brought up about my
> previous v2 revision of this series.  Other than a few bug fixes, the
> only major change relative to v2 is that the controller is now exposed
> as a new CPUFREQ generic governor as requested by Rafael (named
> "adaptive" in this RFC though other naming suggestions are welcome).
> Main reason for calling this v2.99 rather than v3 is that I haven't
> yet addressed all the documentation requests from the v2 thread --
> Will spend some time doing that as soon as I have an ACK (ideally from
> Rafael) that things are moving in the right direction.
> 
> You can also find this series along with the WIP code for non-HWP
> platforms in this branch:
> 
> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
> 
> Thanks!
> 
> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.

What I'm missing is an explanation for why this isn't using the
infrastructure that was build for these kinds of things? The thermal
framework, was AFAIU, supposed to help with these things, and the IPA
thing in particular is used by ARM to do exactly this GPU/CPU power
budget thing.

If thermal/IPA is found wanting, why aren't we improving that?

How much of that ADAPTIVE crud is actually intel_pstate specific? On a
(really) quick read it appears to me that much of the controller bits
there can be applied more generic, and thus should not be part of any
one governor.

Specifically, I want to use sched_util as cpufreq governor and use the
intel_pstate as a passive driver.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-11 10:57   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2020-05-11 10:57 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: amit.kucheria, linux-pm, intel-gfx, daniel.lezcano,
	Rafael J. Wysocki, chris.p.wilson, Pandruvada, Srinivas,
	rui.zhang

On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
> This addresses the technical concerns people brought up about my
> previous v2 revision of this series.  Other than a few bug fixes, the
> only major change relative to v2 is that the controller is now exposed
> as a new CPUFREQ generic governor as requested by Rafael (named
> "adaptive" in this RFC though other naming suggestions are welcome).
> Main reason for calling this v2.99 rather than v3 is that I haven't
> yet addressed all the documentation requests from the v2 thread --
> Will spend some time doing that as soon as I have an ACK (ideally from
> Rafael) that things are moving in the right direction.
> 
> You can also find this series along with the WIP code for non-HWP
> platforms in this branch:
> 
> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
> 
> Thanks!
> 
> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.

What I'm missing is an explanation for why this isn't using the
infrastructure that was build for these kinds of things? The thermal
framework, was AFAIU, supposed to help with these things, and the IPA
thing in particular is used by ARM to do exactly this GPU/CPU power
budget thing.

If thermal/IPA is found wanting, why aren't we improving that?

How much of that ADAPTIVE crud is actually intel_pstate specific? On a
(really) quick read it appears to me that much of the controller bits
there can be applied more generic, and thus should not be part of any
one governor.

Specifically, I want to use sched_util as cpufreq governor and use the
intel_pstate as a passive driver.


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-05-11 10:57   ` [Intel-gfx] " Peter Zijlstra
@ 2020-05-11 21:01     ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-05-11 21:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Pandruvada, Srinivas, linux-pm, intel-gfx,
	chris.p.wilson, Vivi, Rodrigo, rui.zhang, daniel.lezcano,
	amit.kucheria


[-- Attachment #1.1: Type: text/plain, Size: 4048 bytes --]

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
>> This addresses the technical concerns people brought up about my
>> previous v2 revision of this series.  Other than a few bug fixes, the
>> only major change relative to v2 is that the controller is now exposed
>> as a new CPUFREQ generic governor as requested by Rafael (named
>> "adaptive" in this RFC though other naming suggestions are welcome).
>> Main reason for calling this v2.99 rather than v3 is that I haven't
>> yet addressed all the documentation requests from the v2 thread --
>> Will spend some time doing that as soon as I have an ACK (ideally from
>> Rafael) that things are moving in the right direction.
>> 
>> You can also find this series along with the WIP code for non-HWP
>> platforms in this branch:
>> 
>> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
>> 
>> Thanks!
>> 
>> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
>> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
>> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
>> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
>> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
>> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
>> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
>> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
>> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
>> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
>> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.
>
> What I'm missing is an explanation for why this isn't using the
> infrastructure that was build for these kinds of things? The thermal
> framework, was AFAIU, supposed to help with these things, and the IPA
> thing in particular is used by ARM to do exactly this GPU/CPU power
> budget thing.
>
> If thermal/IPA is found wanting, why aren't we improving that?

The GPU/CPU power budget "thing" is only a positive side effect of this
series on some TDP-bound systems.  Its ultimate purpose is improving the
energy efficiency of workloads which have a bottleneck on a device other
than the CPU, by giving the bottlenecking device driver some influence
over the response latency of CPUFREQ governors via a PM QoS interface.
This seems to be completely outside the scope of the thermal framework
and IPA AFAIU.

>
> How much of that ADAPTIVE crud is actually intel_pstate specific? On a
> (really) quick read it appears to me that much of the controller bits
> there can be applied more generic, and thus should not be part of any
> one governor.
>

The implementation of that is intel_pstate-specific right now, but the
basic algorithm could be made to work on any other governor in
principle, which is why it is exposed as a generic CPUFREQ governor.  I
don't care about taking out the generic CPUFREQ governor changes if you
don't like them, and going back to some driver-specific means of turning
it on and off (though Rafael might disagree with that).

> Specifically, I want to use sched_util as cpufreq governor and use the
> intel_pstate as a passive driver.

Yeah, getting a similar optimization into the schedutil governor has
been on my wish list for a while, but I haven't had the time to get very
far on that except for a handful of hacks.  The intel_pstate handling is
going to be necessary anyway in order to handle HWP systems gracefully,
at least in the near future until schedutil becomes a viable alternative
to intel_pstate in active mode on HWP systems.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-11 21:01     ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-05-11 21:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: amit.kucheria, linux-pm, intel-gfx, daniel.lezcano,
	Rafael J. Wysocki, chris.p.wilson, Pandruvada, Srinivas,
	rui.zhang


[-- Attachment #1.1.1: Type: text/plain, Size: 4048 bytes --]

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
>> This addresses the technical concerns people brought up about my
>> previous v2 revision of this series.  Other than a few bug fixes, the
>> only major change relative to v2 is that the controller is now exposed
>> as a new CPUFREQ generic governor as requested by Rafael (named
>> "adaptive" in this RFC though other naming suggestions are welcome).
>> Main reason for calling this v2.99 rather than v3 is that I haven't
>> yet addressed all the documentation requests from the v2 thread --
>> Will spend some time doing that as soon as I have an ACK (ideally from
>> Rafael) that things are moving in the right direction.
>> 
>> You can also find this series along with the WIP code for non-HWP
>> platforms in this branch:
>> 
>> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
>> 
>> Thanks!
>> 
>> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
>> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load.
>> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
>> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
>> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook().
>> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook.
>> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation.
>> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
>> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
>> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
>> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.
>
> What I'm missing is an explanation for why this isn't using the
> infrastructure that was build for these kinds of things? The thermal
> framework, was AFAIU, supposed to help with these things, and the IPA
> thing in particular is used by ARM to do exactly this GPU/CPU power
> budget thing.
>
> If thermal/IPA is found wanting, why aren't we improving that?

The GPU/CPU power budget "thing" is only a positive side effect of this
series on some TDP-bound systems.  Its ultimate purpose is improving the
energy efficiency of workloads which have a bottleneck on a device other
than the CPU, by giving the bottlenecking device driver some influence
over the response latency of CPUFREQ governors via a PM QoS interface.
This seems to be completely outside the scope of the thermal framework
and IPA AFAIU.

>
> How much of that ADAPTIVE crud is actually intel_pstate specific? On a
> (really) quick read it appears to me that much of the controller bits
> there can be applied more generic, and thus should not be part of any
> one governor.
>

The implementation of that is intel_pstate-specific right now, but the
basic algorithm could be made to work on any other governor in
principle, which is why it is exposed as a generic CPUFREQ governor.  I
don't care about taking out the generic CPUFREQ governor changes if you
don't like them, and going back to some driver-specific means of turning
it on and off (though Rafael might disagree with that).

> Specifically, I want to use sched_util as cpufreq governor and use the
> intel_pstate as a passive driver.

Yeah, getting a similar optimization into the schedutil governor has
been on my wish list for a while, but I haven't had the time to get very
far on that except for a handful of hacks.  The intel_pstate handling is
going to be necessary anyway in order to handle HWP systems gracefully,
at least in the near future until schedutil becomes a viable alternative
to intel_pstate in active mode on HWP systems.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-05-11 21:01     ` [Intel-gfx] " Francisco Jerez
@ 2020-05-14 10:26       ` Rafael J. Wysocki
  -1 siblings, 0 replies; 41+ messages in thread
From: Rafael J. Wysocki @ 2020-05-14 10:26 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: Peter Zijlstra, Pandruvada, Srinivas, linux-pm, intel-gfx,
	chris.p.wilson, Vivi, Rodrigo, rui.zhang, daniel.lezcano,
	amit.kucheria

On Monday, May 11, 2020 11:01:41 PM CEST Francisco Jerez wrote:
> 
> --==-=-=
> Content-Type: multipart/mixed; boundary="=-=-="
> 
> --=-=-=
> Content-Type: text/plain; charset=utf-8
> Content-Disposition: inline
> Content-Transfer-Encoding: quoted-printable
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
> >> This addresses the technical concerns people brought up about my
> >> previous v2 revision of this series.  Other than a few bug fixes, the
> >> only major change relative to v2 is that the controller is now exposed
> >> as a new CPUFREQ generic governor as requested by Rafael (named
> >> "adaptive" in this RFC though other naming suggestions are welcome).
> >> Main reason for calling this v2.99 rather than v3 is that I haven't
> >> yet addressed all the documentation requests from the v2 thread --
> >> Will spend some time doing that as soon as I have an ACK (ideally from
> >> Rafael) that things are moving in the right direction.
> >>=20
> >> You can also find this series along with the WIP code for non-HWP
> >> platforms in this branch:
> >>=20
> >> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
> >>=20
> >> Thanks!
> >>=20
> >> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
> >> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency ba=
> sed on GPU load.
> >> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters =
> via debugfs.
> >> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
> >> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_upd=
> ate_util_hook() and intel_pstate_set_update_util_hook().
> >> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_u=
> til_hook() once from the setpolicy hook.
> >> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller stati=
> stics and target range calculation.
> >> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for H=
> WP parts.
> >> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on=
>  ACPI FADT profile and CPUID.
> >> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP c=
> ontroller status.
> >> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controlle=
> r parameters via debugfs.
> >
> > What I'm missing is an explanation for why this isn't using the
> > infrastructure that was build for these kinds of things? The thermal
> > framework, was AFAIU, supposed to help with these things, and the IPA
> > thing in particular is used by ARM to do exactly this GPU/CPU power
> > budget thing.
> >
> > If thermal/IPA is found wanting, why aren't we improving that?
> 
> The GPU/CPU power budget "thing" is only a positive side effect of this
> series on some TDP-bound systems.  Its ultimate purpose is improving the
> energy efficiency of workloads which have a bottleneck on a device other
> than the CPU, by giving the bottlenecking device driver some influence
> over the response latency of CPUFREQ governors via a PM QoS interface.
> This seems to be completely outside the scope of the thermal framework
> and IPA AFAIU.
> 
> >
> > How much of that ADAPTIVE crud is actually intel_pstate specific? On a
> > (really) quick read it appears to me that much of the controller bits
> > there can be applied more generic, and thus should not be part of any
> > one governor.
> >
> 
> The implementation of that is intel_pstate-specific right now, but the
> basic algorithm could be made to work on any other governor in
> principle, which is why it is exposed as a generic CPUFREQ governor.  I
> don't care about taking out the generic CPUFREQ governor changes if you
> don't like them, and going back to some driver-specific means of turning
> it on and off (though Rafael might disagree with that).
> 
> > Specifically, I want to use sched_util as cpufreq governor and use the
> > intel_pstate as a passive driver.
> 
> Yeah, getting a similar optimization into the schedutil governor has
> been on my wish list for a while, but I haven't had the time to get very
> far on that except for a handful of hacks.  The intel_pstate handling is
> going to be necessary anyway in order to handle HWP systems gracefully,
> at least in the near future until schedutil becomes a viable alternative
> to intel_pstate in active mode on HWP systems.

FWIW, work is under way to make intel_pstate in the passive mode work on HWP
systems.

I have a prototype patch for that, but it can be improved.  I'll post a new
version of it for review, possibly next week.

Cheers!




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-14 10:26       ` Rafael J. Wysocki
  0 siblings, 0 replies; 41+ messages in thread
From: Rafael J. Wysocki @ 2020-05-14 10:26 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: amit.kucheria, linux-pm, Peter Zijlstra, intel-gfx,
	daniel.lezcano, chris.p.wilson, Pandruvada, Srinivas, rui.zhang

On Monday, May 11, 2020 11:01:41 PM CEST Francisco Jerez wrote:
> 
> --==-=-=
> Content-Type: multipart/mixed; boundary="=-=-="
> 
> --=-=-=
> Content-Type: text/plain; charset=utf-8
> Content-Disposition: inline
> Content-Transfer-Encoding: quoted-printable
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
> >> This addresses the technical concerns people brought up about my
> >> previous v2 revision of this series.  Other than a few bug fixes, the
> >> only major change relative to v2 is that the controller is now exposed
> >> as a new CPUFREQ generic governor as requested by Rafael (named
> >> "adaptive" in this RFC though other naming suggestions are welcome).
> >> Main reason for calling this v2.99 rather than v3 is that I haven't
> >> yet addressed all the documentation requests from the v2 thread --
> >> Will spend some time doing that as soon as I have an ACK (ideally from
> >> Rafael) that things are moving in the right direction.
> >>=20
> >> You can also find this series along with the WIP code for non-HWP
> >> platforms in this branch:
> >>=20
> >> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
> >>=20
> >> Thanks!
> >>=20
> >> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
> >> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency ba=
> sed on GPU load.
> >> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters =
> via debugfs.
> >> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
> >> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_upd=
> ate_util_hook() and intel_pstate_set_update_util_hook().
> >> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_u=
> til_hook() once from the setpolicy hook.
> >> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller stati=
> stics and target range calculation.
> >> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for H=
> WP parts.
> >> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on=
>  ACPI FADT profile and CPUID.
> >> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP c=
> ontroller status.
> >> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controlle=
> r parameters via debugfs.
> >
> > What I'm missing is an explanation for why this isn't using the
> > infrastructure that was build for these kinds of things? The thermal
> > framework, was AFAIU, supposed to help with these things, and the IPA
> > thing in particular is used by ARM to do exactly this GPU/CPU power
> > budget thing.
> >
> > If thermal/IPA is found wanting, why aren't we improving that?
> 
> The GPU/CPU power budget "thing" is only a positive side effect of this
> series on some TDP-bound systems.  Its ultimate purpose is improving the
> energy efficiency of workloads which have a bottleneck on a device other
> than the CPU, by giving the bottlenecking device driver some influence
> over the response latency of CPUFREQ governors via a PM QoS interface.
> This seems to be completely outside the scope of the thermal framework
> and IPA AFAIU.
> 
> >
> > How much of that ADAPTIVE crud is actually intel_pstate specific? On a
> > (really) quick read it appears to me that much of the controller bits
> > there can be applied more generic, and thus should not be part of any
> > one governor.
> >
> 
> The implementation of that is intel_pstate-specific right now, but the
> basic algorithm could be made to work on any other governor in
> principle, which is why it is exposed as a generic CPUFREQ governor.  I
> don't care about taking out the generic CPUFREQ governor changes if you
> don't like them, and going back to some driver-specific means of turning
> it on and off (though Rafael might disagree with that).
> 
> > Specifically, I want to use sched_util as cpufreq governor and use the
> > intel_pstate as a passive driver.
> 
> Yeah, getting a similar optimization into the schedutil governor has
> been on my wish list for a while, but I haven't had the time to get very
> far on that except for a handful of hacks.  The intel_pstate handling is
> going to be necessary anyway in order to handle HWP systems gracefully,
> at least in the near future until schedutil becomes a viable alternative
> to intel_pstate in active mode on HWP systems.

FWIW, work is under way to make intel_pstate in the passive mode work on HWP
systems.

I have a prototype patch for that, but it can be improved.  I'll post a new
version of it for review, possibly next week.

Cheers!



_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-05-11 21:01     ` [Intel-gfx] " Francisco Jerez
@ 2020-05-14 11:50       ` Valentin Schneider
  -1 siblings, 0 replies; 41+ messages in thread
From: Valentin Schneider @ 2020-05-14 11:50 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas,
	linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, rui.zhang,
	daniel.lezcano, amit.kucheria, Lukasz Luba


(+Lukasz)

On 11/05/20 22:01, Francisco Jerez wrote:
>> What I'm missing is an explanation for why this isn't using the
>> infrastructure that was build for these kinds of things? The thermal
>> framework, was AFAIU, supposed to help with these things, and the IPA
>> thing in particular is used by ARM to do exactly this GPU/CPU power
>> budget thing.
>>
>> If thermal/IPA is found wanting, why aren't we improving that?
>
> The GPU/CPU power budget "thing" is only a positive side effect of this
> series on some TDP-bound systems.  Its ultimate purpose is improving the
> energy efficiency of workloads which have a bottleneck on a device other
> than the CPU, by giving the bottlenecking device driver some influence
> over the response latency of CPUFREQ governors via a PM QoS interface.
> This seems to be completely outside the scope of the thermal framework
> and IPA AFAIU.
>

It's been a while since I've stared at IPA, but it does sound vaguely
familiar.

When thermally constrained, IPA figures out a budget and splits it between
actors (cpufreq and devfreq devices) depending on how much juice they are
asking for; see cpufreq_get_requested_power() and
devfreq_cooling_get_requested_power(). There's also some weighing involved.

If you look at the cpufreq cooling side of things, you'll see it also uses
the PM QoS interface. For instance, should IPA decide to cap the CPUs
(perhaps because say the GPU is the one drawing most of the juice), it'll
lead to a maximum frequency capping request.

So it does sound like that's what you want, only not just when thermally
constrained.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-14 11:50       ` Valentin Schneider
  0 siblings, 0 replies; 41+ messages in thread
From: Valentin Schneider @ 2020-05-14 11:50 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: amit.kucheria, linux-pm, Peter Zijlstra, intel-gfx,
	daniel.lezcano, Rafael J. Wysocki, chris.p.wilson, Pandruvada,
	Srinivas, rui.zhang, Lukasz Luba


(+Lukasz)

On 11/05/20 22:01, Francisco Jerez wrote:
>> What I'm missing is an explanation for why this isn't using the
>> infrastructure that was build for these kinds of things? The thermal
>> framework, was AFAIU, supposed to help with these things, and the IPA
>> thing in particular is used by ARM to do exactly this GPU/CPU power
>> budget thing.
>>
>> If thermal/IPA is found wanting, why aren't we improving that?
>
> The GPU/CPU power budget "thing" is only a positive side effect of this
> series on some TDP-bound systems.  Its ultimate purpose is improving the
> energy efficiency of workloads which have a bottleneck on a device other
> than the CPU, by giving the bottlenecking device driver some influence
> over the response latency of CPUFREQ governors via a PM QoS interface.
> This seems to be completely outside the scope of the thermal framework
> and IPA AFAIU.
>

It's been a while since I've stared at IPA, but it does sound vaguely
familiar.

When thermally constrained, IPA figures out a budget and splits it between
actors (cpufreq and devfreq devices) depending on how much juice they are
asking for; see cpufreq_get_requested_power() and
devfreq_cooling_get_requested_power(). There's also some weighing involved.

If you look at the cpufreq cooling side of things, you'll see it also uses
the PM QoS interface. For instance, should IPA decide to cap the CPUs
(perhaps because say the GPU is the one drawing most of the juice), it'll
lead to a maximum frequency capping request.

So it does sound like that's what you want, only not just when thermally
constrained.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-05-14 11:50       ` [Intel-gfx] " Valentin Schneider
@ 2020-05-15  0:48         ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-05-15  0:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas,
	linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, rui.zhang,
	daniel.lezcano, amit.kucheria, Lukasz Luba


[-- Attachment #1.1: Type: text/plain, Size: 4393 bytes --]

Valentin Schneider <valentin.schneider@arm.com> writes:

> (+Lukasz)
>
> On 11/05/20 22:01, Francisco Jerez wrote:
>>> What I'm missing is an explanation for why this isn't using the
>>> infrastructure that was build for these kinds of things? The thermal
>>> framework, was AFAIU, supposed to help with these things, and the IPA
>>> thing in particular is used by ARM to do exactly this GPU/CPU power
>>> budget thing.
>>>
>>> If thermal/IPA is found wanting, why aren't we improving that?
>>
>> The GPU/CPU power budget "thing" is only a positive side effect of this
>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>> energy efficiency of workloads which have a bottleneck on a device other
>> than the CPU, by giving the bottlenecking device driver some influence
>> over the response latency of CPUFREQ governors via a PM QoS interface.
>> This seems to be completely outside the scope of the thermal framework
>> and IPA AFAIU.
>>
>
> It's been a while since I've stared at IPA, but it does sound vaguely
> familiar.
>
> When thermally constrained, IPA figures out a budget and splits it between
> actors (cpufreq and devfreq devices) depending on how much juice they are
> asking for; see cpufreq_get_requested_power() and
> devfreq_cooling_get_requested_power(). There's also some weighing involved.
>

I'm aware of those.  Main problem is that the current mechanism for IPA
to figure out the requested power of each actor is based on a rough
estimate of their past power consumption: If an actor was operating at a
highly energy-inefficient regime it will end up requesting more power
than another actor with the same load but more energy-efficient
behavior.  The IPA power allocator is therefore ineffective at improving
the energy efficiency of an agent beyond its past behavior --
Furthermore it seems to *rely* on individual agents being somewhat
energetically responsible in order for its power allocation result to be
anywhere close to optimal.  But improving the energy efficiency of an
agent seems useful in its own right, whether IPA is used to balance
power across agents or not.  That's precisely the purpose of this
series.

> If you look at the cpufreq cooling side of things, you'll see it also uses
> the PM QoS interface. For instance, should IPA decide to cap the CPUs
> (perhaps because say the GPU is the one drawing most of the juice), it'll
> lead to a maximum frequency capping request.
>
> So it does sound like that's what you want, only not just when thermally
> constrained.

Capping the CPU frequency from random device drivers is highly
problematic, because the CPU is a shared resource which a number of
different concurrent applications might be using beyond the GPU client.
The GPU driver has no visibility over its impact on the performance of
other applications.  And even in a single-task environment, in order to
behave as effectively as the present series the GPU driver would need to
monitor the utilization of *all* CPUs in the system and place a
frequency constraint on each one of them (since there is the potential
of the task scheduler migrating the process from one CPU to another
without notice).  Furthermore these frequency constraints would need to
be updated at high frequency in order to avoid performance degradation
whenever the balance of load between CPU and IO device fluctuates.

The present series attempts to remove the burden of computing frequency
constraints out of individual device drivers into the CPUFREQ governor.
Instead the device driver provides a response latency constraint when it
encounters a bottleneck, which can be more easily derived from hardware
and protocol characteristics than a CPU frequency.  PM QoS aggregates
the response latency constraints provided by all applications and gives
CPUFREQ a single response latency target compatible with all of them (so
a device driver specifying a high latency target won't lead to
performance degradation in a concurrent application with lower latency
constraints).  The CPUFREQ governor then computes frequency constraints
for each CPU core that minimize energy usage without limiting
throughput, based on the results obtained from CPU performance counters,
while guaranteeing that a discontinuous transition in CPU utilization
leads to a proportional transition in the CPU frequency before the
specified response latency has elapsed.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-15  0:48         ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-05-15  0:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: amit.kucheria, linux-pm, Peter Zijlstra, intel-gfx,
	daniel.lezcano, Rafael J. Wysocki, chris.p.wilson, Pandruvada,
	Srinivas, rui.zhang, Lukasz Luba


[-- Attachment #1.1.1: Type: text/plain, Size: 4393 bytes --]

Valentin Schneider <valentin.schneider@arm.com> writes:

> (+Lukasz)
>
> On 11/05/20 22:01, Francisco Jerez wrote:
>>> What I'm missing is an explanation for why this isn't using the
>>> infrastructure that was build for these kinds of things? The thermal
>>> framework, was AFAIU, supposed to help with these things, and the IPA
>>> thing in particular is used by ARM to do exactly this GPU/CPU power
>>> budget thing.
>>>
>>> If thermal/IPA is found wanting, why aren't we improving that?
>>
>> The GPU/CPU power budget "thing" is only a positive side effect of this
>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>> energy efficiency of workloads which have a bottleneck on a device other
>> than the CPU, by giving the bottlenecking device driver some influence
>> over the response latency of CPUFREQ governors via a PM QoS interface.
>> This seems to be completely outside the scope of the thermal framework
>> and IPA AFAIU.
>>
>
> It's been a while since I've stared at IPA, but it does sound vaguely
> familiar.
>
> When thermally constrained, IPA figures out a budget and splits it between
> actors (cpufreq and devfreq devices) depending on how much juice they are
> asking for; see cpufreq_get_requested_power() and
> devfreq_cooling_get_requested_power(). There's also some weighing involved.
>

I'm aware of those.  Main problem is that the current mechanism for IPA
to figure out the requested power of each actor is based on a rough
estimate of their past power consumption: If an actor was operating at a
highly energy-inefficient regime it will end up requesting more power
than another actor with the same load but more energy-efficient
behavior.  The IPA power allocator is therefore ineffective at improving
the energy efficiency of an agent beyond its past behavior --
Furthermore it seems to *rely* on individual agents being somewhat
energetically responsible in order for its power allocation result to be
anywhere close to optimal.  But improving the energy efficiency of an
agent seems useful in its own right, whether IPA is used to balance
power across agents or not.  That's precisely the purpose of this
series.

> If you look at the cpufreq cooling side of things, you'll see it also uses
> the PM QoS interface. For instance, should IPA decide to cap the CPUs
> (perhaps because say the GPU is the one drawing most of the juice), it'll
> lead to a maximum frequency capping request.
>
> So it does sound like that's what you want, only not just when thermally
> constrained.

Capping the CPU frequency from random device drivers is highly
problematic, because the CPU is a shared resource which a number of
different concurrent applications might be using beyond the GPU client.
The GPU driver has no visibility over its impact on the performance of
other applications.  And even in a single-task environment, in order to
behave as effectively as the present series the GPU driver would need to
monitor the utilization of *all* CPUs in the system and place a
frequency constraint on each one of them (since there is the potential
of the task scheduler migrating the process from one CPU to another
without notice).  Furthermore these frequency constraints would need to
be updated at high frequency in order to avoid performance degradation
whenever the balance of load between CPU and IO device fluctuates.

The present series attempts to remove the burden of computing frequency
constraints out of individual device drivers into the CPUFREQ governor.
Instead the device driver provides a response latency constraint when it
encounters a bottleneck, which can be more easily derived from hardware
and protocol characteristics than a CPU frequency.  PM QoS aggregates
the response latency constraints provided by all applications and gives
CPUFREQ a single response latency target compatible with all of them (so
a device driver specifying a high latency target won't lead to
performance degradation in a concurrent application with lower latency
constraints).  The CPUFREQ governor then computes frequency constraints
for each CPU core that minimize energy usage without limiting
throughput, based on the results obtained from CPU performance counters,
while guaranteeing that a discontinuous transition in CPU utilization
leads to a proportional transition in the CPU frequency before the
specified response latency has elapsed.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-05-14 10:26       ` [Intel-gfx] " Rafael J. Wysocki
@ 2020-05-15  0:48         ` Francisco Jerez
  -1 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-05-15  0:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Pandruvada, Srinivas, linux-pm, intel-gfx,
	chris.p.wilson, Vivi, Rodrigo, rui.zhang, daniel.lezcano,
	amit.kucheria


[-- Attachment #1.1: Type: text/plain, Size: 4953 bytes --]

"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Monday, May 11, 2020 11:01:41 PM CEST Francisco Jerez wrote:
>> 
>> --==-=-=
>> Content-Type: multipart/mixed; boundary="=-=-="
>> 
>> --=-=-=
>> Content-Type: text/plain; charset=utf-8
>> Content-Disposition: inline
>> Content-Transfer-Encoding: quoted-printable
>> 
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
>> >> This addresses the technical concerns people brought up about my
>> >> previous v2 revision of this series.  Other than a few bug fixes, the
>> >> only major change relative to v2 is that the controller is now exposed
>> >> as a new CPUFREQ generic governor as requested by Rafael (named
>> >> "adaptive" in this RFC though other naming suggestions are welcome).
>> >> Main reason for calling this v2.99 rather than v3 is that I haven't
>> >> yet addressed all the documentation requests from the v2 thread --
>> >> Will spend some time doing that as soon as I have an ACK (ideally from
>> >> Rafael) that things are moving in the right direction.
>> >>=20
>> >> You can also find this series along with the WIP code for non-HWP
>> >> platforms in this branch:
>> >>=20
>> >> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
>> >>=20
>> >> Thanks!
>> >>=20
>> >> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
>> >> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency ba=
>> sed on GPU load.
>> >> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters =
>> via debugfs.
>> >> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
>> >> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_upd=
>> ate_util_hook() and intel_pstate_set_update_util_hook().
>> >> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_u=
>> til_hook() once from the setpolicy hook.
>> >> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller stati=
>> stics and target range calculation.
>> >> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for H=
>> WP parts.
>> >> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on=
>>  ACPI FADT profile and CPUID.
>> >> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP c=
>> ontroller status.
>> >> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controlle=
>> r parameters via debugfs.
>> >
>> > What I'm missing is an explanation for why this isn't using the
>> > infrastructure that was build for these kinds of things? The thermal
>> > framework, was AFAIU, supposed to help with these things, and the IPA
>> > thing in particular is used by ARM to do exactly this GPU/CPU power
>> > budget thing.
>> >
>> > If thermal/IPA is found wanting, why aren't we improving that?
>> 
>> The GPU/CPU power budget "thing" is only a positive side effect of this
>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>> energy efficiency of workloads which have a bottleneck on a device other
>> than the CPU, by giving the bottlenecking device driver some influence
>> over the response latency of CPUFREQ governors via a PM QoS interface.
>> This seems to be completely outside the scope of the thermal framework
>> and IPA AFAIU.
>> 
>> >
>> > How much of that ADAPTIVE crud is actually intel_pstate specific? On a
>> > (really) quick read it appears to me that much of the controller bits
>> > there can be applied more generic, and thus should not be part of any
>> > one governor.
>> >
>> 
>> The implementation of that is intel_pstate-specific right now, but the
>> basic algorithm could be made to work on any other governor in
>> principle, which is why it is exposed as a generic CPUFREQ governor.  I
>> don't care about taking out the generic CPUFREQ governor changes if you
>> don't like them, and going back to some driver-specific means of turning
>> it on and off (though Rafael might disagree with that).
>> 
>> > Specifically, I want to use sched_util as cpufreq governor and use the
>> > intel_pstate as a passive driver.
>> 
>> Yeah, getting a similar optimization into the schedutil governor has
>> been on my wish list for a while, but I haven't had the time to get very
>> far on that except for a handful of hacks.  The intel_pstate handling is
>> going to be necessary anyway in order to handle HWP systems gracefully,
>> at least in the near future until schedutil becomes a viable alternative
>> to intel_pstate in active mode on HWP systems.
>
> FWIW, work is under way to make intel_pstate in the passive mode work on HWP
> systems.
>
> I have a prototype patch for that, but it can be improved.  I'll post a new
> version of it for review, possibly next week.
>

Looking forward to that, feel free to CC me on it.

> Cheers!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-15  0:48         ` Francisco Jerez
  0 siblings, 0 replies; 41+ messages in thread
From: Francisco Jerez @ 2020-05-15  0:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: amit.kucheria, linux-pm, Peter Zijlstra, intel-gfx,
	daniel.lezcano, chris.p.wilson, Pandruvada, Srinivas, rui.zhang


[-- Attachment #1.1.1: Type: text/plain, Size: 4953 bytes --]

"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Monday, May 11, 2020 11:01:41 PM CEST Francisco Jerez wrote:
>> 
>> --==-=-=
>> Content-Type: multipart/mixed; boundary="=-=-="
>> 
>> --=-=-=
>> Content-Type: text/plain; charset=utf-8
>> Content-Disposition: inline
>> Content-Transfer-Encoding: quoted-printable
>> 
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Mon, Apr 27, 2020 at 08:22:47PM -0700, Francisco Jerez wrote:
>> >> This addresses the technical concerns people brought up about my
>> >> previous v2 revision of this series.  Other than a few bug fixes, the
>> >> only major change relative to v2 is that the controller is now exposed
>> >> as a new CPUFREQ generic governor as requested by Rafael (named
>> >> "adaptive" in this RFC though other naming suggestions are welcome).
>> >> Main reason for calling this v2.99 rather than v3 is that I haven't
>> >> yet addressed all the documentation requests from the v2 thread --
>> >> Will spend some time doing that as soon as I have an ACK (ideally from
>> >> Rafael) that things are moving in the right direction.
>> >>=20
>> >> You can also find this series along with the WIP code for non-HWP
>> >> platforms in this branch:
>> >>=20
>> >> https://github.com/curro/linux/tree/intel_pstate-vlp-v2.99
>> >>=20
>> >> Thanks!
>> >>=20
>> >> [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit.
>> >> [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency ba=
>> sed on GPU load.
>> >> [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters =
>> via debugfs.
>> >> [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy.
>> >> [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_upd=
>> ate_util_hook() and intel_pstate_set_update_util_hook().
>> >> [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_u=
>> til_hook() once from the setpolicy hook.
>> >> [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller stati=
>> stics and target range calculation.
>> >> [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for H=
>> WP parts.
>> >> [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on=
>>  ACPI FADT profile and CPUID.
>> >> [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP c=
>> ontroller status.
>> >> [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controlle=
>> r parameters via debugfs.
>> >
>> > What I'm missing is an explanation for why this isn't using the
>> > infrastructure that was build for these kinds of things? The thermal
>> > framework, was AFAIU, supposed to help with these things, and the IPA
>> > thing in particular is used by ARM to do exactly this GPU/CPU power
>> > budget thing.
>> >
>> > If thermal/IPA is found wanting, why aren't we improving that?
>> 
>> The GPU/CPU power budget "thing" is only a positive side effect of this
>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>> energy efficiency of workloads which have a bottleneck on a device other
>> than the CPU, by giving the bottlenecking device driver some influence
>> over the response latency of CPUFREQ governors via a PM QoS interface.
>> This seems to be completely outside the scope of the thermal framework
>> and IPA AFAIU.
>> 
>> >
>> > How much of that ADAPTIVE crud is actually intel_pstate specific? On a
>> > (really) quick read it appears to me that much of the controller bits
>> > there can be applied more generic, and thus should not be part of any
>> > one governor.
>> >
>> 
>> The implementation of that is intel_pstate-specific right now, but the
>> basic algorithm could be made to work on any other governor in
>> principle, which is why it is exposed as a generic CPUFREQ governor.  I
>> don't care about taking out the generic CPUFREQ governor changes if you
>> don't like them, and going back to some driver-specific means of turning
>> it on and off (though Rafael might disagree with that).
>> 
>> > Specifically, I want to use sched_util as cpufreq governor and use the
>> > intel_pstate as a passive driver.
>> 
>> Yeah, getting a similar optimization into the schedutil governor has
>> been on my wish list for a while, but I haven't had the time to get very
>> far on that except for a handful of hacks.  The intel_pstate handling is
>> going to be necessary anyway in order to handle HWP systems gracefully,
>> at least in the near future until schedutil becomes a viable alternative
>> to intel_pstate in active mode on HWP systems.
>
> FWIW, work is under way to make intel_pstate in the passive mode work on HWP
> systems.
>
> I have a prototype patch for that, but it can be improved.  I'll post a new
> version of it for review, possibly next week.
>

Looking forward to that, feel free to CC me on it.

> Cheers!

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-05-15  0:48         ` [Intel-gfx] " Francisco Jerez
@ 2020-05-15 18:09           ` Valentin Schneider
  -1 siblings, 0 replies; 41+ messages in thread
From: Valentin Schneider @ 2020-05-15 18:09 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas,
	linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, rui.zhang,
	daniel.lezcano, amit.kucheria, Lukasz Luba


On 15/05/20 01:48, Francisco Jerez wrote:
> Valentin Schneider <valentin.schneider@arm.com> writes:
>
>> (+Lukasz)
>>
>> On 11/05/20 22:01, Francisco Jerez wrote:
>>>> What I'm missing is an explanation for why this isn't using the
>>>> infrastructure that was build for these kinds of things? The thermal
>>>> framework, was AFAIU, supposed to help with these things, and the IPA
>>>> thing in particular is used by ARM to do exactly this GPU/CPU power
>>>> budget thing.
>>>>
>>>> If thermal/IPA is found wanting, why aren't we improving that?
>>>
>>> The GPU/CPU power budget "thing" is only a positive side effect of this
>>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>>> energy efficiency of workloads which have a bottleneck on a device other
>>> than the CPU, by giving the bottlenecking device driver some influence
>>> over the response latency of CPUFREQ governors via a PM QoS interface.
>>> This seems to be completely outside the scope of the thermal framework
>>> and IPA AFAIU.
>>>
>>
>> It's been a while since I've stared at IPA, but it does sound vaguely
>> familiar.
>>
>> When thermally constrained, IPA figures out a budget and splits it between
>> actors (cpufreq and devfreq devices) depending on how much juice they are
>> asking for; see cpufreq_get_requested_power() and
>> devfreq_cooling_get_requested_power(). There's also some weighing involved.
>>
>
> I'm aware of those.  Main problem is that the current mechanism for IPA
> to figure out the requested power of each actor is based on a rough
> estimate of their past power consumption: If an actor was operating at a
> highly energy-inefficient regime it will end up requesting more power
> than another actor with the same load but more energy-efficient
> behavior.

Right, we do mix load (busy time for either cpufreq and devfreq devices
AFAIR) and current state (freq) into one single power value.

> The IPA power allocator is therefore ineffective at improving
> the energy efficiency of an agent beyond its past behavior --
> Furthermore it seems to *rely* on individual agents being somewhat
> energetically responsible in order for its power allocation result to be
> anywhere close to optimal.  But improving the energy efficiency of an
> agent seems useful in its own right, whether IPA is used to balance
> power across agents or not.  That's precisely the purpose of this
> series.
>
>> If you look at the cpufreq cooling side of things, you'll see it also uses
>> the PM QoS interface. For instance, should IPA decide to cap the CPUs
>> (perhaps because say the GPU is the one drawing most of the juice), it'll
>> lead to a maximum frequency capping request.
>>
>> So it does sound like that's what you want, only not just when thermally
>> constrained.
>
> Capping the CPU frequency from random device drivers is highly
> problematic, because the CPU is a shared resource which a number of
> different concurrent applications might be using beyond the GPU client.
> The GPU driver has no visibility over its impact on the performance of
> other applications.  And even in a single-task environment, in order to
> behave as effectively as the present series the GPU driver would need to
> monitor the utilization of *all* CPUs in the system and place a
> frequency constraint on each one of them (since there is the potential
> of the task scheduler migrating the process from one CPU to another
> without notice).  Furthermore these frequency constraints would need to
> be updated at high frequency in order to avoid performance degradation
> whenever the balance of load between CPU and IO device fluctuates.
>
> The present series attempts to remove the burden of computing frequency
> constraints out of individual device drivers into the CPUFREQ governor.
> Instead the device driver provides a response latency constraint when it
> encounters a bottleneck, which can be more easily derived from hardware
> and protocol characteristics than a CPU frequency.  PM QoS aggregates
> the response latency constraints provided by all applications and gives
> CPUFREQ a single response latency target compatible with all of them (so
> a device driver specifying a high latency target won't lead to
> performance degradation in a concurrent application with lower latency
> constraints).  The CPUFREQ governor then computes frequency constraints
> for each CPU core that minimize energy usage without limiting
> throughput, based on the results obtained from CPU performance counters,
> while guaranteeing that a discontinuous transition in CPU utilization
> leads to a proportional transition in the CPU frequency before the
> specified response latency has elapsed.

Right, I think I see your point there. I'm thinking the 'actual' IPA gurus
(Lukasz or even Javi) may want to have a look at this.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-15 18:09           ` Valentin Schneider
  0 siblings, 0 replies; 41+ messages in thread
From: Valentin Schneider @ 2020-05-15 18:09 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: amit.kucheria, linux-pm, Peter Zijlstra, intel-gfx,
	daniel.lezcano, Rafael J. Wysocki, chris.p.wilson, Pandruvada,
	Srinivas, rui.zhang, Lukasz Luba


On 15/05/20 01:48, Francisco Jerez wrote:
> Valentin Schneider <valentin.schneider@arm.com> writes:
>
>> (+Lukasz)
>>
>> On 11/05/20 22:01, Francisco Jerez wrote:
>>>> What I'm missing is an explanation for why this isn't using the
>>>> infrastructure that was build for these kinds of things? The thermal
>>>> framework, was AFAIU, supposed to help with these things, and the IPA
>>>> thing in particular is used by ARM to do exactly this GPU/CPU power
>>>> budget thing.
>>>>
>>>> If thermal/IPA is found wanting, why aren't we improving that?
>>>
>>> The GPU/CPU power budget "thing" is only a positive side effect of this
>>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>>> energy efficiency of workloads which have a bottleneck on a device other
>>> than the CPU, by giving the bottlenecking device driver some influence
>>> over the response latency of CPUFREQ governors via a PM QoS interface.
>>> This seems to be completely outside the scope of the thermal framework
>>> and IPA AFAIU.
>>>
>>
>> It's been a while since I've stared at IPA, but it does sound vaguely
>> familiar.
>>
>> When thermally constrained, IPA figures out a budget and splits it between
>> actors (cpufreq and devfreq devices) depending on how much juice they are
>> asking for; see cpufreq_get_requested_power() and
>> devfreq_cooling_get_requested_power(). There's also some weighing involved.
>>
>
> I'm aware of those.  Main problem is that the current mechanism for IPA
> to figure out the requested power of each actor is based on a rough
> estimate of their past power consumption: If an actor was operating at a
> highly energy-inefficient regime it will end up requesting more power
> than another actor with the same load but more energy-efficient
> behavior.

Right, we do mix load (busy time for either cpufreq and devfreq devices
AFAIR) and current state (freq) into one single power value.

> The IPA power allocator is therefore ineffective at improving
> the energy efficiency of an agent beyond its past behavior --
> Furthermore it seems to *rely* on individual agents being somewhat
> energetically responsible in order for its power allocation result to be
> anywhere close to optimal.  But improving the energy efficiency of an
> agent seems useful in its own right, whether IPA is used to balance
> power across agents or not.  That's precisely the purpose of this
> series.
>
>> If you look at the cpufreq cooling side of things, you'll see it also uses
>> the PM QoS interface. For instance, should IPA decide to cap the CPUs
>> (perhaps because say the GPU is the one drawing most of the juice), it'll
>> lead to a maximum frequency capping request.
>>
>> So it does sound like that's what you want, only not just when thermally
>> constrained.
>
> Capping the CPU frequency from random device drivers is highly
> problematic, because the CPU is a shared resource which a number of
> different concurrent applications might be using beyond the GPU client.
> The GPU driver has no visibility over its impact on the performance of
> other applications.  And even in a single-task environment, in order to
> behave as effectively as the present series the GPU driver would need to
> monitor the utilization of *all* CPUs in the system and place a
> frequency constraint on each one of them (since there is the potential
> of the task scheduler migrating the process from one CPU to another
> without notice).  Furthermore these frequency constraints would need to
> be updated at high frequency in order to avoid performance degradation
> whenever the balance of load between CPU and IO device fluctuates.
>
> The present series attempts to remove the burden of computing frequency
> constraints out of individual device drivers into the CPUFREQ governor.
> Instead the device driver provides a response latency constraint when it
> encounters a bottleneck, which can be more easily derived from hardware
> and protocol characteristics than a CPU frequency.  PM QoS aggregates
> the response latency constraints provided by all applications and gives
> CPUFREQ a single response latency target compatible with all of them (so
> a device driver specifying a high latency target won't lead to
> performance degradation in a concurrent application with lower latency
> constraints).  The CPUFREQ governor then computes frequency constraints
> for each CPU core that minimize energy usage without limiting
> throughput, based on the results obtained from CPU performance counters,
> while guaranteeing that a discontinuous transition in CPU utilization
> leads to a proportional transition in the CPU frequency before the
> specified response latency has elapsed.

Right, I think I see your point there. I'm thinking the 'actual' IPA gurus
(Lukasz or even Javi) may want to have a look at this.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
  2020-05-15 18:09           ` [Intel-gfx] " Valentin Schneider
@ 2020-05-28  9:29             ` Lukasz Luba
  -1 siblings, 0 replies; 41+ messages in thread
From: Lukasz Luba @ 2020-05-28  9:29 UTC (permalink / raw)
  To: Valentin Schneider, Francisco Jerez
  Cc: Peter Zijlstra, Rafael J. Wysocki, Pandruvada, Srinivas,
	linux-pm, intel-gfx, chris.p.wilson, Vivi, Rodrigo, rui.zhang,
	daniel.lezcano, amit.kucheria



On 5/15/20 7:09 PM, Valentin Schneider wrote:
> 
> On 15/05/20 01:48, Francisco Jerez wrote:
>> Valentin Schneider <valentin.schneider@arm.com> writes:
>>
>>> (+Lukasz)
>>>
>>> On 11/05/20 22:01, Francisco Jerez wrote:
>>>>> What I'm missing is an explanation for why this isn't using the
>>>>> infrastructure that was build for these kinds of things? The thermal
>>>>> framework, was AFAIU, supposed to help with these things, and the IPA
>>>>> thing in particular is used by ARM to do exactly this GPU/CPU power
>>>>> budget thing.
>>>>>
>>>>> If thermal/IPA is found wanting, why aren't we improving that?
>>>>
>>>> The GPU/CPU power budget "thing" is only a positive side effect of this
>>>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>>>> energy efficiency of workloads which have a bottleneck on a device other
>>>> than the CPU, by giving the bottlenecking device driver some influence
>>>> over the response latency of CPUFREQ governors via a PM QoS interface.
>>>> This seems to be completely outside the scope of the thermal framework
>>>> and IPA AFAIU.
>>>>
>>>
>>> It's been a while since I've stared at IPA, but it does sound vaguely
>>> familiar.
>>>
>>> When thermally constrained, IPA figures out a budget and splits it between
>>> actors (cpufreq and devfreq devices) depending on how much juice they are
>>> asking for; see cpufreq_get_requested_power() and
>>> devfreq_cooling_get_requested_power(). There's also some weighing involved.
>>>
>>
>> I'm aware of those.  Main problem is that the current mechanism for IPA
>> to figure out the requested power of each actor is based on a rough
>> estimate of their past power consumption: If an actor was operating at a
>> highly energy-inefficient regime it will end up requesting more power
>> than another actor with the same load but more energy-efficient
>> behavior.

This can be tweaked by changing the weight of an actor (unfortunately
not in a real-time by kernel). We usually setup them once, in DT.
So, it's possible to set different weight for the LITTLE cores (which
are more energy-efficient) and the big cores (in a good way or bad).

> 
> Right, we do mix load (busy time for either cpufreq and devfreq devices
> AFAIR) and current state (freq) into one single power value.
> 
>> The IPA power allocator is therefore ineffective at improving
>> the energy efficiency of an agent beyond its past behavior --
>> Furthermore it seems to *rely* on individual agents being somewhat
>> energetically responsible in order for its power allocation result to be
>> anywhere close to optimal.  But improving the energy efficiency of an
>> agent seems useful in its own right, whether IPA is used to balance
>> power across agents or not.  That's precisely the purpose of this
>> series.

I don't fully agree here, i.e. in a properly setup platform we promote
more energy-efficient LITTLE cores when there is a limited power budget. 
That would cause capping on big cores and scheduler should see it.
There are some limitations in the IPA, but the requirements where
different back then, mainline code was different, etc.

>>
>>> If you look at the cpufreq cooling side of things, you'll see it also uses
>>> the PM QoS interface. For instance, should IPA decide to cap the CPUs
>>> (perhaps because say the GPU is the one drawing most of the juice), it'll
>>> lead to a maximum frequency capping request.
>>>
>>> So it does sound like that's what you want, only not just when thermally
>>> constrained.
>>
>> Capping the CPU frequency from random device drivers is highly
>> problematic, because the CPU is a shared resource which a number of
>> different concurrent applications might be using beyond the GPU client.
>> The GPU driver has no visibility over its impact on the performance of
>> other applications.  And even in a single-task environment, in order to
>> behave as effectively as the present series the GPU driver would need to
>> monitor the utilization of *all* CPUs in the system and place a
>> frequency constraint on each one of them (since there is the potential
>> of the task scheduler migrating the process from one CPU to another
>> without notice).  Furthermore these frequency constraints would need to
>> be updated at high frequency in order to avoid performance degradation
>> whenever the balance of load between CPU and IO device fluctuates.
>>
>> The present series attempts to remove the burden of computing frequency
>> constraints out of individual device drivers into the CPUFREQ governor.
>> Instead the device driver provides a response latency constraint when it
>> encounters a bottleneck, which can be more easily derived from hardware
>> and protocol characteristics than a CPU frequency.  PM QoS aggregates
>> the response latency constraints provided by all applications and gives
>> CPUFREQ a single response latency target compatible with all of them (so
>> a device driver specifying a high latency target won't lead to
>> performance degradation in a concurrent application with lower latency
>> constraints).  The CPUFREQ governor then computes frequency constraints
>> for each CPU core that minimize energy usage without limiting
>> throughput, based on the results obtained from CPU performance counters,
>> while guaranteeing that a discontinuous transition in CPU utilization
>> leads to a proportional transition in the CPU frequency before the
>> specified response latency has elapsed.
> 
> Right, I think I see your point there. I'm thinking the 'actual' IPA gurus
> (Lukasz or even Javi) may want to have a look at this.
> 

This patch set AFAIU has different goals than IPA or any other thermal
governor.

I don't know the details of this Intel platform and the mechanisms
which are there for thermal and power budget, so I might be wrong in
some points (correct me where needed).

Main differences comparing to IPA in regards the platform:
- the series works on a platform which does not actually control the
frequency (AFAIK Intel freq can be changed by FW due to any reason).
IPA has been designed for platform which has full control over the
frequency.
- It does not work on Heterogeneous CPUs.
IPA is aware of big, LITTLE or even a different tracks used
- this patch set ignores the temperature probably assuming it is done
by something else (FW or thermal governor).
IPA has the PID built on top of temp sensor and must control it.

Different platforms, different behaviors, different requirements.
I agree IPA has to catch up with the new mainline solutions, though.

Regards,
Lukasz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99)
@ 2020-05-28  9:29             ` Lukasz Luba
  0 siblings, 0 replies; 41+ messages in thread
From: Lukasz Luba @ 2020-05-28  9:29 UTC (permalink / raw)
  To: Valentin Schneider, Francisco Jerez
  Cc: amit.kucheria, linux-pm, Peter Zijlstra, intel-gfx,
	daniel.lezcano, Rafael J. Wysocki, chris.p.wilson, Pandruvada,
	Srinivas, rui.zhang



On 5/15/20 7:09 PM, Valentin Schneider wrote:
> 
> On 15/05/20 01:48, Francisco Jerez wrote:
>> Valentin Schneider <valentin.schneider@arm.com> writes:
>>
>>> (+Lukasz)
>>>
>>> On 11/05/20 22:01, Francisco Jerez wrote:
>>>>> What I'm missing is an explanation for why this isn't using the
>>>>> infrastructure that was build for these kinds of things? The thermal
>>>>> framework, was AFAIU, supposed to help with these things, and the IPA
>>>>> thing in particular is used by ARM to do exactly this GPU/CPU power
>>>>> budget thing.
>>>>>
>>>>> If thermal/IPA is found wanting, why aren't we improving that?
>>>>
>>>> The GPU/CPU power budget "thing" is only a positive side effect of this
>>>> series on some TDP-bound systems.  Its ultimate purpose is improving the
>>>> energy efficiency of workloads which have a bottleneck on a device other
>>>> than the CPU, by giving the bottlenecking device driver some influence
>>>> over the response latency of CPUFREQ governors via a PM QoS interface.
>>>> This seems to be completely outside the scope of the thermal framework
>>>> and IPA AFAIU.
>>>>
>>>
>>> It's been a while since I've stared at IPA, but it does sound vaguely
>>> familiar.
>>>
>>> When thermally constrained, IPA figures out a budget and splits it between
>>> actors (cpufreq and devfreq devices) depending on how much juice they are
>>> asking for; see cpufreq_get_requested_power() and
>>> devfreq_cooling_get_requested_power(). There's also some weighing involved.
>>>
>>
>> I'm aware of those.  Main problem is that the current mechanism for IPA
>> to figure out the requested power of each actor is based on a rough
>> estimate of their past power consumption: If an actor was operating at a
>> highly energy-inefficient regime it will end up requesting more power
>> than another actor with the same load but more energy-efficient
>> behavior.

This can be tweaked by changing the weight of an actor (unfortunately
not in a real-time by kernel). We usually setup them once, in DT.
So, it's possible to set different weight for the LITTLE cores (which
are more energy-efficient) and the big cores (in a good way or bad).

> 
> Right, we do mix load (busy time for either cpufreq and devfreq devices
> AFAIR) and current state (freq) into one single power value.
> 
>> The IPA power allocator is therefore ineffective at improving
>> the energy efficiency of an agent beyond its past behavior --
>> Furthermore it seems to *rely* on individual agents being somewhat
>> energetically responsible in order for its power allocation result to be
>> anywhere close to optimal.  But improving the energy efficiency of an
>> agent seems useful in its own right, whether IPA is used to balance
>> power across agents or not.  That's precisely the purpose of this
>> series.

I don't fully agree here, i.e. in a properly setup platform we promote
more energy-efficient LITTLE cores when there is a limited power budget. 
That would cause capping on big cores and scheduler should see it.
There are some limitations in the IPA, but the requirements where
different back then, mainline code was different, etc.

>>
>>> If you look at the cpufreq cooling side of things, you'll see it also uses
>>> the PM QoS interface. For instance, should IPA decide to cap the CPUs
>>> (perhaps because say the GPU is the one drawing most of the juice), it'll
>>> lead to a maximum frequency capping request.
>>>
>>> So it does sound like that's what you want, only not just when thermally
>>> constrained.
>>
>> Capping the CPU frequency from random device drivers is highly
>> problematic, because the CPU is a shared resource which a number of
>> different concurrent applications might be using beyond the GPU client.
>> The GPU driver has no visibility over its impact on the performance of
>> other applications.  And even in a single-task environment, in order to
>> behave as effectively as the present series the GPU driver would need to
>> monitor the utilization of *all* CPUs in the system and place a
>> frequency constraint on each one of them (since there is the potential
>> of the task scheduler migrating the process from one CPU to another
>> without notice).  Furthermore these frequency constraints would need to
>> be updated at high frequency in order to avoid performance degradation
>> whenever the balance of load between CPU and IO device fluctuates.
>>
>> The present series attempts to remove the burden of computing frequency
>> constraints out of individual device drivers into the CPUFREQ governor.
>> Instead the device driver provides a response latency constraint when it
>> encounters a bottleneck, which can be more easily derived from hardware
>> and protocol characteristics than a CPU frequency.  PM QoS aggregates
>> the response latency constraints provided by all applications and gives
>> CPUFREQ a single response latency target compatible with all of them (so
>> a device driver specifying a high latency target won't lead to
>> performance degradation in a concurrent application with lower latency
>> constraints).  The CPUFREQ governor then computes frequency constraints
>> for each CPU core that minimize energy usage without limiting
>> throughput, based on the results obtained from CPU performance counters,
>> while guaranteeing that a discontinuous transition in CPU utilization
>> leads to a proportional transition in the CPU frequency before the
>> specified response latency has elapsed.
> 
> Right, I think I see your point there. I'm thinking the 'actual' IPA gurus
> (Lukasz or even Javi) may want to have a look at this.
> 

This patch set AFAIU has different goals than IPA or any other thermal
governor.

I don't know the details of this Intel platform and the mechanisms
which are there for thermal and power budget, so I might be wrong in
some points (correct me where needed).

Main differences comparing to IPA in regards the platform:
- the series works on a platform which does not actually control the
frequency (AFAIK Intel freq can be changed by FW due to any reason).
IPA has been designed for platform which has full control over the
frequency.
- It does not work on Heterogeneous CPUs.
IPA is aware of big, LITTLE or even a different tracks used
- this patch set ignores the temperature probably assuming it is done
by something else (FW or thermal governor).
IPA has the PID built on top of temp sensor and must control it.

Different platforms, different behaviors, different requirements.
I agree IPA has to catch up with the new mainline solutions, though.

Regards,
Lukasz

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2020-05-28 13:05 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-28  3:22 [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99) Francisco Jerez
2020-04-28  3:22 ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 02/11] drm/i915: Adjust PM QoS scaling response frequency based on GPU load Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 03/11] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 04/11] cpufreq: Define ADAPTIVE frequency governor policy Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 05/11] cpufreq: intel_pstate: Reorder intel_pstate_clear_update_util_hook() and intel_pstate_set_update_util_hook() Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 06/11] cpufreq: intel_pstate: Call intel_pstate_set_update_util_hook() once from the setpolicy hook Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 07/11] cpufreq: intel_pstate: Implement VLP controller statistics and target range calculation Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 08/11] cpufreq: intel_pstate: Implement VLP controller for HWP parts Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 09/11] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 10/11] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:22 ` [PATCHv2.99 11/11] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs Francisco Jerez
2020-04-28  3:22   ` [Intel-gfx] " Francisco Jerez
2020-04-28  3:32 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [PATCHv2.99,01/11] PM: QoS: Add CPU_SCALING_RESPONSE global PM QoS limit Patchwork
2020-05-11 10:57 ` [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2.99) Peter Zijlstra
2020-05-11 10:57   ` [Intel-gfx] " Peter Zijlstra
2020-05-11 21:01   ` Francisco Jerez
2020-05-11 21:01     ` [Intel-gfx] " Francisco Jerez
2020-05-14 10:26     ` Rafael J. Wysocki
2020-05-14 10:26       ` [Intel-gfx] " Rafael J. Wysocki
2020-05-15  0:48       ` Francisco Jerez
2020-05-15  0:48         ` [Intel-gfx] " Francisco Jerez
2020-05-14 11:50     ` Valentin Schneider
2020-05-14 11:50       ` [Intel-gfx] " Valentin Schneider
2020-05-15  0:48       ` Francisco Jerez
2020-05-15  0:48         ` [Intel-gfx] " Francisco Jerez
2020-05-15 18:09         ` Valentin Schneider
2020-05-15 18:09           ` [Intel-gfx] " Valentin Schneider
2020-05-28  9:29           ` Lukasz Luba
2020-05-28  9:29             ` [Intel-gfx] " Lukasz Luba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.