linux-kernel.vger.kernel.org archive mirror
* [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data
@ 2016-02-21 23:16 Rafael J. Wysocki
  2016-02-21 23:18 ` [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-21 23:16 UTC (permalink / raw)
  To: Linux PM list, Juri Lelli
  Cc: Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Steve Muckle, Ingo Molnar

Hi Everyone,

Usually, I don't send introductory messages for single patches, but this
one is an exception, because I didn't want to put all of my considerations
into the patch changelog.

So I have been told a few times already that I should not introduce
interfaces passing arguments that aren't used in the current code without
telling anyone what my plans for using those arguments in the future may be
(although IMO that would not be too hard to figure out), so here's an example.

Juri, this may not be what you expected.  In fact, I didn't expect it to
look like this either when I started to think about it.  Initially, I was
considering modifying the existing governors to use the utilization data
somehow, but then I realized that doing so would make them behave differently
and that might confuse some users.

So here it is: a new functional cpufreq governor.  It is very simple (arguably
on the verge of being overly simplistic), but it gets the job done.  I have only
tested it (very lightly) on a system with one CPU per cpufreq policy (so the
"shared" path in it is admittedly untested), but in that simple case the
frequency evidently follows the CPU utilization as expected.

The reason I didn't post it earlier is that I needed to clean up the
existing governor code enough to be able to do anything new on top of it (you
may have noticed the cleanup work going on during the last couple of weeks).

Now, there are a few observations to be made about it that may be interesting
to someone (they definitely are interesting to me).  Some of them are mentioned
in the patch changelog too.

First off, note how simple it is: 250 lines of code including struct definitions
and the boilerplate part (and the copyright notice and all).  It would be quite
difficult to invent something simpler that still does significant work.

As is, it may not make the best scaling decisions (in particular, it will tend
to over-provision DL tasks), but at least it should be very predictable.  I might
have added things like up_threshold and sampling_down_factor to it, but I decided
against that, as it would muddy the waters a bit.  Also, in my (light) testing
it looked aggressive enough without them.

Second, note that the majority of work in it is done in the callbacks invoked
from scheduler code paths.  If cpufreq drivers are modified to provide a "fast
frequency update" method that will be practical to invoke from those paths, *all*
of the work in that governor may be done in there.  It's almost like the scheduler
telling the frequency scaling driver directly "this is your frequency, use it".

Next, it is hooked up to the existing cpufreq governor infrastructure, which
allows the sysfs interface that people are used to and familiar with to be
used with it.  That also allows any existing cpufreq driver to be used with
the new governor without modifications, so if you are interested in how it
compares with "ondemand" and "conservative", apply the patch, build the new
governor into the kernel and echo "schedutil" to "scaling_governor" for your CPUs. :-)

[It cannot be made the default cpufreq governor ATM (for a bit of safety), but
that can be changed easily enough if someone wants to.]

Further, it is a "sampling governor" on the surface, but this really is not a
hard requirement.  In fact, it is quite straightforward to notice that util and
max are used directly as obtained from the scheduler without any sampling.  If
my understanding of the relevant CFS code is correct, util already contains
contributions from what happened in the past, so it should be fine to use it as
provided.
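If that reading is right, a toy geometric-decay model captures the idea.  This is only an illustration, not the scheduler's code: the decay constant below is chosen so the signal halves every 32 periods (roughly like the kernel's load tracking), and the real CFS implementation uses fixed-point arithmetic rather than doubles.

```c
#include <assert.h>

/*
 * Toy decaying-average utilization signal: each period, the old value
 * is scaled down by y and the current activity is mixed in, so the
 * result always carries contributions from the past.  The constant is
 * illustrative: y = 2^(-1/32), i.e. a half-life of 32 periods.
 */
static double decay_update(double util, double running)
{
	const double y = 0.97857206;

	return util * y + running * (1.0 - y);
}
```

Under sustained full activity the signal converges toward 1.0 from below, and it decays back toward 0 once the activity stops, which is why a single sampled value is already a usable summary of recent history.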

The sampling rate really plays the role of a rate limit for frequency updates.
The current code needs that because of the way it updates the CPU frequency
(from a work item run in process context), but if (at least some) cpufreq drivers
are taught to update the frequency "on the fly", it should be possible to dispense
with the sampling.  Of course, we may still find that rate limiting CPU
frequency changes is generally useful, but there may be special "emergency"
updates from the scheduler that are reacted to immediately, without
waiting for the whole "sampling period" to pass, for example.
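The rate limit itself boils down to a single comparison.  Here is a standalone sketch (types and names are illustrative, not the kernel's) of the check the patch performs with its (s64) cast, which also turns an update timestamped before the last sample into a no-op:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative rate limiter: act on an update only if at least
 * sample_delay_ns have elapsed since the last committed sample.
 * Computing the delta as a signed value makes "time earlier than
 * the last sample" negative, so such updates are ignored rather
 * than wrapping around to a huge unsigned delta.
 */
static bool update_due(uint64_t time, uint64_t last_sample_time,
		       int64_t sample_delay_ns)
{
	int64_t delta_ns = (int64_t)(time - last_sample_time);

	return delta_ns >= sample_delay_ns;
}
```

With this shape, dropping the rate limit later would amount to making sample_delay_ns zero (or removing the check) once drivers can switch frequencies in place.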

Moreover, the new governor departs from the "let's code for the most complicated
case and the simpler ones will be handled automatically" approach that seems to
have been used throughout cpufreq, as it explicitly makes the "one CPU per cpufreq
policy" case special.  In that case, the values of util and max are not even
stored in the governor's data structures, but used immediately.  That allows it
to reduce its own overhead where possible.

Finally, but not least importantly, the new governor is completely generic.  It
doesn't depend on any system-specific or architecture-specific magic (other than
the policy sharing on systems where groups of CPUs have to be handled together)
to get the job done.  Thus it may be possible to use it as a baseline for more
sophisticated frequency scaling solutions.

That last bit may be particularly important for systems where the only source
of information on the available frequency+voltage configurations of the CPUs
is something like ACPI tables and there is no information on the respective
cost of putting the CPUs into those configurations in terms of energy (and
no information on how much energy is consumed in the idle states available
on the given system).  With so little information on the "power topology" of
the system, so to speak, using the "frequency follows the utilization" rule
may simply be as good as it gets.  Even then (or maybe especially in those
cases), the frequency scaling mechanism should be reasonably lightweight and
effective, if possible, and this governor indicates that this should indeed
be achievable.

There are two ways in which this can be taken further.  The first, quite
obvious, one is to make it possible for cpufreq drivers to provide a method
for switching frequencies from interrupt context, so as to avoid the need to
use process-context work items for that where possible.  The second one,
depending on the first, would be to try to eliminate the sampling rate,
simply update the frequency whenever the utilization changes, and see how far
that takes us.  In addition to that, one may want to play with the
frequency selection formula (e.g. to make it more or less aggressive).
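For reference, the selection rule can be sketched as a standalone function (argument names are illustrative; this mirrors the computation the patch below performs in its utilization update handlers):

```c
#include <assert.h>

/*
 * Sketch of the governor's frequency selection rule: map the util/max
 * ratio linearly onto [min_f, max_f], jumping straight to max_f when
 * util exceeds max (as happens for RT/DL updates) or when the range
 * is degenerate.
 */
static unsigned int pick_freq(unsigned int min_f, unsigned int max_f,
			      unsigned long util, unsigned long max)
{
	if (util > max || min_f >= max_f)
		return max_f;

	return min_f + (unsigned int)(util * (max_f - min_f) / max);
}
```

Making the formula "more or less aggressive" would then mean reshaping this mapping, e.g. biasing it upward below some utilization threshold, without touching the update plumbing around it.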

The patch is on top of the linux-next branch of the linux-pm.git tree (which
should be part of tomorrow's linux-next if all goes well), but it should
also apply on top of the pm-cpufreq-test branch in that tree (which only
contains changes related to cpufreq governors).

Please let me know what you think.

Thanks,
Rafael


* [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-21 23:16 [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
@ 2016-02-21 23:18 ` Rafael J. Wysocki
  2016-02-22 14:16   ` Juri Lelli
  2016-02-24  1:20 ` [RFC/RFT][PATCH v2 0/2] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
  2016-03-03 14:27 ` [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Ingo Molnar
  2 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-21 23:18 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is very simple.  It is almost as simple as it
can be and remain reasonably functional.

The frequency selection formula used by it is essentially the same
as the one used by the "ondemand" governor (although it doesn't use
the additional up_threshold parameter), but instead of computing the
load as the "non-idle CPU time" to "total CPU time" ratio, it takes
the utilization data provided by CFS as input.  More specifically,
it represents "load" as the util/max ratio, where util and max
are the utilization and CPU capacity coming from CFS.

All of the computations are carried out in the utilization update
handlers provided by the new governor.  One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).  The only operation carried
out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
is a __cpufreq_driver_target() call to trigger a frequency update (to
a value already computed beforehand in one of the utilization update
handlers).  This means that, at least for some cpufreq drivers that
can update CPU frequency by doing simple register writes, it should
be possible to set the frequency in the utilization update handlers
too in which case all of the governor's activity would take place in
the scheduler paths invoking cpufreq_update_util() without the need
to run anything in process context.

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes.  That
heavy-handed approach should be replaced with something more
specifically targeted at RT and DL tasks.

To some extent it relies on the common governor code in
cpufreq_governor.c, and it uses that code in a somewhat unusual
way (different from what the "ondemand" and "conservative"
governors do), so some small and rather unobtrusive changes
have to be made in that code and the other governors to support it.

However, after making it possible to set the CPU frequency from
the utilization update handlers, that new governor's interactions
with the common code might be limited to the initialization, cleanup
and handling of sysfs attributes (currently only one attribute,
sampling_rate, is supported in addition to the standard policy
attributes handled by the cpufreq core).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is on top of the linux-next branch of the linux-pm.git tree (which
should be part of tomorrow's linux-next if all goes well), but it should
also apply on top of the pm-cpufreq-test branch in that tree (which only
contains changes related to cpufreq governors).

---
 drivers/cpufreq/Kconfig                |   15 +
 drivers/cpufreq/Makefile               |    1 
 drivers/cpufreq/cpufreq_conservative.c |    3 
 drivers/cpufreq/cpufreq_governor.c     |   21 +-
 drivers/cpufreq/cpufreq_governor.h     |    2 
 drivers/cpufreq/cpufreq_ondemand.c     |    3 
 drivers/cpufreq/cpufreq_schedutil.c    |  249 +++++++++++++++++++++++++++++++++
 7 files changed, 284 insertions(+), 10 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -164,7 +164,7 @@ struct dbs_governor {
 	void (*free)(struct policy_dbs_info *policy_dbs);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
-	void (*start)(struct cpufreq_policy *policy);
+	bool (*start)(struct cpufreq_policy *policy);
 };
 
 static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,249 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/percpu-defs.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_policy {
+	struct policy_dbs_info policy_dbs;
+	unsigned int next_freq;
+	raw_spinlock_t update_lock;  /* For shared policies */
+};
+
+static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
+{
+	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
+}
+
+struct sugov_cpu {
+	struct update_util_data update_util;
+	struct policy_dbs_info *policy_dbs;
+	/* The fields below are only needed when sharing a policy. */
+	unsigned long util;
+	unsigned long max;
+	u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
+				    unsigned long util, unsigned long max,
+				    u64 last_sample_time)
+{
+	struct cpufreq_policy *policy = policy_dbs->policy;
+	unsigned int min_f = policy->cpuinfo.min_freq;
+	unsigned int max_f = policy->cpuinfo.max_freq;
+	unsigned int j;
+
+	if (util > max || min_f >= max_f)
+		return max_f;
+
+	for_each_cpu(j, policy->cpus) {
+		struct sugov_cpu *j_sg_cpu;
+		unsigned long j_util, j_max;
+
+		if (j == smp_processor_id())
+			continue;
+
+		j_sg_cpu = &per_cpu(sugov_cpu, j);
+		/*
+		 * If the CPU was last updated before the previous sampling
+		 * time, we don't take it into account as it probably is idle
+		 * now.
+		 */
+		if (j_sg_cpu->last_update < last_sample_time)
+			continue;
+
+		j_util = j_sg_cpu->util;
+		j_max = j_sg_cpu->max;
+		if (j_util > j_max)
+			return max_f;
+
+		if (j_util * max > j_max * util) {
+			util = j_util;
+			max = j_max;
+		}
+	}
+
+	return min_f + util * (max_f - min_f) / max;
+}
+
+static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
+				unsigned int next_freq)
+{
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	sg_policy->next_freq = next_freq;
+	policy_dbs->last_sample_time = time;
+	policy_dbs->work_in_progress = true;
+	irq_work_queue(&policy_dbs->irq_work);
+}
+
+static void sugov_update_shared(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+	unsigned int next_f;
+	u64 delta_ns, lst;
+
+	raw_spin_lock(&sg_policy->update_lock);
+
+	sg_cpu->util = util;
+	sg_cpu->max = max;
+	sg_cpu->last_update = time;
+
+	if (policy_dbs->work_in_progress)
+		goto out;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	lst = policy_dbs->last_sample_time;
+	delta_ns = time - lst;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		goto out;
+
+	next_f = sugov_next_freq(policy_dbs, util, max, lst);
+
+	sugov_update_commit(policy_dbs, time, next_f);
+
+ out:
+	raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_update_single(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	unsigned int min_f, max_f, next_f;
+	u64 delta_ns;
+
+	if (policy_dbs->work_in_progress)
+		return;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	delta_ns = time - policy_dbs->last_sample_time;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		return;
+
+	min_f = policy_dbs->policy->cpuinfo.min_freq;
+	max_f = policy_dbs->policy->cpuinfo.max_freq;
+	next_f = util > max || min_f >= max_f ? max_f :
+			min_f + util * (max_f - min_f) / max;
+
+	sugov_update_commit(policy_dbs, time, next_f);
+}
+
+/************************** sysfs interface ************************/
+
+gov_show_one_common(sampling_rate);
+
+gov_attr_rw(sampling_rate);
+
+static struct attribute *sugov_attributes[] = {
+	&sampling_rate.attr,
+	NULL
+};
+
+/********************** dbs_governor interface *********************/
+
+static struct policy_dbs_info *sugov_alloc(void)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+	if (!sg_policy)
+		return NULL;
+
+	raw_spin_lock_init(&sg_policy->update_lock);
+	return &sg_policy->policy_dbs;
+}
+
+static void sugov_free(struct policy_dbs_info *policy_dbs)
+{
+	kfree(to_sg_policy(policy_dbs));
+}
+
+static bool sugov_start(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	unsigned int cpu;
+
+	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
+	policy_dbs->last_sample_time = 0;
+
+	for_each_cpu(cpu, policy->cpus) {
+		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+		sg_cpu->policy_dbs = policy_dbs;
+		if (policy_dbs->is_shared) {
+			sg_cpu->util = ULONG_MAX;
+			sg_cpu->max = 0;
+			sg_cpu->last_update = 0;
+			sg_cpu->update_util.func = sugov_update_shared;
+		} else {
+			sg_cpu->update_util.func = sugov_update_single;
+		}
+		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
+	}
+	return false;
+}
+
+static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
+	return policy_dbs->dbs_data->sampling_rate;
+}
+
+static struct dbs_governor schedutil_gov = {
+	.gov = {
+		.name = "schedutil",
+		.governor = cpufreq_governor_dbs,
+		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
+		.owner = THIS_MODULE,
+	},
+	.kobj_type = { .default_attrs = sugov_attributes },
+	.gov_dbs_timer = sugov_set_freq,
+	.alloc = sugov_alloc,
+	.free = sugov_free,
+	.start = sugov_start,
+};
+
+#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
+
+static int __init sugov_init(void)
+{
+	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+static void __exit sugov_exit(void)
+{
+	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+module_init(sugov_init);
+module_exit(sugov_exit);
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -299,12 +299,13 @@ static void cs_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void cs_start(struct cpufreq_policy *policy)
+static bool cs_start(struct cpufreq_policy *policy)
 {
 	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->down_skip = 0;
 	dbs_info->requested_freq = policy->cur;
+	return true;
 }
 
 static struct dbs_governor cs_dbs_gov = {
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -473,9 +473,11 @@ static int cpufreq_governor_init(struct
 	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
 	mutex_init(&dbs_data->mutex);
 
-	ret = gov->init(dbs_data, !policy->governor->initialized);
-	if (ret)
-		goto free_policy_dbs_info;
+	if (gov->init) {
+		ret = gov->init(dbs_data, !policy->governor->initialized);
+		if (ret)
+			goto free_policy_dbs_info;
+	}
 
 	/* policy latency is in ns. Convert it to us first */
 	latency = policy->cpuinfo.transition_latency / 1000;
@@ -511,7 +513,10 @@ static int cpufreq_governor_init(struct
 
 	if (!have_governor_per_policy())
 		gov->gdbs_data = NULL;
-	gov->exit(dbs_data, !policy->governor->initialized);
+
+	if (gov->exit)
+		gov->exit(dbs_data, !policy->governor->initialized);
+
 	kfree(dbs_data);
 
 free_policy_dbs_info:
@@ -545,7 +550,9 @@ static int cpufreq_governor_exit(struct
 		if (!have_governor_per_policy())
 			gov->gdbs_data = NULL;
 
-		gov->exit(dbs_data, policy->governor->initialized == 1);
+		if (gov->exit)
+			gov->exit(dbs_data, policy->governor->initialized == 1);
+
 		mutex_destroy(&dbs_data->mutex);
 		kfree(dbs_data);
 	} else {
@@ -589,9 +596,9 @@ static int cpufreq_governor_start(struct
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 	}
 
-	gov->start(policy);
+	if (gov->start(policy))
+		gov_set_update_util(policy_dbs, sampling_rate);
 
-	gov_set_update_util(policy_dbs, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -402,12 +402,13 @@ static void od_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void od_start(struct cpufreq_policy *policy)
+static bool od_start(struct cpufreq_policy *policy)
 {
 	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	ondemand_powersave_bias_init(policy);
+	return true;
 }
 
 static struct od_ops od_ops = {
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -184,6 +184,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHEDUTIL
+	tristate "'schedutil' cpufreq policy governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  The frequency selection formula used by this governor is analogous
+	  to the one used by 'ondemand', but instead of computing CPU load
+	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+	  utilization data provided by the scheduler as input.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_schedutil.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
 obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
 obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
 obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
 obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
 
 obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o


* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-21 23:18 ` [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
@ 2016-02-22 14:16   ` Juri Lelli
  2016-02-22 23:02     ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Lelli @ 2016-02-22 14:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

Hi Rafael,

thanks for this RFC. I'm going to test it more in the next few days, but
I already have some questions from skimming through it. Please find them
inline below.

On 22/02/16 00:18, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
> 

I guess the first (macro) question is why you decided to go with a
completely new governor, where "new" here is w.r.t. the sched-freq solution.
AFAICT, it is true that your solution builds directly on top of the
latest changes to the cpufreq core and governors, but it also seems to have
more than a few points in common with sched-freq, and sched-freq has
been discussed and evaluated for quite some time already.  Also, it
appears to me that they both share (or might encounter in the
future as development progresses) the same kinds of problems, like for
example the fact that we can't trigger OPP changes from scheduler
context ATM.

Don't get me wrong. I think that looking at different ways to solve a
problem is always beneficial, since I guess that the goal in the end is
to come up with something that suits everybody's needs. I was only
curious about your thoughts on sched-freq. But we can also wait for the
next RFC from Steve for this macro question. :-)

[...]

> +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
> +				unsigned int next_freq)
> +{
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +
> +	sg_policy->next_freq = next_freq;
> +	policy_dbs->last_sample_time = time;
> +	policy_dbs->work_in_progress = true;
> +	irq_work_queue(&policy_dbs->irq_work);

Here we basically use the system wq to be able to do the frequency transition
in process context.  CFS is probably fine with this, but don't you think
we might get into trouble when, in the future, we want to service
RT/DL requests more properly and they end up being serviced
together with all the other wq users and at !RT priority?

> +}
> +
> +static void sugov_update_shared(struct update_util_data *data, u64 time,
> +				unsigned long util, unsigned long max)
> +{

We don't have a way to tell which scheduling class this is coming
from, do we?  And if that is true, can't a request from CFS overwrite
RT/DL go-to-max requests?

[...]

Anyway, I'm going to start using our existing testing infrastructure
used to evaluate sched-freq to try to better understand the implications
of your approach.

Best,

- Juri


* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-22 14:16   ` Juri Lelli
@ 2016-02-22 23:02     ` Rafael J. Wysocki
  2016-02-23  7:20       ` Steve Muckle
  2016-02-25 11:01       ` Juri Lelli
  0 siblings, 2 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-22 23:02 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Steve Muckle,
	Ingo Molnar

On Mon, Feb 22, 2016 at 3:16 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Rafael,

Hi,

> thanks for this RFC. I'm going to test it more in the next few days, but
> I already have some questions from skimming through it. Please find them
> inline below.
>
> On 22/02/16 00:18, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Add a new cpufreq scaling governor, called "schedutil", that uses
>> scheduler-provided CPU utilization information as input for making
>> its decisions.
>>
>
> I guess the first (macro) question is why did you decide to go with a
> complete new governor, where new here is w.r.t. the sched-freq solution.

Probably the most comprehensive answer to this question is my intro
message: http://marc.info/?l=linux-pm&m=145609673008122&w=2

The executive summary is probably that this was the most
straightforward way to use the scheduler-provided numbers in cpufreq
that I could think about.

> AFAICT, it is true that your solution directly builds on top of the
> latest changes to cpufreq core and governor, but it also seems to have
> more than a few points in common with sched-freq,

That surely isn't a drawback, is it?

If two people come to the same conclusions in different ways, that's
an indication that the conclusions may actually be correct.

> and sched-freq has been discussed and evaluated for already quite some time.

Yes, it has.

Does this mean that no one is allowed to try any alternatives to it now?

> Also, it appears to me that they both shares (or they might encounter in the
> future as development progresses) the same kind of problems, like for
> example the fact that we can't trigger opp changes from scheduler
> context ATM.

"Give them a finger and they will ask for the hand."

If you read my intro message linked above, you'll find a paragraph or
two about that in it.

And the short summary is that I have a plan to actually implement that
feature in the schedutil governor at least for the ACPI cpufreq
driver.  It shouldn't be too difficult to do either AFAICS.  So it is
not "we can't", but rather "we haven't implemented that yet" in this
particular case.

I may not be able to do that in the next few days, as I have other
things to do too, but you may expect to see that done at one point.

So it's not a fundamental issue or anything, it's just that I haven't
done that *yet* at this point, OK?

> Don't get me wrong. I think that looking at different ways to solve a
> problem is always beneficial, since I guess that the goal in the end is
> to come up with something that suits everybody's needs.

Precisely.

> I was only curious about your thoughts on sched-freq. But we can also wait for the
> next RFC from Steve for this macro question. :-)

Right, but I have some thoughts anyway.

My goal, that may be quite different from yours, is to reduce the
cpufreq's overhead as much as I possibly can.  If I have to change the
way it drives the CPU frequency selection to achieve that goal, I will
do that, but if that can stay the way it is, that's fine too.

Some progress has been made already here: we have dealt with the
timers for good now I think.

This patch deals with the overhead associated with the load tracking
carried by "traditional" cpufreq governors and with a couple of
questionable things done by "ondemand" in addition to that (which is
one of the reasons why I didn't want to modify "ondemand" itself for
now).

The next step will be to teach the governor and the ACPI driver to
switch CPU frequencies in the scheduler context, without spawning
extra work items etc.

Finally, the sampling should go away and that's where I want it to be.

I just don't want to run extra code when that's not necessary and I
want things to stay simple when that's as good as it can get.  If
sched-freq can pull that off for me, that's fine, but can it really?

> [...]
>
>> +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
>> +                             unsigned int next_freq)
>> +{
>> +     struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
>> +
>> +     sg_policy->next_freq = next_freq;
>> +     policy_dbs->last_sample_time = time;
>> +     policy_dbs->work_in_progress = true;
>> +     irq_work_queue(&policy_dbs->irq_work);
>
> Here we basically use the system wq to be able to do the freq transition
> in process context. CFS is probably fine with this, but don't you think
> we might get into troubles when, in the future, we will want to service
> RT/DL requests more properly and they will end up being serviced
> together with all the others wq users and at !RT priority?

That may be regarded as a problem, but I'm not sure why you're talking
about it in the context of this particular patch.  That problem has
been there forever in cpufreq: in theory RT tasks may stall frequency
changes indefinitely.

Is the problem real, though?

Suppose that actually happens and there are RT tasks effectively
stalling frequency updates.  In that case some other important
activities of the kernel would be stalled too.  Pretty much everything
run out of regular workqueues would be affected.

OK, so do we need to address that somehow?  Maybe, but then we should
take care of all of the potentially affected stuff, not just the
frequency changes, shouldn't we?

Moreover, I don't really think that having a separate RT process for
every CPU in the system just for this purpose is the right approach.
It's just way overkill IMO and doesn't cover the other potentially
affected stuff at all.

>> +}
>> +
>> +static void sugov_update_shared(struct update_util_data *data, u64 time,
>> +                             unsigned long util, unsigned long max)
>> +{
>
> We don't have a way to tell from which scheduling class this is coming
> from, do we? And if that is true, can't a request from CFS overwrite
> RT/DL "go to max" requests?

Yes, it can, but we only get updated when the CPU is updating its own
rq, so if my understanding of things is correct, an update from a
different sched class means that the given class is in charge now.

> [...]
>
> Anyway, I'm going to start using our existing testing infrastructure
> used to evaluate sched-freq to try to better understand the implications
> of your approach.

OK, thanks!

Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-22 23:02     ` Rafael J. Wysocki
@ 2016-02-23  7:20       ` Steve Muckle
  2016-02-24  1:38         ` Rafael J. Wysocki
  2016-02-25 11:01       ` Juri Lelli
  1 sibling, 1 reply; 32+ messages in thread
From: Steve Muckle @ 2016-02-23  7:20 UTC (permalink / raw)
  To: Rafael J. Wysocki, Juri Lelli
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Ingo Molnar

On 02/22/2016 03:02 PM, Rafael J. Wysocki wrote:
>> I guess the first (macro) question is why did you decide to go with a
>> complete new governor, where new here is w.r.t. the sched-freq solution.
> 
> Probably the most comprehensive answer to this question is my intro
> message: http://marc.info/?l=linux-pm&m=145609673008122&w=2
> 
> The executive summary is probably that this was the most
> straightforward way to use the scheduler-provided numbers in cpufreq
> that I could think about.
> 
>> AFAICT, it is true that your solution directly builds on top of the
>> latest changes to cpufreq core and governor, but it also seems to have
>> more than a few points in common with sched-freq,
> 
> That surely isn't a drawback, is it?
>
> If two people come to the same conclusions in different ways, that's
> an indication that the conclusions may actually be correct.
> 
>> and sched-freq has been discussed and evaluated for already quite some time.
> 
> Yes, it has.
> 
> Does this mean that no one is allowed to try any alternatives to it now?

As mentioned above they are rather similar so it doesn't really seem
like an alternative per se, more like a reimplementation.

Why do you feel a new starting point for this problem is needed? Are
there specific technical concerns? I see you started looking over the
latest schedfreq RFC, thank you for your comments thus far. We'd really
appreciate your continued feedback and the chance to collaborate on it
to move it forward. I and others have put a fair bit of effort into it
over the last year or so and will happily and earnestly work to address
any shortcomings you raise.

I will review your RFC in the next day or so as well.

...
> My goal, that may be quite different from yours, is to reduce the
> cpufreq's overhead as much as I possibly can.  If I have to change the
> way it drives the CPU frequency selection to achieve that goal, I will
> do that, but if that can stay the way it is, that's fine too.

Our primary goal has been simply to achieve functional scheduler-driven
CPU frequency control with equivalent or better power and performance
than what is available today. Reduction of cpufreq overhead fits within
this goal (and may be required) so no conflict here.

thanks,
Steve

* [RFC/RFT][PATCH v2 0/2] cpufreq: New governor based on scheduler-provided utilization data
  2016-02-21 23:16 [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
  2016-02-21 23:18 ` [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
@ 2016-02-24  1:20 ` Rafael J. Wysocki
  2016-02-24  1:22   ` [RFC/RFT][PATCH v2 1/2] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
  2016-02-24  1:28   ` [RFC/RFT][PATCH v2 2/2] cpufreq: schedutil: Switching frequencies from interrupt context Rafael J. Wysocki
  2016-03-03 14:27 ` [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Ingo Molnar
  2 siblings, 2 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-24  1:20 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

On Monday, February 22, 2016 12:16:11 AM Rafael J. Wysocki wrote:
> Hi Everyone,
> 
> Usually, I don't send introductory messages for single patches, but this
> one is an exception, because I didn't want to put all of my considerations
> into the patch changelog.
> 
> So I have been told a few times already that I should not introduce
> interfaces passing arguments that aren't used in the current code and without
> telling anyone what my plans for using those arguments in the future may be
> (although IMO that would not be too hard to figure out), so here's an example.
> 
> Juri, that's not what you may have expected.  In fact, I didn't expect it to
> look like this either when I started to think about it.  Initially, I was
> considering to modify the existing governors to use the utilization data
> somehow, but then I realized that it would make them behave differently and
> that might confuse some.
> 
> So here it is: a new functional cpufreq governor.  It is very simple (arguably
> on the verge of being overly simplistic), but it gets the job done.  I have only
> tested it (very lightly) on a system with one CPU per cpufreq policy (so the
> "shared" path in it is admittedly untested), but in that simple case the
> frequency evidently follows the CPU utilization as expected.
> 
> The reason why I didn't post it earlier was because I needed to clean up the
> existing governor code enough to be able to do anything new on top of it (you
> might have noticed the cleanup work going during the last couple of weeks).
> 
> Now, there are a few observations to be made about it that may be interesting
> to someone (they definitely are interesting to me).  Some of them are mentioned
> in the patch changelog too.
> 
> First off, note how simple it is: 250 lines of code including struct definitions
> and the boilerplate part (and the copyright notice and all).  It might be quite
> difficult to invent something simpler that still does significant work.
> 
> As is, it may not make the best scaling decisions (in particular, it will tend
> to over-provision DL tasks), but at least it should be very predictable.  I might
> have added things like up_threshold and sampling_down_factor to it, but I decided
> against doing that as it would muddy the waters a bit.  Also, when I had tested
> it, it looked aggressive enough to me without those.
> 
> Second, note that the majority of work in it is done in the callbacks invoked
> from scheduler code paths.  If cpufreq drivers are modified to provide a "fast
> frequency update" method that will be practical to invoke from those paths, *all*
> of the work in that governor may be done in there.  It's almost like the scheduler
> telling the frequency scaling driver directly "this is your frequency, use it".
> 
> Next, it is hooked up to the existing cpufreq governor infrastructure which
> allows the existing sysfs interface that people are used to and familiar with to
> be used with it.  That also allows any existing cpufreq drivers to be used with
> the new governor without any modifications, so if you are interested in how it
> compares with "ondemand" and "conservative", apply the patch, build the new
> governor into the kernel and echo "schedutil" to "scaling_governor" for your CPUs. :-)
> 
> [It cannot be made the default cpufreq governor ATM (for a bit of safety), but
> that can be changed easily enough if someone wants to.]
> 
> Further, it is a "sampling governor" on the surface, but this really is not a
> hard requirement.  In fact, it is quite straightforward to notice that util and
> max are used directly as obtained from the scheduler without any sampling.  If
> my understanding of the relevant CFS code is correct, util already contains
> contributions from what happened in the past, so it should be fine to use it as
> provided.
> 
> The sampling rate really plays the role of a rate limit for frequency updates.
> The current code rather needs that because of the way it updates the CPU frequency
> (from a work item run in process context), but if (at least some) cpufreq drivers
> are taught to update the frequency "on the fly", it should be possible to dispense
> with the sampling.  Of course, we still may find that rate limiting CPU
> frequency changes is generally useful, but there may be special "emergency"
> updates from the scheduler that will be reacted to immediately without
> waiting for the whole "sampling period" to pass, for example.
> 
> Moreover, the new governor departs from the "let's code for the most complicated
> case and the simpler ones will be handled automatically" approach that seems to
> have been used throughout cpufreq, as it explicitly makes the "one CPU per cpufreq
> policy" case special.  In that case, the values of util and max are not even
> stored in the governor's data structures, but used immediately.  That allows it
> to reduce the extra overhead from itself when possible.
> 
> Finally, but not least importantly, the new governor is completely generic.  It
> doesn't depend on any system-specific or architecture-specific magic (other than
> the policy sharing on systems where groups of CPUs have to be handled together)
> to get the job done.  Thus it may be possible to use it as a base line for more
> sophisticated frequency scaling solutions.
> 
> That last bit may be particularly important for systems where the only source
> of information on the available frequency+voltage configurations of the CPUs
> is something like ACPI tables and there is no information on the respective
> cost of putting the CPUs into those configurations in terms of energy (and
> no information on how much energy is consumed in the idle states available
> on the given system).  With so little information on the "power topology" of
> the system, so to speak, using the "frequency follows the utilization" rule
> may simply be as good as it gets.  Even then (or maybe especially in those
> cases), the frequency scaling mechanism should be reasonably lightweight and
> effective, if possible, and this governor indicates that, indeed, that should
> be possible to achieve.
> 
> There are two ways in which this can be taken further.  The first, quite
> obvious, one is to make it possible for cpufreq drivers to provide a method
> for switching frequencies from interrupt context so as to avoid the need to
> use the process-context work items for that, where possible.  The second one,
> depending on the former, would be to try to eliminate the sampling rate and
> simply update the frequency whenever the utilization changes and see how far
> that would take us.  In addition to that, one may want to play with the
> frequency selection formula (eg. to make it more or less aggressive etc).
> 
> The patch is on top of the linux-next branch of the linux-pm.git tree (that
> should be part of the tomorrow's linux-next if all goes well), but it should
> also apply on top of the pm-cpufreq-test branch in that tree (which only
> contains changes related to cpufreq governors).

I have a new version of this with one modification and a patch implementing
frequency changes from interrupt context on top of it.  Both patches will
follow.

Thanks,
Rafael

* [RFC/RFT][PATCH v2 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-02-24  1:20 ` [RFC/RFT][PATCH v2 0/2] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
@ 2016-02-24  1:22   ` Rafael J. Wysocki
  2016-02-25 21:14     ` [RFC/RFT][PATCH v4 " Rafael J. Wysocki
  2016-02-24  1:28   ` [RFC/RFT][PATCH v2 2/2] cpufreq: schedutil: Switching frequencies from interrupt context Rafael J. Wysocki
  1 sibling, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-24  1:22 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is very simple.  It is almost as simple as it
can be and remain reasonably functional.

The frequency selection formula it uses is essentially the same as
the one used by the "ondemand" governor (minus the additional
up_threshold parameter), but instead of computing the load as the
"non-idle CPU time" to "total CPU time" ratio, it takes the
utilization data provided by CFS as input.  More specifically, it
represents "load" as the util/max ratio, where util and max
are the utilization and CPU capacity coming from CFS.

All of the computations are carried out in the utilization update
handlers provided by the new governor.  One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).  The only operation carried
out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
is a __cpufreq_driver_target() call to trigger a frequency update (to
a value already computed beforehand in one of the utilization update
handlers).  This means that, at least for some cpufreq drivers that
can update CPU frequency by doing simple register writes, it should
be possible to set the frequency in the utilization update handlers
too in which case all of the governor's activity would take place in
the scheduler paths invoking cpufreq_update_util() without the need
to run anything in process context.

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes.  That
heavy-handed approach should be replaced with something more
specifically targeted at RT and DL tasks.

To some extent it relies on the common governor code in
cpufreq_governor.c and it uses that code in a somewhat unusual
way (different from what the "ondemand" and "conservative"
governors do), so some small and rather unintrusive changes
have to be made in that code and the other governors to support it.

However, after making it possible to set the CPU frequency from
the utilization update handlers, that new governor's interactions
with the common code might be limited to the initialization, cleanup
and handling of sysfs attributes (currently only one attribute,
sampling_rate, is supported in addition to the standard policy
attributes handled by the cpufreq core).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Changes from v1:
- Use policy->min and policy->max as min/max frequency in computations.

---
 drivers/cpufreq/Kconfig                |   15 +
 drivers/cpufreq/Makefile               |    1 
 drivers/cpufreq/cpufreq_conservative.c |    3 
 drivers/cpufreq/cpufreq_governor.c     |   21 +-
 drivers/cpufreq/cpufreq_governor.h     |    2 
 drivers/cpufreq/cpufreq_ondemand.c     |    3 
 drivers/cpufreq/cpufreq_schedutil.c    |  249 +++++++++++++++++++++++++++++++++
 7 files changed, 284 insertions(+), 10 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -164,7 +164,7 @@ struct dbs_governor {
 	void (*free)(struct policy_dbs_info *policy_dbs);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
-	void (*start)(struct cpufreq_policy *policy);
+	bool (*start)(struct cpufreq_policy *policy);
 };
 
 static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,249 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/percpu-defs.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_policy {
+	struct policy_dbs_info policy_dbs;
+	unsigned int next_freq;
+	raw_spinlock_t update_lock;  /* For shared policies */
+};
+
+static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
+{
+	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
+}
+
+struct sugov_cpu {
+	struct update_util_data update_util;
+	struct policy_dbs_info *policy_dbs;
+	/* The fields below are only needed when sharing a policy. */
+	unsigned long util;
+	unsigned long max;
+	u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
+				    unsigned long util, unsigned long max,
+				    u64 last_sample_time)
+{
+	struct cpufreq_policy *policy = policy_dbs->policy;
+	unsigned int min_f = policy->min;
+	unsigned int max_f = policy->max;
+	unsigned int j;
+
+	if (util > max || min_f >= max_f)
+		return max_f;
+
+	for_each_cpu(j, policy->cpus) {
+		struct sugov_cpu *j_sg_cpu;
+		unsigned long j_util, j_max;
+
+		if (j == smp_processor_id())
+			continue;
+
+		j_sg_cpu = &per_cpu(sugov_cpu, j);
+		/*
+		 * If the CPU was last updated before the previous sampling
+		 * time, we don't take it into account as it probably is idle
+		 * now.
+		 */
+		if (j_sg_cpu->last_update < last_sample_time)
+			continue;
+
+		j_util = j_sg_cpu->util;
+		j_max = j_sg_cpu->max;
+		if (j_util > j_max)
+			return max_f;
+
+		if (j_util * max > j_max * util) {
+			util = j_util;
+			max = j_max;
+		}
+	}
+
+	return min_f + util * (max_f - min_f) / max;
+}
+
+static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
+				unsigned int next_freq)
+{
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	sg_policy->next_freq = next_freq;
+	policy_dbs->last_sample_time = time;
+	policy_dbs->work_in_progress = true;
+	irq_work_queue(&policy_dbs->irq_work);
+}
+
+static void sugov_update_shared(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+	unsigned int next_f;
+	u64 delta_ns, lst;
+
+	raw_spin_lock(&sg_policy->update_lock);
+
+	sg_cpu->util = util;
+	sg_cpu->max = max;
+	sg_cpu->last_update = time;
+
+	if (policy_dbs->work_in_progress)
+		goto out;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	lst = policy_dbs->last_sample_time;
+	delta_ns = time - lst;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		goto out;
+
+	next_f = sugov_next_freq(policy_dbs, util, max, lst);
+
+	sugov_update_commit(policy_dbs, time, next_f);
+
+ out:
+	raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_update_single(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	unsigned int min_f, max_f, next_f;
+	u64 delta_ns;
+
+	if (policy_dbs->work_in_progress)
+		return;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	delta_ns = time - policy_dbs->last_sample_time;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		return;
+
+	min_f = policy_dbs->policy->min;
+	max_f = policy_dbs->policy->max;
+	next_f = util > max || min_f >= max_f ? max_f :
+			min_f + util * (max_f - min_f) / max;
+
+	sugov_update_commit(policy_dbs, time, next_f);
+}
+
+/************************** sysfs interface ************************/
+
+gov_show_one_common(sampling_rate);
+
+gov_attr_rw(sampling_rate);
+
+static struct attribute *sugov_attributes[] = {
+	&sampling_rate.attr,
+	NULL
+};
+
+/********************** dbs_governor interface *********************/
+
+static struct policy_dbs_info *sugov_alloc(void)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+	if (!sg_policy)
+		return NULL;
+
+	raw_spin_lock_init(&sg_policy->update_lock);
+	return &sg_policy->policy_dbs;
+}
+
+static void sugov_free(struct policy_dbs_info *policy_dbs)
+{
+	kfree(to_sg_policy(policy_dbs));
+}
+
+static bool sugov_start(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	unsigned int cpu;
+
+	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
+	policy_dbs->last_sample_time = 0;
+
+	for_each_cpu(cpu, policy->cpus) {
+		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+		sg_cpu->policy_dbs = policy_dbs;
+		if (policy_dbs->is_shared) {
+			sg_cpu->util = ULONG_MAX;
+			sg_cpu->max = 0;
+			sg_cpu->last_update = 0;
+			sg_cpu->update_util.func = sugov_update_shared;
+		} else {
+			sg_cpu->update_util.func = sugov_update_single;
+		}
+		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
+	}
+	return false;
+}
+
+static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
+	return policy_dbs->dbs_data->sampling_rate;
+}
+
+static struct dbs_governor schedutil_gov = {
+	.gov = {
+		.name = "schedutil",
+		.governor = cpufreq_governor_dbs,
+		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
+		.owner = THIS_MODULE,
+	},
+	.kobj_type = { .default_attrs = sugov_attributes },
+	.gov_dbs_timer = sugov_set_freq,
+	.alloc = sugov_alloc,
+	.free = sugov_free,
+	.start = sugov_start,
+};
+
+#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
+
+static int __init sugov_init(void)
+{
+	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+static void __exit sugov_exit(void)
+{
+	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+module_init(sugov_init);
+module_exit(sugov_exit);
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -299,12 +299,13 @@ static void cs_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void cs_start(struct cpufreq_policy *policy)
+static bool cs_start(struct cpufreq_policy *policy)
 {
 	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->down_skip = 0;
 	dbs_info->requested_freq = policy->cur;
+	return true;
 }
 
 static struct dbs_governor cs_dbs_gov = {
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -472,9 +472,11 @@ static int cpufreq_governor_init(struct
 	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
 	mutex_init(&dbs_data->mutex);
 
-	ret = gov->init(dbs_data, !policy->governor->initialized);
-	if (ret)
-		goto free_policy_dbs_info;
+	if (gov->init) {
+		ret = gov->init(dbs_data, !policy->governor->initialized);
+		if (ret)
+			goto free_policy_dbs_info;
+	}
 
 	/* policy latency is in ns. Convert it to us first */
 	latency = policy->cpuinfo.transition_latency / 1000;
@@ -510,7 +512,10 @@ static int cpufreq_governor_init(struct
 
 	if (!have_governor_per_policy())
 		gov->gdbs_data = NULL;
-	gov->exit(dbs_data, !policy->governor->initialized);
+
+	if (gov->exit)
+		gov->exit(dbs_data, !policy->governor->initialized);
+
 	kfree(dbs_data);
 
 free_policy_dbs_info:
@@ -544,7 +549,9 @@ static int cpufreq_governor_exit(struct
 		if (!have_governor_per_policy())
 			gov->gdbs_data = NULL;
 
-		gov->exit(dbs_data, policy->governor->initialized == 1);
+		if (gov->exit)
+			gov->exit(dbs_data, policy->governor->initialized == 1);
+
 		mutex_destroy(&dbs_data->mutex);
 		kfree(dbs_data);
 	} else {
@@ -588,9 +595,9 @@ static int cpufreq_governor_start(struct
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 	}
 
-	gov->start(policy);
+	if (gov->start(policy))
+		gov_set_update_util(policy_dbs, sampling_rate);
 
-	gov_set_update_util(policy_dbs, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -402,12 +402,13 @@ static void od_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void od_start(struct cpufreq_policy *policy)
+static bool od_start(struct cpufreq_policy *policy)
 {
 	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	ondemand_powersave_bias_init(policy);
+	return true;
 }
 
 static struct od_ops od_ops = {
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -184,6 +184,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHEDUTIL
+	tristate "'schedutil' cpufreq policy governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  The frequency selection formula used by this governor is analogous
+	  to the one used by 'ondemand', but instead of computing CPU load
+	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+	  utilization data provided by the scheduler as input.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_schedutil.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
 obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
 obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
 obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
 obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
 
 obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o

* [RFC/RFT][PATCH v2 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-24  1:20 ` [RFC/RFT][PATCH v2 0/2] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
  2016-02-24  1:22   ` [RFC/RFT][PATCH v2 1/2] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
@ 2016-02-24  1:28   ` Rafael J. Wysocki
  2016-02-24 23:30     ` [RFC/RFT][PATCH v3 " Rafael J. Wysocki
  2016-02-25 21:20     ` [RFC/RFT][PATCH v4 " Rafael J. Wysocki
  1 sibling, 2 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-24  1:28 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
and the schedutil governor to use that method if available.

Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context via
new helper function cpufreq_driver_fast_switch().

Modify the schedutil governor to call cpufreq_driver_fast_switch()
from its sugov_update_commit() function and avoid queuing up the
irq_work if that is successful.

Implement the ->fast_switch callback in the ACPI cpufreq driver
(with a limited coverage for the time being).

In addition to the above, cpufreq_governor_limits() is modified so
it doesn't call __cpufreq_driver_target() to enforce the new limits
immediately as they will be taken into account anyway during the next
update from the scheduler.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This actually is the first version of the $subject patch, but since it belongs
to the schedutil governor combo, I've given it the v2.

Please note that this is a prototype, so it may not be done the way I'll want
to do it finally, although ATM I don't quite see how that might be done in
a significantly different way.  Ideas welcome, however.

It works on my test machine and doesn't even break powertop.

Thanks,
Rafael

---
 drivers/cpufreq/acpi-cpufreq.c      |   63 ++++++++++++++++++++++++++++++++++++
 drivers/cpufreq/cpufreq.c           |   35 ++++++++++++++++++++
 drivers/cpufreq/cpufreq_governor.c  |    8 ----
 drivers/cpufreq/cpufreq_schedutil.c |   20 ++++++++---
 include/linux/cpufreq.h             |    4 ++
 5 files changed, 117 insertions(+), 13 deletions(-)

Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -70,6 +70,7 @@ struct acpi_cpufreq_data {
 	unsigned int cpu_feature;
 	unsigned int acpi_perf_cpu;
 	cpumask_var_t freqdomain_cpus;
+	void (*cpu_freq_fast_write)(u32 val);
 };
 
 /* acpi_perf_data is a pointer to percpu data. */
@@ -243,6 +244,15 @@ static unsigned extract_freq(u32 val, st
 	}
 }
 
+void cpu_freq_fast_write_intel(u32 val)
+{
+	u32 lo, hi;
+
+	rdmsr(MSR_IA32_PERF_CTL, lo, hi);
+	lo = (lo & ~INTEL_MSR_RANGE) | (val & INTEL_MSR_RANGE);
+	wrmsr(MSR_IA32_PERF_CTL, lo, hi);
+}
+
 struct msr_addr {
 	u32 reg;
 };
@@ -484,6 +494,53 @@ out:
 	return result;
 }
 
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+				      unsigned int target_freq)
+{
+	struct acpi_cpufreq_data *data = policy->driver_data;
+	struct cpufreq_frequency_table *entry;
+	struct acpi_processor_performance *perf;
+	unsigned int uninitialized_var(next_perf_state);
+	unsigned int uninitialized_var(next_freq);
+	unsigned int best_diff;
+
+	if (!data->cpu_freq_fast_write)
+		return CPUFREQ_ENTRY_INVALID;
+
+	for (entry = data->freq_table, best_diff = UINT_MAX;
+	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
+		unsigned int diff, freq = entry->frequency;
+
+		if (freq == CPUFREQ_ENTRY_INVALID)
+			continue;
+
+		diff = abs(freq - target_freq);
+		if (diff >= best_diff)
+			continue;
+
+		best_diff = diff;
+		next_perf_state = entry->driver_data;
+		next_freq = freq;
+		if (best_diff == 0)
+			goto found;
+	}
+	if (best_diff == UINT_MAX)
+		return CPUFREQ_ENTRY_INVALID;
+
+ found:
+	perf = to_perf_data(data);
+	if (perf->state == next_perf_state) {
+		if (unlikely(data->resume))
+			data->resume = 0;
+		else
+			return next_freq;
+	}
+
+	data->cpu_freq_fast_write(perf->states[next_perf_state].control);
+	perf->state = next_perf_state;
+	return next_freq;
+}
+
 static unsigned long
 acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
 {
@@ -745,6 +802,7 @@ static int acpi_cpufreq_cpu_init(struct
 		pr_debug("HARDWARE addr space\n");
 		if (check_est_cpu(cpu)) {
 			data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE;
+			data->cpu_freq_fast_write = cpu_freq_fast_write_intel;
 			break;
 		}
 		if (check_amd_hwpstate_cpu(cpu)) {
@@ -760,6 +818,10 @@ static int acpi_cpufreq_cpu_init(struct
 		goto err_unreg;
 	}
 
+	if (acpi_pstate_strict || (policy_is_shared(policy) &&
+	    policy->shared_type != CPUFREQ_SHARED_TYPE_ANY))
+		data->cpu_freq_fast_write = NULL;
+
 	data->freq_table = kzalloc(sizeof(*data->freq_table) *
 		    (perf->state_count+1), GFP_KERNEL);
 	if (!data->freq_table) {
@@ -894,6 +956,7 @@ static struct freq_attr *acpi_cpufreq_at
 static struct cpufreq_driver acpi_cpufreq_driver = {
 	.verify		= cpufreq_generic_frequency_table_verify,
 	.target_index	= acpi_cpufreq_target,
+	.fast_switch	= acpi_cpufreq_fast_switch,
 	.bios_limit	= acpi_processor_get_bios_limit,
 	.init		= acpi_cpufreq_cpu_init,
 	.exit		= acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -271,6 +271,8 @@ struct cpufreq_driver {
 				  unsigned int relation);	/* Deprecated */
 	int		(*target_index)(struct cpufreq_policy *policy,
 					unsigned int index);
+	unsigned int	(*fast_switch)(struct cpufreq_policy *policy,
+				       unsigned int target_freq);
 	/*
 	 * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
 	 * unset.
@@ -485,6 +487,8 @@ struct cpufreq_governor {
 };
 
 /* Pass a target to the cpufreq driver */
+bool cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+				unsigned int target_freq);
 int cpufreq_driver_target(struct cpufreq_policy *policy,
 				 unsigned int target_freq,
 				 unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -1814,6 +1814,41 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
  *                              GOVERNORS                            *
  *********************************************************************/
 
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * It is guaranteed that this function will never be called twice in parallel
+ * for the same policy and that it will not be called in parallel with either
+ * ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback, the hardware configuration must be preserved.
+ *
+ * Return 'true' on success and 'false' on failure.
+ */
+bool cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+				unsigned int target_freq)
+{
+	if (target_freq == policy->cur)
+		return true;
+
+	if (cpufreq_driver->fast_switch) {
+		unsigned int freq;
+
+		freq = cpufreq_driver->fast_switch(policy, target_freq);
+		if (freq != CPUFREQ_ENTRY_INVALID) {
+			policy->cur = freq;
+			trace_cpu_frequency(freq, smp_processor_id());
+			return true;
+		}
+	}
+	return false;
+}
+
 /* Must set freqs->new to intermediate frequency */
 static int __target_intermediate(struct cpufreq_policy *policy,
 				 struct cpufreq_freqs *freqs, int index)
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_schedutil.c
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -83,12 +83,22 @@ static unsigned int sugov_next_freq(stru
 static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
 				unsigned int next_freq)
 {
-	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
-
-	sg_policy->next_freq = next_freq;
 	policy_dbs->last_sample_time = time;
-	policy_dbs->work_in_progress = true;
-	irq_work_queue(&policy_dbs->irq_work);
+
+	if (cpufreq_driver_fast_switch(policy_dbs->policy, next_freq)) {
+		/*
+		 * Restore the sample delay in case it has been set to 0
+		 * from sysfs in the meantime.
+		 */
+		gov_update_sample_delay(policy_dbs,
+					policy_dbs->dbs_data->sampling_rate);
+	} else {
+		struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+		sg_policy->next_freq = next_freq;
+		policy_dbs->work_in_progress = true;
+		irq_work_queue(&policy_dbs->irq_work);
+	}
 }
 
 static void sugov_update_shared(struct update_util_data *data, u64 time,
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -612,16 +612,8 @@ static int cpufreq_governor_limits(struc
 	struct policy_dbs_info *policy_dbs = policy->governor_data;
 
 	mutex_lock(&policy_dbs->timer_mutex);
-
-	if (policy->max < policy->cur)
-		__cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
-	else if (policy->min > policy->cur)
-		__cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_L);
-
 	gov_update_sample_delay(policy_dbs, 0);
-
 	mutex_unlock(&policy_dbs->timer_mutex);
-
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-23  7:20       ` Steve Muckle
@ 2016-02-24  1:38         ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-24  1:38 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Juri Lelli, Linux PM list,
	Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Ingo Molnar

On Monday, February 22, 2016 11:20:33 PM Steve Muckle wrote:
> On 02/22/2016 03:02 PM, Rafael J. Wysocki wrote:
> >> I guess the first (macro) question is why did you decide to go with a
> >> complete new governor, where new here is w.r.t. the sched-freq solution.
> > 
> > Probably the most comprehensive answer to this question is my intro
> > message: http://marc.info/?l=linux-pm&m=145609673008122&w=2
> > 
> > The executive summary is probably that this was the most
> > straightforward way to use the scheduler-provided numbers in cpufreq
> > that I could think about.
> > 
> >> AFAICT, it is true that your solution directly builds on top of the
> >> latest changes to cpufreq core and governor, but it also seems to have
> >> more than a few points in common with sched-freq,
> > 
> > That surely isn't a drawback, is it?
> >
> > If two people come to the same conclusions in different ways, that's
> > an indication that the conclusions may actually be correct.
> > 
> >> and sched-freq has been discussed and evaluated for already quite some time.
> > 
> > Yes, it has.
> > 
> > Does this mean that no one is allowed to try any alternatives to it now?
> 
> As mentioned above they are rather similar so it doesn't really seem
> like an alternative per se, more like a reimplementation.

If that is the case, I don't quite see where or what the problem is.

I posted this mostly because you and Juri were complaining that I wasn't
telling anyone about how I was going to use util and max going forward.

So this is how I'd like to use them, more or less.  If that is in alignment
with the changes you want to make, all should be fine.

> Why do you feel a new starting point for this problem is needed? Are
> there specific technical concerns?

Well, let me comment on the patches you've sent (although maybe not today, as
I'm quite tired already and I'm afraid that my comments may not be much to the
point).

That aside, this was rather an attempt to see what could be done on top of
recent fixes in the core and how complicated it would be.

> I see you started looking over the
> latest schedfreq RFC, thank you for your comments thus far. We'd really
> appreciate your continued feedback and the chance to collaborate on it
> to move it forward. I and others have put a fair bit of effort into it
> over the last year or so and will happily and earnestly work to address
> any shortcomings you raise.
> 
> I will review your RFC in the next day or so as well.
> 
> ...
> > My goal, that may be quite different from yours, is to reduce the
> > cpufreq's overhead as much as I possibly can.  If I have to change the
> > way it drives the CPU frequency selection to achieve that goal, I will
> > do that, but if that can stay the way it is, that's fine too.
> 
> Our primary goal has been simply to achieve functional scheduler-driven
> CPU frequency control with equivalent or better power and performance
> than what is available today. Reduction of cpufreq overhead fits within
> this goal (and may be required) so no conflict here.

Good.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH v3 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-24  1:28   ` [RFC/RFT][PATCH v2 2/2] cpufreq: schedutil: Switching frequencies from interrupt context Rafael J. Wysocki
@ 2016-02-24 23:30     ` Rafael J. Wysocki
  2016-02-25  9:08       ` Peter Zijlstra
  2016-02-25 21:20     ` [RFC/RFT][PATCH v4 " Rafael J. Wysocki
  1 sibling, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-24 23:30 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
and the schedutil governor to use that method if available.

Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context via a
new helper function, cpufreq_driver_fast_switch().

Modify the schedutil governor to call cpufreq_driver_fast_switch()
from its sugov_update_commit() function and avoid queuing up the
irq_work if that is successful.

Implement the ->fast_switch callback in the ACPI cpufreq driver
(with limited coverage for the time being).

In addition to the above, cpufreq_governor_limits() is modified so
that it doesn't call __cpufreq_driver_target() to enforce the new
limits immediately if the fast_switch_enabled flag is set for the
policy, because in that case the frequency will be updated
immediately using the new limits anyway and the additional
invocation of __cpufreq_driver_target() might race with that,
violating the cpufreq_driver_fast_switch() requirements.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

An update here.

I wasn't sure about the changes in cpufreq_governor_limits() and I finally
concluded that for the case when "fast switching" was not used, that function
should still set the frequency within the limits directly (because it may
take some time for a regular update to happen even though the sample delay
is reset in there).  OTOH, if "fast switching" is used, a regular update is
pretty much guaranteed to happen immediately, so the direct frequency setting
should not be necessary in that case.

That observation led to some rework all over (except for acpi_cpufreq_fast_switch()
and cpu_freq_fast_write_intel(), which are the same as before).

Of course, the patch still is a prototype. :-)

Thanks,
Rafael

---
 drivers/cpufreq/acpi-cpufreq.c      |   60 ++++++++++++++++++++++++++++++++++++
 drivers/cpufreq/cpufreq.c           |   31 ++++++++++++++++++
 drivers/cpufreq/cpufreq_governor.c  |   13 ++++---
 drivers/cpufreq/cpufreq_governor.h  |    1 
 drivers/cpufreq/cpufreq_schedutil.c |   22 ++++++++++---
 include/linux/cpufreq.h             |    5 +++
 6 files changed, 122 insertions(+), 10 deletions(-)

Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -70,6 +70,7 @@ struct acpi_cpufreq_data {
 	unsigned int cpu_feature;
 	unsigned int acpi_perf_cpu;
 	cpumask_var_t freqdomain_cpus;
+	void (*cpu_freq_fast_write)(u32 val);
 };
 
 /* acpi_perf_data is a pointer to percpu data. */
@@ -243,6 +244,15 @@ static unsigned extract_freq(u32 val, st
 	}
 }
 
+void cpu_freq_fast_write_intel(u32 val)
+{
+	u32 lo, hi;
+
+	rdmsr(MSR_IA32_PERF_CTL, lo, hi);
+	lo = (lo & ~INTEL_MSR_RANGE) | (val & INTEL_MSR_RANGE);
+	wrmsr(MSR_IA32_PERF_CTL, lo, hi);
+}
+
 struct msr_addr {
 	u32 reg;
 };
@@ -484,6 +494,50 @@ out:
 	return result;
 }
 
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+				      unsigned int target_freq)
+{
+	struct acpi_cpufreq_data *data = policy->driver_data;
+	struct cpufreq_frequency_table *entry;
+	struct acpi_processor_performance *perf;
+	unsigned int uninitialized_var(next_perf_state);
+	unsigned int uninitialized_var(next_freq);
+	unsigned int best_diff;
+
+	for (entry = data->freq_table, best_diff = UINT_MAX;
+	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
+		unsigned int diff, freq = entry->frequency;
+
+		if (freq == CPUFREQ_ENTRY_INVALID)
+			continue;
+
+		diff = abs(freq - target_freq);
+		if (diff >= best_diff)
+			continue;
+
+		best_diff = diff;
+		next_perf_state = entry->driver_data;
+		next_freq = freq;
+		if (best_diff == 0)
+			goto found;
+	}
+	if (best_diff == UINT_MAX)
+		return CPUFREQ_ENTRY_INVALID;
+
+ found:
+	perf = to_perf_data(data);
+	if (perf->state == next_perf_state) {
+		if (unlikely(data->resume))
+			data->resume = 0;
+		else
+			return next_freq;
+	}
+
+	data->cpu_freq_fast_write(perf->states[next_perf_state].control);
+	perf->state = next_perf_state;
+	return next_freq;
+}
+
 static unsigned long
 acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
 {
@@ -745,6 +799,7 @@ static int acpi_cpufreq_cpu_init(struct
 		pr_debug("HARDWARE addr space\n");
 		if (check_est_cpu(cpu)) {
 			data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE;
+			data->cpu_freq_fast_write = cpu_freq_fast_write_intel;
 			break;
 		}
 		if (check_amd_hwpstate_cpu(cpu)) {
@@ -760,6 +815,10 @@ static int acpi_cpufreq_cpu_init(struct
 		goto err_unreg;
 	}
 
+	policy->fast_switch_possible = data->cpu_freq_fast_write != NULL &&
+		!acpi_pstate_strict && !(policy_is_shared(policy) &&
+			policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
 	data->freq_table = kzalloc(sizeof(*data->freq_table) *
 		    (perf->state_count+1), GFP_KERNEL);
 	if (!data->freq_table) {
@@ -894,6 +953,7 @@ static struct freq_attr *acpi_cpufreq_at
 static struct cpufreq_driver acpi_cpufreq_driver = {
 	.verify		= cpufreq_generic_frequency_table_verify,
 	.target_index	= acpi_cpufreq_target,
+	.fast_switch	= acpi_cpufreq_fast_switch,
 	.bios_limit	= acpi_processor_get_bios_limit,
 	.init		= acpi_cpufreq_cpu_init,
 	.exit		= acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -82,6 +82,7 @@ struct cpufreq_policy {
 	void			*governor_data;
 	bool			governor_enabled; /* governor start/stop flag */
 	char			last_governor[CPUFREQ_NAME_LEN]; /* last governor used */
+	bool			fast_switch_possible;
 
 	struct work_struct	update; /* if update_policy() needs to be
 					 * called, but you're in IRQ context */
@@ -271,6 +272,8 @@ struct cpufreq_driver {
 				  unsigned int relation);	/* Deprecated */
 	int		(*target_index)(struct cpufreq_policy *policy,
 					unsigned int index);
+	unsigned int	(*fast_switch)(struct cpufreq_policy *policy,
+				       unsigned int target_freq);
 	/*
 	 * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
 	 * unset.
@@ -485,6 +488,8 @@ struct cpufreq_governor {
 };
 
 /* Pass a target to the cpufreq driver */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+				unsigned int target_freq);
 int cpufreq_driver_target(struct cpufreq_policy *policy,
 				 unsigned int target_freq,
 				 unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -1814,6 +1814,37 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
  *                              GOVERNORS                            *
  *********************************************************************/
 
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * This function must not be called if policy->fast_switch_possible is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback, the hardware configuration must be preserved.
+ */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+				unsigned int target_freq)
+{
+	unsigned int freq;
+
+	if (target_freq == policy->cur)
+		return;
+
+	freq = cpufreq_driver->fast_switch(policy, target_freq);
+	if (freq != CPUFREQ_ENTRY_INVALID) {
+		policy->cur = freq;
+		trace_cpu_frequency(freq, smp_processor_id());
+	}
+}
+
 /* Must set freqs->new to intermediate frequency */
 static int __target_intermediate(struct cpufreq_policy *policy,
 				 struct cpufreq_freqs *freqs, int index)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -613,15 +613,18 @@ static int cpufreq_governor_limits(struc
 
 	mutex_lock(&policy_dbs->timer_mutex);
 
-	if (policy->max < policy->cur)
-		__cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
-	else if (policy->min > policy->cur)
-		__cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_L);
+	if (!policy_dbs->fast_switch_enabled) {
+		if (policy->max < policy->cur)
+			__cpufreq_driver_target(policy, policy->max,
+						CPUFREQ_RELATION_H);
+		else if (policy->min > policy->cur)
+			__cpufreq_driver_target(policy, policy->min,
+						CPUFREQ_RELATION_L);
+	}
 
 	gov_update_sample_delay(policy_dbs, 0);
 
 	mutex_unlock(&policy_dbs->timer_mutex);
-
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_schedutil.c
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -83,12 +83,23 @@ static unsigned int sugov_next_freq(stru
 static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
 				unsigned int next_freq)
 {
-	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
-
-	sg_policy->next_freq = next_freq;
 	policy_dbs->last_sample_time = time;
-	policy_dbs->work_in_progress = true;
-	irq_work_queue(&policy_dbs->irq_work);
+
+	if (policy_dbs->fast_switch_enabled) {
+		cpufreq_driver_fast_switch(policy_dbs->policy, next_freq);
+		/*
+		 * Restore the sample delay in case it has been set to 0 from
+		 * sysfs in the meantime.
+		 */
+		gov_update_sample_delay(policy_dbs,
+					policy_dbs->dbs_data->sampling_rate);
+	} else {
+		struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+		sg_policy->next_freq = next_freq;
+		policy_dbs->work_in_progress = true;
+		irq_work_queue(&policy_dbs->irq_work);
+	}
 }
 
 static void sugov_update_shared(struct update_util_data *data, u64 time,
@@ -188,6 +199,7 @@ static bool sugov_start(struct cpufreq_p
 
 	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
 	policy_dbs->last_sample_time = 0;
+	policy_dbs->fast_switch_enabled = policy->fast_switch_possible;
 
 	for_each_cpu(cpu, policy->cpus) {
 		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -124,6 +124,7 @@ struct policy_dbs_info {
 	/* Status indicators */
 	bool is_shared;		/* This object is used by multiple CPUs */
 	bool work_in_progress;	/* Work is being queued up or in progress */
+	bool fast_switch_enabled;	/* Switch frequencies from interrupt context */
 };
 
 static inline void gov_update_sample_delay(struct policy_dbs_info *policy_dbs,

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v3 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-24 23:30     ` [RFC/RFT][PATCH v3 " Rafael J. Wysocki
@ 2016-02-25  9:08       ` Peter Zijlstra
  2016-02-25  9:12         ` Peter Zijlstra
  2016-02-25 11:10         ` Rafael J. Wysocki
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2016-02-25  9:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Steve Muckle, Ingo Molnar

On Thu, Feb 25, 2016 at 12:30:43AM +0100, Rafael J. Wysocki wrote:
> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> +				      unsigned int target_freq)
> +{
> +	struct acpi_cpufreq_data *data = policy->driver_data;
> +	struct cpufreq_frequency_table *entry;
> +	struct acpi_processor_performance *perf;
> +	unsigned int uninitialized_var(next_perf_state);
> +	unsigned int uninitialized_var(next_freq);
> +	unsigned int best_diff;
> +
> +	for (entry = data->freq_table, best_diff = UINT_MAX;
> +	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
> +		unsigned int diff, freq = entry->frequency;
> +
> +		if (freq == CPUFREQ_ENTRY_INVALID)
> +			continue;
> +
> +		diff = abs(freq - target_freq);

Why would you consider frequencies that are below where you want to be?

> +		if (diff >= best_diff)
> +			continue;
> +
> +		best_diff = diff;
> +		next_perf_state = entry->driver_data;
> +		next_freq = freq;
> +		if (best_diff == 0)
> +			goto found;
> +	}
> +	if (best_diff == UINT_MAX)
> +		return CPUFREQ_ENTRY_INVALID;
> +
> + found:
> +	perf = to_perf_data(data);
> +	if (perf->state == next_perf_state) {
> +		if (unlikely(data->resume))
> +			data->resume = 0;
> +		else
> +			return next_freq;
> +	}
> +
> +	data->cpu_freq_fast_write(perf->states[next_perf_state].control);
> +	perf->state = next_perf_state;
> +	return next_freq;
> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v3 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-25  9:08       ` Peter Zijlstra
@ 2016-02-25  9:12         ` Peter Zijlstra
  2016-02-25 11:11           ` Rafael J. Wysocki
  2016-02-25 11:10         ` Rafael J. Wysocki
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2016-02-25  9:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Steve Muckle, Ingo Molnar

On Thu, Feb 25, 2016 at 10:08:40AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 12:30:43AM +0100, Rafael J. Wysocki wrote:
> > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > +				      unsigned int target_freq)
> > +{
> > +	struct acpi_cpufreq_data *data = policy->driver_data;
> > +	struct cpufreq_frequency_table *entry;
> > +	struct acpi_processor_performance *perf;
> > +	unsigned int uninitialized_var(next_perf_state);
> > +	unsigned int uninitialized_var(next_freq);
> > +	unsigned int best_diff;
> > +
> > +	for (entry = data->freq_table, best_diff = UINT_MAX;
> > +	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
> > +		unsigned int diff, freq = entry->frequency;
> > +
> > +		if (freq == CPUFREQ_ENTRY_INVALID)
> > +			continue;
> > +
> > +		diff = abs(freq - target_freq);
> 
> Why would you consider frequencies that are below where you want to be?

Also, if you look for the first largest freq (which would make most
sense I think) and this table is sorted, you can do a binary search.

Then again, not sure the table is big enough to make that worth it.

> > +		if (diff >= best_diff)
> > +			continue;
> > +
> > +		best_diff = diff;
> > +		next_perf_state = entry->driver_data;
> > +		next_freq = freq;
> > +		if (best_diff == 0)
> > +			goto found;
> > +	}
> > +	if (best_diff == UINT_MAX)
> > +		return CPUFREQ_ENTRY_INVALID;
> > +
> > + found:
> > +	perf = to_perf_data(data);
> > +	if (perf->state == next_perf_state) {
> > +		if (unlikely(data->resume))
> > +			data->resume = 0;
> > +		else
> > +			return next_freq;
> > +	}
> > +
> > +	data->cpu_freq_fast_write(perf->states[next_perf_state].control);
> > +	perf->state = next_perf_state;
> > +	return next_freq;
> > +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-22 23:02     ` Rafael J. Wysocki
  2016-02-23  7:20       ` Steve Muckle
@ 2016-02-25 11:01       ` Juri Lelli
  2016-02-26  2:36         ` Rafael J. Wysocki
  1 sibling, 1 reply; 32+ messages in thread
From: Juri Lelli @ 2016-02-25 11:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Steve Muckle,
	Ingo Molnar

On 23/02/16 00:02, Rafael J. Wysocki wrote:
> On Mon, Feb 22, 2016 at 3:16 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi Rafael,
> 
> Hi,
> 

Sorry, my reply to this got delayed a bit.

> > thanks for this RFC. I'm going to test it more in the next few days, but
> > I already have some questions from skimming through it. Please find them
> > inline below.
> >
> > On 22/02/16 00:18, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> Add a new cpufreq scaling governor, called "schedutil", that uses
> >> scheduler-provided CPU utilization information as input for making
> >> its decisions.
> >>
> >
> > I guess the first (macro) question is why did you decide to go with a
> > complete new governor, where new here is w.r.t. the sched-freq solution.
> 
> Probably the most comprehensive answer to this question is my intro
> message: http://marc.info/?l=linux-pm&m=145609673008122&w=2
> 
> The executive summary is probably that this was the most
> straightforward way to use the scheduler-provided numbers in cpufreq
> that I could think about.
> 
> > AFAICT, it is true that your solution directly builds on top of the
> > latest changes to cpufreq core and governor, but it also seems to have
> > more than a few points in common with sched-freq,
> 
> That surely isn't a drawback, is it?
> 

Not at all. I guess that I was simply wondering why you felt that a new
approach was required. But you explain this below. :-)

> If two people come to the same conclusions in different ways, that's
> an indication that the conclusions may actually be correct.
> 
> > and sched-freq has been discussed and evaluated for already quite some time.
> 
> Yes, it has.
> 
> Does this mean that no one is allowed to try any alternatives to it now?
>

Of course not. I'm mostly in line with what Steve replied here. But yes,
I think that we can only gain better understanding by reviewing both
RFCs.

> > Also, it appears to me that they both shares (or they might encounter in the
> > future as development progresses) the same kind of problems, like for
> > example the fact that we can't trigger opp changes from scheduler
> > context ATM.
> 
> "Give them a finger and they will ask for the hand."
> 

I'm sorry if you felt that I was asking too much from an RFC. I wasn't,
in fact; what I wanted to say was that the two alternatives seemed to
share the same kind of problems. Well, now it seems that you have
already proposed a solution for one of them. :-)

> If you read my intro message linked above, you'll find a paragraph or
> two about that in it.
> 
> And the short summary is that I have a plan to actually implement that
> feature in the schedutil governor at least for the ACPI cpufreq
> driver.  It shouldn't be too difficult to do either AFAICS.  So it is
> not "we can't", but rather "we haven't implemented that yet" in this
> particular case.
> 
> I may not be able to do that in the next few days, as I have other
> things to do too, but you may expect to see that done at one point.
> 
> So it's not a fundamental issue or anything, it's just that I haven't
> done that *yet* at this point, OK?
> 

Sure. I saw what you are proposing to solve this. I'll reply to that
patch if I'll have any comments.

> > Don't get me wrong. I think that looking at different ways to solve a
> > problem is always beneficial, since I guess that the goal in the end is
> > to come up with something that suits everybody's needs.
> 
> Precisely.
> 
> > I was only curious about your thoughts on sched-freq. But we can also wait for the
> > next RFC from Steve for this macro question. :-)
> 
> Right, but I have some thoughts anyway.
> 
> My goal, that may be quite different from yours, is to reduce the
> cpufreq's overhead as much as I possibly can.  If I have to change the
> way it drives the CPU frequency selection to achieve that goal, I will
> do that, but if that can stay the way it is, that's fine too.
> 

As Steve already said, this was not our primary goal. But it is for sure
beneficial for everybody.

> Some progress has been made already here: we have dealt with the
> timers for good now I think.
> 
> This patch deals with the overhead associated with the load tracking
> carried by "traditional" cpufreq governors and with a couple of
> questionable things done by "ondemand" in addition to that (which is
> one of the reasons why I didn't want to modify "ondemand" itself for
> now).
> 
> The next step will be to teach the governor and the ACPI driver to
> switch CPU frequencies in the scheduler context, without spawning
> extra work items etc.
> 
> Finally, the sampling should go away and that's where I want it to be.
> 
> I just don't want to run extra code when that's not necessary and I
> want things to stay simple when that's as good as it can get.  If
> sched-freq can pull that off for me, that's fine, but can it really?
> 
> > [...]
> >
> >> +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
> >> +                             unsigned int next_freq)
> >> +{
> >> +     struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> >> +
> >> +     sg_policy->next_freq = next_freq;
> >> +     policy_dbs->last_sample_time = time;
> >> +     policy_dbs->work_in_progress = true;
> >> +     irq_work_queue(&policy_dbs->irq_work);
> >
> > Here we basically use the system wq to be able to do the freq transition
> > in process context. CFS is probably fine with this, but don't you think
> > we might get into troubles when, in the future, we will want to service
> > RT/DL requests more properly and they will end up being serviced
> > together with all the others wq users and at !RT priority?
> 
> That may be regarded as a problem, but I'm not sure why you're talking
> about it in the context of this particular patch.  That problem has
> been there forever in cpufreq: in theory RT tasks may stall frequency
> changes indefinitely.
> 
> Is the problem real, though?
> 
> Suppose that that actually happens and there are RT tasks effectively
> stalling frequency updates.  In that case some other important
> activities of the kernel would be stalled too.  Pretty much everything
> run out of regular workqueues would be affected.
> 
> OK, so do we need to address that somehow?  Maybe, but then we should
> take care of all of the potentially affected stuff and not about the
> frequency changes only, shouldn't we?
> 

Ideally yes, I'd say. But since we are at it with regard to frequency
changes, I thought we could spend some more time understanding whether
we can achieve something better than what we have today.

> Moreover, I don't really think that having a separate RT process for
> every CPU in the system just for this purpose is the right approach.
> It's just way overkill IMO and doesn't cover the other potentially
> affected stuff at all.
> 
> >> +}
> >> +
> >> +static void sugov_update_shared(struct update_util_data *data, u64 time,
> >> +                             unsigned long util, unsigned long max)
> >> +{
> >
> > We don't have a way to tell from which scheduling class this is coming
> > from, do we? And if that is true can't a request from CFS overwrite
> > RT/DL go to max requests?
> 
> Yes, it can, but we get updated when the CPU is updating its own rq
> only, so if my understanding of things is correct, an update from a
> different sched class means that the given class is in charge now.
> 

That is right. But can't a higher-priority class eat all of the needed
capacity? I mean, suppose that both CFS and DL need 30% of CPU capacity
on the same CPU. DL wins and gets its 30% of capacity. When CFS gets to
run, it's too late to request anything more (w.r.t. the same time
window). If we somehow aggregated requests instead, we could request 60%
and both classes would have the capacity they need to run. It seems to me
that this is what governors were already doing by using the 1 - idle metric.

Best,

- Juri

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v3 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-25  9:08       ` Peter Zijlstra
  2016-02-25  9:12         ` Peter Zijlstra
@ 2016-02-25 11:10         ` Rafael J. Wysocki
  2016-02-25 11:52           ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 11:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Steve Muckle, Ingo Molnar

On Thursday, February 25, 2016 10:08:40 AM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 12:30:43AM +0100, Rafael J. Wysocki wrote:
> > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > +				      unsigned int target_freq)
> > +{
> > +	struct acpi_cpufreq_data *data = policy->driver_data;
> > +	struct cpufreq_frequency_table *entry;
> > +	struct acpi_processor_performance *perf;
> > +	unsigned int uninitialized_var(next_perf_state);
> > +	unsigned int uninitialized_var(next_freq);
> > +	unsigned int best_diff;
> > +
> > +	for (entry = data->freq_table, best_diff = UINT_MAX;
> > +	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
> > +		unsigned int diff, freq = entry->frequency;
> > +
> > +		if (freq == CPUFREQ_ENTRY_INVALID)
> > +			continue;
> > +
> > +		diff = abs(freq - target_freq);
> 
> Why would you consider frequencies that are below where you want to be?

Say you have 800 MHz and 1600 MHz to choose from and the request is for
900 MHz.  The other may be way off (and at a different voltage for that matter).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v3 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-25  9:12         ` Peter Zijlstra
@ 2016-02-25 11:11           ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Steve Muckle, Ingo Molnar

On Thursday, February 25, 2016 10:12:49 AM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 10:08:40AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 25, 2016 at 12:30:43AM +0100, Rafael J. Wysocki wrote:
> > > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > > +				      unsigned int target_freq)
> > > +{
> > > +	struct acpi_cpufreq_data *data = policy->driver_data;
> > > +	struct cpufreq_frequency_table *entry;
> > > +	struct acpi_processor_performance *perf;
> > > +	unsigned int uninitialized_var(next_perf_state);
> > > +	unsigned int uninitialized_var(next_freq);
> > > +	unsigned int best_diff;
> > > +
> > > +	for (entry = data->freq_table, best_diff = UINT_MAX;
> > > +	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
> > > +		unsigned int diff, freq = entry->frequency;
> > > +
> > > +		if (freq == CPUFREQ_ENTRY_INVALID)
> > > +			continue;
> > > +
> > > +		diff = abs(freq - target_freq);
> > 
> > Why would you consider frequencies that are below where you want to be?
> 
> Also, if you look for the first largest freq (which would make most
> sense I think) and this table is sorted, you can do a binary search.
> 
> Then again, not sure the table is big enough to make that worth it.

Yeah, that was my thought too.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v3 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-25 11:10         ` Rafael J. Wysocki
@ 2016-02-25 11:52           ` Peter Zijlstra
  2016-02-25 20:54             ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2016-02-25 11:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Steve Muckle, Ingo Molnar

On Thu, Feb 25, 2016 at 12:10:48PM +0100, Rafael J. Wysocki wrote:
> On Thursday, February 25, 2016 10:08:40 AM Peter Zijlstra wrote:
> > On Thu, Feb 25, 2016 at 12:30:43AM +0100, Rafael J. Wysocki wrote:
> > > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > > +				      unsigned int target_freq)
> > > +{
> > > +	struct acpi_cpufreq_data *data = policy->driver_data;
> > > +	struct cpufreq_frequency_table *entry;
> > > +	struct acpi_processor_performance *perf;
> > > +	unsigned int uninitialized_var(next_perf_state);
> > > +	unsigned int uninitialized_var(next_freq);
> > > +	unsigned int best_diff;
> > > +
> > > +	for (entry = data->freq_table, best_diff = UINT_MAX;
> > > +	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
> > > +		unsigned int diff, freq = entry->frequency;
> > > +
> > > +		if (freq == CPUFREQ_ENTRY_INVALID)
> > > +			continue;
> > > +
> > > +		diff = abs(freq - target_freq);
> > 
> > Why would you consider frequencies that are below where you want to be?
> 
> Say you have 800 MHz and 1600 MHz to choose from and the request is for
> 900 MHz.  The other may be way off (and at a different voltage for that matter).

Are there really chips with such crappy choices? That said, for some
scenarios you really do have to pick 1600 because otherwise the work
will not be able to complete in time and the whole purpose of the
machine is moot.

That argues for more than a target frequency argument.

Furthermore, depending on the idle capabilities of the platform, 1600
might still be the better choice, it gives idle time in which it could
power gate the complete thing, still yielding better perf/watt than 100%
pegged at 800.

So I'm not at all sure the nearest freq is a sane general policy.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v3 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-25 11:52           ` Peter Zijlstra
@ 2016-02-25 20:54             ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 20:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Steve Muckle, Ingo Molnar

On Thursday, February 25, 2016 12:52:34 PM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 12:10:48PM +0100, Rafael J. Wysocki wrote:
> > On Thursday, February 25, 2016 10:08:40 AM Peter Zijlstra wrote:
> > > On Thu, Feb 25, 2016 at 12:30:43AM +0100, Rafael J. Wysocki wrote:
> > > > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > > > +				      unsigned int target_freq)
> > > > +{
> > > > +	struct acpi_cpufreq_data *data = policy->driver_data;
> > > > +	struct cpufreq_frequency_table *entry;
> > > > +	struct acpi_processor_performance *perf;
> > > > +	unsigned int uninitialized_var(next_perf_state);
> > > > +	unsigned int uninitialized_var(next_freq);
> > > > +	unsigned int best_diff;
> > > > +
> > > > +	for (entry = data->freq_table, best_diff = UINT_MAX;
> > > > +	     entry->frequency != CPUFREQ_TABLE_END; entry++) {
> > > > +		unsigned int diff, freq = entry->frequency;
> > > > +
> > > > +		if (freq == CPUFREQ_ENTRY_INVALID)
> > > > +			continue;
> > > > +
> > > > +		diff = abs(freq - target_freq);
> > > 
> > > Why would you consider frequencies that are below where you want to be?
> > 
> > Say you have 800 MHz and 1600 MHz to choose from and the request is for
> > 900 MHz.  The other may be way off (and at a different voltage for that matter).
> 
> Are there really chips with such crappy choices?

One of my test boxes has three: 1330, 1060 and 800 MHz.

> That said, for some scenarios you really do have to pick 1600 because
> otherwise the work will not be able to complete in time and the whole
> purpose of the machine is moot.
> 
> That argues for more than a target frequency argument.
> 
> Furthermore, depending on the idle capabilities of the platform, 1600
> might still be the better choice, it gives idle time in which it could
> power gate the complete thing, still yielding better perf/watt than 100%
> pegged at 800.
> 
> So I'm not at all sure the nearest freq is a sane general policy.

OK

I thought that changing it to selecting the closest frequency above the
target would make us practically avoid the min, because then we would only
get to it if we were asked for it explicitly.  Nevertheless, I thought I would
try it anyway, so I ran that algo on the test box mentioned above.  And pretty
much as expected I ended up with marginal residency in the 800 MHz state
(below 1%) even if the system was very lightly loaded and the vast majority
of time was spent in the 1060 MHz one. So that wasn't a desirable outcome.

Then, I changed the algorithm to look for the closest frequencies above (f_up)
and below (f_down) the target and then choose f_down if

	f_down + ((f_up - f_down) / 4) > target

and f_up otherwise.  That still skews the choice towards higher frequencies,
but not as much and the 800 MHz residency on the almost idle system has mostly
come back with that modification.

So I'll send an update of the $subject patch with that implemented in case
someone wants to see how it goes.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-02-24  1:22   ` [RFC/RFT][PATCH v2 1/2] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
@ 2016-02-25 21:14     ` Rafael J. Wysocki
  2016-02-27  0:21       ` Rafael J. Wysocki
  2016-02-27  4:33       ` Steve Muckle
  0 siblings, 2 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 21:14 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is very simple.  It is almost as simple as it
can be and remain reasonably functional.

The frequency selection formula used by it is essentially the same
as the one used by the "ondemand" governor, although it doesn't use
the additional up_threshold parameter.  However, instead of computing
the load as the "non-idle CPU time" to "total CPU time" ratio, it takes
the utilization data provided by CFS as input.  More specifically,
it represents "load" as the util/max ratio, where util and max
are the utilization and CPU capacity coming from CFS.

All of the computations are carried out in the utilization update
handlers provided by the new governor.  One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).  The only operation carried
out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
is a __cpufreq_driver_target() call to trigger a frequency update (to
a value already computed beforehand in one of the utilization update
handlers).  This means that, at least for some cpufreq drivers that
can update CPU frequency by doing simple register writes, it should
be possible to set the frequency in the utilization update handlers
too in which case all of the governor's activity would take place in
the scheduler paths invoking cpufreq_update_util() without the need
to run anything in process context.

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes.  That
heavy-handed approach should be replaced with something more
specifically targeted at RT and DL tasks.

To some extent it relies on the common governor code in
cpufreq_governor.c and it uses that code in a somewhat unusual
way (different from what the "ondemand" and "conservative"
governors do), so some small and rather unintrusive changes
have to be made in that code and the other governors to support it.

However, after making it possible to set the CPU frequency from
the utilization update handlers, that new governor's interactions
with the common code might be limited to the initialization, cleanup
and handling of sysfs attributes (currently only one attribute,
sampling_rate, is supported in addition to the standard policy
attributes handled by the cpufreq core).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.

Changes from v2:
- Avoid requesting the same frequency that was requested last time for
  the given policy.

Changes from v1:
- Use policy->min and policy->max as min/max frequency in computations.

---
 drivers/cpufreq/Kconfig                |   15 +
 drivers/cpufreq/Makefile               |    1 
 drivers/cpufreq/cpufreq_conservative.c |    3 
 drivers/cpufreq/cpufreq_governor.c     |   21 +-
 drivers/cpufreq/cpufreq_governor.h     |    2 
 drivers/cpufreq/cpufreq_ondemand.c     |    3 
 drivers/cpufreq/cpufreq_schedutil.c    |  253 +++++++++++++++++++++++++++++++++
 7 files changed, 288 insertions(+), 10 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -164,7 +164,7 @@ struct dbs_governor {
 	void (*free)(struct policy_dbs_info *policy_dbs);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
-	void (*start)(struct cpufreq_policy *policy);
+	bool (*start)(struct cpufreq_policy *policy);
 };
 
 static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,253 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/percpu-defs.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_policy {
+	struct policy_dbs_info policy_dbs;
+	unsigned int next_freq;
+	raw_spinlock_t update_lock;  /* For shared policies */
+};
+
+static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
+{
+	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
+}
+
+struct sugov_cpu {
+	struct update_util_data update_util;
+	struct policy_dbs_info *policy_dbs;
+	/* The fields below are only needed when sharing a policy. */
+	unsigned long util;
+	unsigned long max;
+	u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
+				    unsigned long util, unsigned long max,
+				    u64 last_sample_time)
+{
+	struct cpufreq_policy *policy = policy_dbs->policy;
+	unsigned int min_f = policy->min;
+	unsigned int max_f = policy->max;
+	unsigned int j;
+
+	if (util > max || min_f >= max_f)
+		return max_f;
+
+	for_each_cpu(j, policy->cpus) {
+		struct sugov_cpu *j_sg_cpu;
+		unsigned long j_util, j_max;
+
+		if (j == smp_processor_id())
+			continue;
+
+		j_sg_cpu = &per_cpu(sugov_cpu, j);
+		/*
+		 * If the CPU was last updated before the previous sampling
+		 * time, we don't take it into account as it probably is idle
+		 * now.
+		 */
+		if (j_sg_cpu->last_update < last_sample_time)
+			continue;
+
+		j_util = j_sg_cpu->util;
+		j_max = j_sg_cpu->max;
+		if (j_util > j_max)
+			return max_f;
+
+		if (j_util * max > j_max * util) {
+			util = j_util;
+			max = j_max;
+		}
+	}
+
+	return min_f + util * (max_f - min_f) / max;
+}
+
+static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
+				unsigned int next_freq)
+{
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	policy_dbs->last_sample_time = time;
+	if (sg_policy->next_freq != next_freq) {
+		sg_policy->next_freq = next_freq;
+		policy_dbs->work_in_progress = true;
+		irq_work_queue(&policy_dbs->irq_work);
+	}
+}
+
+static void sugov_update_shared(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+	unsigned int next_f;
+	u64 delta_ns, lst;
+
+	raw_spin_lock(&sg_policy->update_lock);
+
+	sg_cpu->util = util;
+	sg_cpu->max = max;
+	sg_cpu->last_update = time;
+
+	if (policy_dbs->work_in_progress)
+		goto out;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	lst = policy_dbs->last_sample_time;
+	delta_ns = time - lst;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		goto out;
+
+	next_f = sugov_next_freq(policy_dbs, util, max, lst);
+
+	sugov_update_commit(policy_dbs, time, next_f);
+
+ out:
+	raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_update_single(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	unsigned int min_f, max_f, next_f;
+	u64 delta_ns;
+
+	if (policy_dbs->work_in_progress)
+		return;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	delta_ns = time - policy_dbs->last_sample_time;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		return;
+
+	min_f = policy_dbs->policy->min;
+	max_f = policy_dbs->policy->max;
+	next_f = util > max || min_f >= max_f ? max_f :
+			min_f + util * (max_f - min_f) / max;
+
+	sugov_update_commit(policy_dbs, time, next_f);
+}
+
+/************************** sysfs interface ************************/
+
+gov_show_one_common(sampling_rate);
+
+gov_attr_rw(sampling_rate);
+
+static struct attribute *sugov_attributes[] = {
+	&sampling_rate.attr,
+	NULL
+};
+
+/********************** dbs_governor interface *********************/
+
+static struct policy_dbs_info *sugov_alloc(void)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+	if (!sg_policy)
+		return NULL;
+
+	raw_spin_lock_init(&sg_policy->update_lock);
+	return &sg_policy->policy_dbs;
+}
+
+static void sugov_free(struct policy_dbs_info *policy_dbs)
+{
+	kfree(to_sg_policy(policy_dbs));
+}
+
+static bool sugov_start(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+	unsigned int cpu;
+
+	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
+	policy_dbs->last_sample_time = 0;
+	sg_policy->next_freq = UINT_MAX;
+
+	for_each_cpu(cpu, policy->cpus) {
+		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+		sg_cpu->policy_dbs = policy_dbs;
+		if (policy_dbs->is_shared) {
+			sg_cpu->util = ULONG_MAX;
+			sg_cpu->max = 0;
+			sg_cpu->last_update = 0;
+			sg_cpu->update_util.func = sugov_update_shared;
+		} else {
+			sg_cpu->update_util.func = sugov_update_single;
+		}
+		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
+	}
+	return false;
+}
+
+static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
+	return policy_dbs->dbs_data->sampling_rate;
+}
+
+static struct dbs_governor schedutil_gov = {
+	.gov = {
+		.name = "schedutil",
+		.governor = cpufreq_governor_dbs,
+		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
+		.owner = THIS_MODULE,
+	},
+	.kobj_type = { .default_attrs = sugov_attributes },
+	.gov_dbs_timer = sugov_set_freq,
+	.alloc = sugov_alloc,
+	.free = sugov_free,
+	.start = sugov_start,
+};
+
+#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
+
+static int __init sugov_init(void)
+{
+	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+static void __exit sugov_exit(void)
+{
+	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+module_init(sugov_init);
+module_exit(sugov_exit);
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -299,12 +299,13 @@ static void cs_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void cs_start(struct cpufreq_policy *policy)
+static bool cs_start(struct cpufreq_policy *policy)
 {
 	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->down_skip = 0;
 	dbs_info->requested_freq = policy->cur;
+	return true;
 }
 
 static struct dbs_governor cs_dbs_gov = {
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -472,9 +472,11 @@ static int cpufreq_governor_init(struct
 	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
 	mutex_init(&dbs_data->mutex);
 
-	ret = gov->init(dbs_data, !policy->governor->initialized);
-	if (ret)
-		goto free_policy_dbs_info;
+	if (gov->init) {
+		ret = gov->init(dbs_data, !policy->governor->initialized);
+		if (ret)
+			goto free_policy_dbs_info;
+	}
 
 	/* policy latency is in ns. Convert it to us first */
 	latency = policy->cpuinfo.transition_latency / 1000;
@@ -510,7 +512,10 @@ static int cpufreq_governor_init(struct
 
 	if (!have_governor_per_policy())
 		gov->gdbs_data = NULL;
-	gov->exit(dbs_data, !policy->governor->initialized);
+
+	if (gov->exit)
+		gov->exit(dbs_data, !policy->governor->initialized);
+
 	kfree(dbs_data);
 
 free_policy_dbs_info:
@@ -544,7 +549,9 @@ static int cpufreq_governor_exit(struct
 		if (!have_governor_per_policy())
 			gov->gdbs_data = NULL;
 
-		gov->exit(dbs_data, policy->governor->initialized == 1);
+		if (gov->exit)
+			gov->exit(dbs_data, policy->governor->initialized == 1);
+
 		mutex_destroy(&dbs_data->mutex);
 		kfree(dbs_data);
 	} else {
@@ -588,9 +595,9 @@ static int cpufreq_governor_start(struct
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 	}
 
-	gov->start(policy);
+	if (gov->start(policy))
+		gov_set_update_util(policy_dbs, sampling_rate);
 
-	gov_set_update_util(policy_dbs, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -402,12 +402,13 @@ static void od_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void od_start(struct cpufreq_policy *policy)
+static bool od_start(struct cpufreq_policy *policy)
 {
 	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	ondemand_powersave_bias_init(policy);
+	return true;
 }
 
 static struct od_ops od_ops = {
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -184,6 +184,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHEDUTIL
+	tristate "'schedutil' cpufreq policy governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  The frequency selection formula used by this governor is analogous
+	  to the one used by 'ondemand', but instead of computing CPU load
+	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+	  utilization data provided by the scheduler as input.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_schedutil.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
 obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
 obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
 obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
 obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
 
 obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH v4 2/2] cpufreq: schedutil: Switching frequencies from interrupt context
  2016-02-24  1:28   ` [RFC/RFT][PATCH v2 2/2] cpufreq: schedutil: Switching frequencies from interrupt context Rafael J. Wysocki
  2016-02-24 23:30     ` [RFC/RFT][PATCH v3 " Rafael J. Wysocki
@ 2016-02-25 21:20     ` Rafael J. Wysocki
  1 sibling, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 21:20 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpufreq: schedutil: Switching frequencies from interrupt context

Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
and the schedutil governor to use that method if available.

Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context via
new helper function cpufreq_driver_fast_switch().

Modify the schedutil governor to call cpufreq_driver_fast_switch()
from its sugov_update_commit() function and avoid queuing up the
irq_work if that is successful.

Implement the ->fast_switch callback in the ACPI cpufreq driver
(with a limited coverage for the time being).

In addition to the above, cpufreq_governor_limits() is modified so
it doesn't call __cpufreq_driver_target() to enforce the new limits
immediately if the fast_switch_enabled flag is set for the policy,
because in that case the frequency will be updated immediately
using the new limits anyway and the additional invocation of
__cpufreq_driver_target() might be racing with that violating
the cpufreq_driver_fast_switch() requirements.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Changes from v3:
- Rebase on the v4 of patch [1/2].
- Change the table lookup algo in acpi_cpufreq_fast_switch() to take the table
  ordering into account and to prefer higher frequencies (if the target falls
  between two of them).

Changes from v2:
- Rework due to the revised handling of policy->min and policy->max changes.

v2 was the first version of this patch.

---
 drivers/cpufreq/acpi-cpufreq.c      |   63 ++++++++++++++++++++++++++++++++++++
 drivers/cpufreq/cpufreq.c           |   31 +++++++++++++++++
 drivers/cpufreq/cpufreq_governor.c  |   13 ++++---
 drivers/cpufreq/cpufreq_governor.h  |    1 
 drivers/cpufreq/cpufreq_schedutil.c |   16 ++++++++-
 include/linux/cpufreq.h             |    5 ++
 6 files changed, 122 insertions(+), 7 deletions(-)

Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -70,6 +70,7 @@ struct acpi_cpufreq_data {
 	unsigned int cpu_feature;
 	unsigned int acpi_perf_cpu;
 	cpumask_var_t freqdomain_cpus;
+	void (*cpu_freq_fast_write)(u32 val);
 };
 
 /* acpi_perf_data is a pointer to percpu data. */
@@ -243,6 +244,15 @@ static unsigned extract_freq(u32 val, st
 	}
 }
 
+void cpu_freq_fast_write_intel(u32 val)
+{
+	u32 lo, hi;
+
+	rdmsr(MSR_IA32_PERF_CTL, lo, hi);
+	lo = (lo & ~INTEL_MSR_RANGE) | (val & INTEL_MSR_RANGE);
+	wrmsr(MSR_IA32_PERF_CTL, lo, hi);
+}
+
 struct msr_addr {
 	u32 reg;
 };
@@ -484,6 +494,53 @@ out:
 	return result;
 }
 
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+				      unsigned int target_freq)
+{
+	struct acpi_cpufreq_data *data = policy->driver_data;
+	struct acpi_processor_performance *perf;
+	struct cpufreq_frequency_table *entry, *next;
+	unsigned int next_perf_state, next_freq, freq;
+
+	/*
+	 * Find the closest frequency above target_freq or equal to it.
+	 *
+	 * The table is sorted in the reverse order with respect to the
 +	 * frequency and all of the entries are valid (see the initialization).
+	 */
+	entry = data->freq_table;
+	do {
+		next = entry;
+		entry++;
+		freq = entry->frequency;
+	} while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+
+	next_freq = next->frequency;
+	/*
+	 * The frequency below the target may be more suitable, so check and
+	 * choose it if that's the case.
+	 */
+	if (next_freq != target_freq && freq != CPUFREQ_TABLE_END &&
+	    (freq + ((next_freq - freq) >> 2) > target_freq)) {
+		next_freq = freq;
+		next_perf_state = entry->driver_data;
+	} else {
+		next_perf_state = next->driver_data;
+	}
+
+	perf = to_perf_data(data);
+	if (perf->state == next_perf_state) {
+		if (unlikely(data->resume))
+			data->resume = 0;
+		else
+			return next_freq;
+	}
+
+	data->cpu_freq_fast_write(perf->states[next_perf_state].control);
+	perf->state = next_perf_state;
+	return next_freq;
+}
+
 static unsigned long
 acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
 {
@@ -745,6 +802,7 @@ static int acpi_cpufreq_cpu_init(struct
 		pr_debug("HARDWARE addr space\n");
 		if (check_est_cpu(cpu)) {
 			data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE;
+			data->cpu_freq_fast_write = cpu_freq_fast_write_intel;
 			break;
 		}
 		if (check_amd_hwpstate_cpu(cpu)) {
@@ -760,6 +818,10 @@ static int acpi_cpufreq_cpu_init(struct
 		goto err_unreg;
 	}
 
+	policy->fast_switch_possible = data->cpu_freq_fast_write != NULL &&
+		!acpi_pstate_strict && !(policy_is_shared(policy) &&
+			policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
 	data->freq_table = kzalloc(sizeof(*data->freq_table) *
 		    (perf->state_count+1), GFP_KERNEL);
 	if (!data->freq_table) {
@@ -894,6 +956,7 @@ static struct freq_attr *acpi_cpufreq_at
 static struct cpufreq_driver acpi_cpufreq_driver = {
 	.verify		= cpufreq_generic_frequency_table_verify,
 	.target_index	= acpi_cpufreq_target,
+	.fast_switch	= acpi_cpufreq_fast_switch,
 	.bios_limit	= acpi_processor_get_bios_limit,
 	.init		= acpi_cpufreq_cpu_init,
 	.exit		= acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -82,6 +82,7 @@ struct cpufreq_policy {
 	void			*governor_data;
 	bool			governor_enabled; /* governor start/stop flag */
 	char			last_governor[CPUFREQ_NAME_LEN]; /* last governor used */
+	bool			fast_switch_possible;
 
 	struct work_struct	update; /* if update_policy() needs to be
 					 * called, but you're in IRQ context */
@@ -271,6 +272,8 @@ struct cpufreq_driver {
 				  unsigned int relation);	/* Deprecated */
 	int		(*target_index)(struct cpufreq_policy *policy,
 					unsigned int index);
+	unsigned int	(*fast_switch)(struct cpufreq_policy *policy,
+				       unsigned int target_freq);
 	/*
 	 * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
 	 * unset.
@@ -485,6 +488,8 @@ struct cpufreq_governor {
 };
 
 /* Pass a target to the cpufreq driver */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+				unsigned int target_freq);
 int cpufreq_driver_target(struct cpufreq_policy *policy,
 				 unsigned int target_freq,
 				 unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -1814,6 +1814,37 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
  *                              GOVERNORS                            *
  *********************************************************************/
 
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * This function must not be called if policy->fast_switch_possible is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback, the hardware configuration must be preserved.
+ */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+				unsigned int target_freq)
+{
+	unsigned int freq;
+
+	if (target_freq == policy->cur)
+		return;
+
+	freq = cpufreq_driver->fast_switch(policy, target_freq);
+	if (freq != CPUFREQ_ENTRY_INVALID) {
+		policy->cur = freq;
+		trace_cpu_frequency(freq, smp_processor_id());
+	}
+}
+
 /* Must set freqs->new to intermediate frequency */
 static int __target_intermediate(struct cpufreq_policy *policy,
 				 struct cpufreq_freqs *freqs, int index)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -613,15 +613,18 @@ static int cpufreq_governor_limits(struc
 
 	mutex_lock(&policy_dbs->timer_mutex);
 
-	if (policy->max < policy->cur)
-		__cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
-	else if (policy->min > policy->cur)
-		__cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_L);
+	if (!policy_dbs->fast_switch_enabled) {
+		if (policy->max < policy->cur)
+			__cpufreq_driver_target(policy, policy->max,
+						CPUFREQ_RELATION_H);
+		else if (policy->min > policy->cur)
+			__cpufreq_driver_target(policy, policy->min,
+						CPUFREQ_RELATION_L);
+	}
 
 	gov_update_sample_delay(policy_dbs, 0);
 
 	mutex_unlock(&policy_dbs->timer_mutex);
-
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_schedutil.c
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -86,8 +86,19 @@ static void sugov_update_commit(struct p
 	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
 
 	policy_dbs->last_sample_time = time;
-	if (sg_policy->next_freq != next_freq) {
-		sg_policy->next_freq = next_freq;
+	if (sg_policy->next_freq == next_freq)
+		return;
+
+	sg_policy->next_freq = next_freq;
+	if (policy_dbs->fast_switch_enabled) {
+		cpufreq_driver_fast_switch(policy_dbs->policy, next_freq);
+		/*
+		 * Restore the sample delay in case it has been set to 0 from
+		 * sysfs in the meantime.
+		 */
+		gov_update_sample_delay(policy_dbs,
+					policy_dbs->dbs_data->sampling_rate);
+	} else {
 		policy_dbs->work_in_progress = true;
 		irq_work_queue(&policy_dbs->irq_work);
 	}
@@ -191,6 +202,7 @@ static bool sugov_start(struct cpufreq_p
 
 	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
 	policy_dbs->last_sample_time = 0;
+	policy_dbs->fast_switch_enabled = policy->fast_switch_possible;
 	sg_policy->next_freq = UINT_MAX;
 
 	for_each_cpu(cpu, policy->cpus) {
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -124,6 +124,7 @@ struct policy_dbs_info {
 	/* Status indicators */
 	bool is_shared;		/* This object is used by multiple CPUs */
 	bool work_in_progress;	/* Work is being queued up or in progress */
 +	bool fast_switch_enabled;	/* Switch frequencies from interrupt context */
 };
 
 static inline void gov_update_sample_delay(struct policy_dbs_info *policy_dbs,

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-25 11:01       ` Juri Lelli
@ 2016-02-26  2:36         ` Rafael J. Wysocki
  2016-03-01 14:56           ` Juri Lelli
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-26  2:36 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Steve Muckle,
	Ingo Molnar

On Thursday, February 25, 2016 11:01:20 AM Juri Lelli wrote:
> On 23/02/16 00:02, Rafael J. Wysocki wrote:
> > On Mon, Feb 22, 2016 at 3:16 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > > Hi Rafael,
> > 
> > Hi,
> > 
> 
> Sorry, my reply to this got delayed a bit.

No problem at all.

> > > thanks for this RFC. I'm going to test it more in the next few days, but
> > > I already have some questions from skimming through it. Please find them
> > > inline below.
> > >
> > > On 22/02/16 00:18, Rafael J. Wysocki wrote:
> > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >>
> > >> Add a new cpufreq scaling governor, called "schedutil", that uses
> > >> scheduler-provided CPU utilization information as input for making
> > >> its decisions.
> > >>
> > >
> > > I guess the first (macro) question is why did you decide to go with a
> > > complete new governor, where new here is w.r.t. the sched-freq solution.
> > 
> > Probably the most comprehensive answer to this question is my intro
> > message: http://marc.info/?l=linux-pm&m=145609673008122&w=2
> > 
> > The executive summary is probably that this was the most
> > straightforward way to use the scheduler-provided numbers in cpufreq
> > that I could think about.
> > 
> > > AFAICT, it is true that your solution directly builds on top of the
> > > latest changes to cpufreq core and governor, but it also seems to have
> > > more than a few points in common with sched-freq,
> > 
> > That surely isn't a drawback, is it?
> > 
> 
> Not at all. I guess that I was simply wondering why you felt that a new
> approach was required. But you explain this below. :-)
> 
> > If two people come to the same conclusions in different ways, that's
> > an indication that the conclusions may actually be correct.
> > 
> > > and sched-freq has been discussed and evaluated for already quite some time.
> > 
> > Yes, it has.
> > 
> > Does this mean that no one is allowed to try any alternatives to it now?
> >
> 
> Of course not. I'm mostly inline with what Steve replied here. But yes,
> I think that we can only gain better understanding by reviewing both
> RFCs.
> 
> > > Also, it appears to me that they both shares (or they might encounter in the
> > > future as development progresses) the same kind of problems, like for
> > > example the fact that we can't trigger opp changes from scheduler
> > > context ATM.
> > 
> > "Give them a finger and they will ask for the hand."
> > 
> 
> I'm sorry if you felt that I was asking too much from an RFC. I wasn't
> in fact, what I wanted to say is that the two alternatives seemed to
> share the same kind of problems. Well, now it seems that you have
> already proposed a solution for one of them. :-)

That actually was a missing piece that I had planned to add from the outset, but
then decided to keep it simple to start with and omit that part until I can
clean up the ACPI driver enough to make it fit the whole picture more naturally.

I still want to do that, which is mostly why I'm regarding that patch as a
prototype.

> > If you read my intro message linked above, you'll find a paragraph or
> > two about that in it.
> > 
> > And the short summary is that I have a plan to actually implement that
> > feature in the schedutil governor at least for the ACPI cpufreq
> > driver.  It shouldn't be too difficult to do either AFAICS.  So it is
> > not "we can't", but rather "we haven't implemented that yet" in this
> > particular case.
> > 
> > I may not be able to do that in the next few days, as I have other
> > things to do too, but you may expect to see that done at one point.
> > 
> > So it's not a fundamental issue or anything, it's just that I haven't
> > done that *yet* at this point, OK?
> > 
> 
> Sure. I saw what you are proposing to solve this. I'll reply to that
> patch if I'll have any comments.
> 
> > > Don't get me wrong. I think that looking at different ways to solve a
> > > problem is always beneficial, since I guess that the goal in the end is
> > > to come up with something that suits everybody's needs.
> > 
> > Precisely.
> > 
> > > I was only curious about your thoughts on sched-freq. But we can also wait for the
> > > next RFC from Steve for this macro question. :-)
> > 
> > Right, but I have some thoughts anyway.
> > 
> > My goal, that may be quite different from yours, is to reduce the
> > cpufreq's overhead as much as I possibly can.  If I have to change the
> > way it drives the CPU frequency selection to achieve that goal, I will
> > do that, but if that can stay the way it is, that's fine too.
> > 
> 
> As Steve already said, this was not our primary goal. But it is for sure
> beneficial for everybody.
> 
> > Some progress has been made already here: we have dealt with the
> > timers for good now I think.
> > 
> > This patch deals with the overhead associated with the load tracking
> > carried by "traditional" cpufreq governors and with a couple of
> > questionable things done by "ondemand" in addition to that (which is
> > one of the reasons why I didn't want to modify "ondemand" itself for
> > now).
> > 
> > The next step will be to teach the governor and the ACPI driver to
> > switch CPU frequencies in the scheduler context, without spawning
> > extra work items etc.
> > 
> > Finally, the sampling should go away and that's where I want it to be.
> > 
> > I just don't want to run extra code when that's not necessary and I
> > want things to stay simple when that's as good as it can get.  If
> > sched-freq can pull that off for me, that's fine, but can it really?
> > 
> > > [...]
> > >
> > >> +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
> > >> +                             unsigned int next_freq)
> > >> +{
> > >> +     struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > >> +
> > >> +     sg_policy->next_freq = next_freq;
> > >> +     policy_dbs->last_sample_time = time;
> > >> +     policy_dbs->work_in_progress = true;
> > >> +     irq_work_queue(&policy_dbs->irq_work);
> > >
> > > Here we basically use the system wq to be able to do the freq transition
> > > in process context. CFS is probably fine with this, but don't you think
> > > we might get into troubles when, in the future, we will want to service
> > > RT/DL requests more properly and they will end up being serviced
> > > together with all the others wq users and at !RT priority?
> > 
> > That may be regarded as a problem, but I'm not sure why you're talking
> > about it in the context of this particular patch.  That problem has
> > been there forever in cpufreq: in theory RT tasks may stall frequency
> > changes indefinitely.
> > 
> > Is the problem real, though?
> > 
> > Suppose that that actually happens and there are RT tasks effectively
> > stalling frequency updates.  In that case some other important
> > activities of the kernel would be stalled too.  Pretty much everything
> > run out of regular workqueues would be affected.
> > 
> > OK, so do we need to address that somehow?  Maybe, but then we should
> > take care of all of the potentially affected stuff and not about the
> > frequency changes only, shouldn't we?
> > 
> 
> Ideally yes, I'd say. But since we are at it with regard to frequency
> changes, I thought we could spend some more time understanding whether we
> can achieve something that is better than what we have today.

That's a worthy goal in general, but the approach with using dedicated RT
threads is not my favourite to be honest.

One problem with it I can see is that if things are starved by RT threads
already, adding more RT threads to the mix will only add to the starvation
of the *other* stuff, which may need to make progress too for the system
to stay generally usable.

Plus I'm not sure how it would prevent starvation caused by FIFO RT tasks
from happening.

Having those threads per policy (and bound to the policy CPUs) is also rather
wasteful, because it is not even necessary to invoke __cpufreq_driver_target()
from one of the policy CPUs.  Drivers should just do the right thing if that's
not the case, as the "old" governors don't guarantee that anyway.  So maybe
something like a pool of threads that may run on any available CPU would work,
but that's so similar to the workqueue subsystem design that I'm not sure if
duplicating it is really a good idea.

Quite honestly, the only way to avoid this problem in frequency scaling
I can see today is to do the switching of frequencies from the scheduler,
but there are systems where that's clearly impossible.  Tough one. :-)

> > Moreover, I don't really think that having a separate RT process for
> > every CPU in the system just for this purpose is the right approach.
> > It's just way overkill IMO and doesn't cover the other potentially
> > affected stuff at all.
> > 
> > >> +}
> > >> +
> > >> +static void sugov_update_shared(struct update_util_data *data, u64 time,
> > >> +                             unsigned long util, unsigned long max)
> > >> +{
> > >
> > > We don't have a way to tell from which scheduling class this is coming
> > > from, do we? And if that is true can't a request from CFS overwrite
> > > RT/DL go to max requests?
> > 
> > Yes, it can, but we get updated when the CPU is updating its own rq
> > only, so if my understanding of things is correct, an update from a
> > different sched class means that the given class is in charge now.
> > 
> 
> That is right. But can't a higher priority class eat all the needed
> capacity. I mean, suppose that both CFS and DL need 30% of CPU capacity
> on the same CPU. DL wins and gets its 30% of capacity. When CFS gets to
> run it's too late for requesting anything more (w.r.t. the same time
> window). If we somehow aggregate requests instead, we could request 60%
> and both classes can have their capacity to run. It seems to me that
> this is what governors were already doing by using the 1 - idle metric.

That's interesting, because it is about a few different things at a time. :-)

So first of all the "old" governors only collect information about what
happened in the past and make decisions on that basis (kind of in the hope
that what happened once will happen again), while the idea behind what
you're describing seems to be to attempt to project future needs for
capacity and use that to make decisions (just for the very near future,
but that should be sufficient).  If successful, that would be the most
suitable approach IMO.

Of course, the $subject patch is not aspiring to anything of that kind.
It only uses information about current needs that's already available to
it in a very straightforward way.

But there's more to it.  In the sampling, or rate-limiting if you will,
situation you really have a window in which many things can happen and
making a good decision at the beginning of it is important.  However, if
you can just handle *every* request and really switch frequencies on the
fly, then each of them may come with a "currently needed capacity" number
and you can just give it what it asks for every time.

My point is that there are quite a few things to consider here and I'm
expecting a learning process to happen before we are happy with what we
have.  So my approach would be (and is) to start very simple and then
add more complexity over time as needed instead of just trying to address
every issue I can think about from the outset.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-02-25 21:14     ` [RFC/RFT][PATCH v4 " Rafael J. Wysocki
@ 2016-02-27  0:21       ` Rafael J. Wysocki
  2016-02-27  4:33       ` Steve Muckle
  1 sibling, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-27  0:21 UTC (permalink / raw)
  To: Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Steve Muckle, Ingo Molnar

On Thursday, February 25, 2016 10:14:35 PM Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
> 
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
> 
> The new governor is very simple.  It is almost as simple as it
> can be and remain reasonably functional.
> 
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter, but instead of computing the
> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> the utilization data provided by CFS as input.  More specifically,
> it represents "load" as the util/max ratio, where util and max
> are the utilization and CPU capacity coming from CFS.
> 
> All of the computations are carried out in the utilization update
> handlers provided by the new governor.  One of those handlers is
> used for cpufreq policies shared between multiple CPUs and the other
> one is for policies with one CPU only (and therefore it doesn't need
> to use any extra synchronization means).  The only operation carried
> out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
> is a __cpufreq_driver_target() call to trigger a frequency update (to
> a value already computed beforehand in one of the utilization update
> handlers).  This means that, at least for some cpufreq drivers that
> can update CPU frequency by doing simple register writes, it should
> be possible to set the frequency in the utilization update handlers
> too in which case all of the governor's activity would take place in
> the scheduler paths invoking cpufreq_update_util() without the need
> to run anything in process context.
> 
> Currently, the governor treats all of the RT and DL tasks as
> "unknown utilization" and sets the frequency to the allowed
> maximum when updated from the RT or DL sched classes.  That
> heavy-handed approach should be replaced with something more
> specifically targeted at RT and DL tasks.
> 
> To some extent it relies on the common governor code in
> cpufreq_governor.c and it uses that code in a somewhat unusual
> way (different from what the "ondemand" and "conservative"
> governors do), so some small and rather unintrusive changes
> have to be made in that code and the other governors to support it.
> 
> However, after making it possible to set the CPU frequency from
> the utilization update handlers, that new governor's interactions
> with the common code might be limited to the initialization, cleanup
> and handling of sysfs attributes (currently only one attribute,
> sampling_rate, is supported in addition to the standard policy
> attributes handled by the cpufreq core).
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.
> 
> Changes from v2:
> - Avoid requesting the same frequency that was requested last time for
>   the given policy.
> 
> Changes from v1:
> - Use policy->min and policy->max as min/max frequency in computations.
> 

Well, this is getting interesting. :-)

Some preliminary results from SpecPower indicate that this governor (with
patch [2/2] on top), as simple as it is, beats "ondemand" in performance/power
and generally allows the system to achieve better performance.  The difference
is not significant, but measurable.

If that is confirmed in further testing, I will be inclined to drop this
thing in during the next cycle.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-02-25 21:14     ` [RFC/RFT][PATCH v4 " Rafael J. Wysocki
  2016-02-27  0:21       ` Rafael J. Wysocki
@ 2016-02-27  4:33       ` Steve Muckle
  2016-02-27 15:24         ` Rafael J. Wysocki
  1 sibling, 1 reply; 32+ messages in thread
From: Steve Muckle @ 2016-02-27  4:33 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM list
  Cc: Juri Lelli, Linux Kernel Mailing List, Viresh Kumar,
	Srinivas Pandruvada, Peter Zijlstra, Ingo Molnar

On 02/25/2016 01:14 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
> 
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
> 
> The new governor is very simple.  It is almost as simple as it
> can be and remain reasonably functional.
> 
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter, but instead of computing the
> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> the utilization data provided by CFS as input.  More specifically,
> it represents "load" as the util/max ratio, where util and max
> are the utilization and CPU capacity coming from CFS.
> 
> All of the computations are carried out in the utilization update
> handlers provided by the new governor.  One of those handlers is
> used for cpufreq policies shared between multiple CPUs and the other
> one is for policies with one CPU only (and therefore it doesn't need
> to use any extra synchronization means).  The only operation carried
> out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
> is a __cpufreq_driver_target() call to trigger a frequency update (to
> a value already computed beforehand in one of the utilization update
> handlers).  This means that, at least for some cpufreq drivers that
> can update CPU frequency by doing simple register writes, it should
> be possible to set the frequency in the utilization update handlers
> too in which case all of the governor's activity would take place in
> the scheduler paths invoking cpufreq_update_util() without the need
> to run anything in process context.
> 
> Currently, the governor treats all of the RT and DL tasks as
> "unknown utilization" and sets the frequency to the allowed
> maximum when updated from the RT or DL sched classes.  That
> heavy-handed approach should be replaced with something more
> specifically targeted at RT and DL tasks.
> 
> To some extent it relies on the common governor code in
> cpufreq_governor.c and it uses that code in a somewhat unusual
> way (different from what the "ondemand" and "conservative"
> governors do), so some small and rather unintrusive changes
> have to be made in that code and the other governors to support it.
> 
> However, after making it possible to set the CPU frequency from
> the utilization update handlers, that new governor's interactions
> with the common code might be limited to the initialization, cleanup
> and handling of sysfs attributes (currently only one attribute,
> sampling_rate, is supported in addition to the standard policy
> attributes handled by the cpufreq core).

We'll still need a slow path for platforms that don't support fast
transitions, right?

Or is this referring to plans to add an RT thread for slow-path changes
instead of using the dbs stuff? :)

> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.
> 
> Changes from v2:
> - Avoid requesting the same frequency that was requested last time for
>   the given policy.
> 
> Changes from v1:
> - Use policy->min and policy->max as min/max frequency in computations.
> 
> ---
>  drivers/cpufreq/Kconfig                |   15 +
>  drivers/cpufreq/Makefile               |    1 
>  drivers/cpufreq/cpufreq_conservative.c |    3 
>  drivers/cpufreq/cpufreq_governor.c     |   21 +-
>  drivers/cpufreq/cpufreq_governor.h     |    2 
>  drivers/cpufreq/cpufreq_ondemand.c     |    3 
>  drivers/cpufreq/cpufreq_schedutil.c    |  253 +++++++++++++++++++++++++++++++++
>  7 files changed, 288 insertions(+), 10 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.h
> @@ -164,7 +164,7 @@ struct dbs_governor {
>  	void (*free)(struct policy_dbs_info *policy_dbs);
>  	int (*init)(struct dbs_data *dbs_data, bool notify);
>  	void (*exit)(struct dbs_data *dbs_data, bool notify);
> -	void (*start)(struct cpufreq_policy *policy);
> +	bool (*start)(struct cpufreq_policy *policy);
>  };
>  
>  static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
> Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> ===================================================================
> --- /dev/null
> +++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> @@ -0,0 +1,253 @@
> +/*
> + * CPUFreq governor based on scheduler-provided CPU utilization data.
> + *
> + * Copyright (C) 2016, Intel Corporation
> + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/percpu-defs.h>
> +#include <linux/slab.h>
> +
> +#include "cpufreq_governor.h"
> +
> +struct sugov_policy {
> +	struct policy_dbs_info policy_dbs;
> +	unsigned int next_freq;
> +	raw_spinlock_t update_lock;  /* For shared policies */
> +};
> +
> +static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
> +{
> +	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
> +}
> +
> +struct sugov_cpu {
> +	struct update_util_data update_util;
> +	struct policy_dbs_info *policy_dbs;
> +	/* The fields below are only needed when sharing a policy. */
> +	unsigned long util;
> +	unsigned long max;
> +	u64 last_update;
> +};
> +
> +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> +
> +/************************ Governor internals ***********************/
> +
> +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
> +				    unsigned long util, unsigned long max,
> +				    u64 last_sample_time)
> +{
> +	struct cpufreq_policy *policy = policy_dbs->policy;
> +	unsigned int min_f = policy->min;
> +	unsigned int max_f = policy->max;
> +	unsigned int j;
> +
> +	if (util > max || min_f >= max_f)
> +		return max_f;
> +
> +	for_each_cpu(j, policy->cpus) {
> +		struct sugov_cpu *j_sg_cpu;
> +		unsigned long j_util, j_max;
> +
> +		if (j == smp_processor_id())
> +			continue;
> +
> +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> +		/*
> +		 * If the CPU was last updated before the previous sampling
> +		 * time, we don't take it into account as it probably is idle
> +		 * now.
> +		 */

If the sampling rate is less than the tick, it seems possible that a busy CPU
may not manage to get an update in during the current sampling period.

What about checking to see if there was an update within the last tick
period, rather than sample period?

There's also NO_HZ_FULL. I don't know that anyone using NO_HZ_FULL would
want to use a governor other than performance, but I thought I'd mention
it just in case.

> +		if (j_sg_cpu->last_update < last_sample_time)
> +			continue;
> +
> +		j_util = j_sg_cpu->util;
> +		j_max = j_sg_cpu->max;
> +		if (j_util > j_max)
> +			return max_f;
> +
> +		if (j_util * max > j_max * util) {
> +			util = j_util;
> +			max = j_max;
> +		}
> +	}
> +
> +	return min_f + util * (max_f - min_f) / max;
> +}
> +
> +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
> +				unsigned int next_freq)
> +{
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +
> +	policy_dbs->last_sample_time = time;
> +	if (sg_policy->next_freq != next_freq) {
> +		sg_policy->next_freq = next_freq;
> +		policy_dbs->work_in_progress = true;
> +		irq_work_queue(&policy_dbs->irq_work);
> +	}
> +}
> +
> +static void sugov_update_shared(struct update_util_data *data, u64 time,
> +				unsigned long util, unsigned long max)
> +{
> +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +	unsigned int next_f;
> +	u64 delta_ns, lst;
> +
> +	raw_spin_lock(&sg_policy->update_lock);
> +
> +	sg_cpu->util = util;
> +	sg_cpu->max = max;
> +	sg_cpu->last_update = time;
> +
> +	if (policy_dbs->work_in_progress)
> +		goto out;
> +
> +	/*
> +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> +	 */
> +	lst = policy_dbs->last_sample_time;
> +	delta_ns = time - lst;
> +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> +		goto out;
> +
> +	next_f = sugov_next_freq(policy_dbs, util, max, lst);
> +
> +	sugov_update_commit(policy_dbs, time, next_f);
> +
> + out:
> +	raw_spin_unlock(&sg_policy->update_lock);
> +}
> +
> +static void sugov_update_single(struct update_util_data *data, u64 time,
> +				unsigned long util, unsigned long max)
> +{
> +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> +	unsigned int min_f, max_f, next_f;
> +	u64 delta_ns;
> +
> +	if (policy_dbs->work_in_progress)
> +		return;
> +
> +	/*
> +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> +	 */
> +	delta_ns = time - policy_dbs->last_sample_time;
> +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> +		return;
> +
> +	min_f = policy_dbs->policy->min;
> +	max_f = policy_dbs->policy->max;
> +	next_f = util > max || min_f >= max_f ? max_f :
> +			min_f + util * (max_f - min_f) / max;
> +
> +	sugov_update_commit(policy_dbs, time, next_f);
> +}
> +
> +/************************** sysfs interface ************************/
> +
> +gov_show_one_common(sampling_rate);
> +
> +gov_attr_rw(sampling_rate);
> +
> +static struct attribute *sugov_attributes[] = {
> +	&sampling_rate.attr,
> +	NULL
> +};
> +
> +/********************** dbs_governor interface *********************/
> +
> +static struct policy_dbs_info *sugov_alloc(void)
> +{
> +	struct sugov_policy *sg_policy;
> +
> +	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> +	if (!sg_policy)
> +		return NULL;
> +
> +	raw_spin_lock_init(&sg_policy->update_lock);
> +	return &sg_policy->policy_dbs;
> +}
> +
> +static void sugov_free(struct policy_dbs_info *policy_dbs)
> +{
> +	kfree(to_sg_policy(policy_dbs));

Is it possible that a CPU could still be in the sugov_update_* path
above when this runs? I see that cpufreq_governor_stop will call
gov_cancel_work which clears the update_util hook, but I don't see
anything that synchronizes with all CPUs having exited the update path.

> +}
> +
> +static bool sugov_start(struct cpufreq_policy *policy)
> +{
> +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +	unsigned int cpu;
> +
> +	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
> +	policy_dbs->last_sample_time = 0;
> +	sg_policy->next_freq = UINT_MAX;
> +
> +	for_each_cpu(cpu, policy->cpus) {
> +		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> +
> +		sg_cpu->policy_dbs = policy_dbs;
> +		if (policy_dbs->is_shared) {
> +			sg_cpu->util = ULONG_MAX;
> +			sg_cpu->max = 0;
> +			sg_cpu->last_update = 0;
> +			sg_cpu->update_util.func = sugov_update_shared;
> +		} else {
> +			sg_cpu->update_util.func = sugov_update_single;
> +		}
> +		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
> +	}
> +	return false;
> +}
> +
> +static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
> +{
> +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +
> +	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);

Similar to the concern raised in the acpi changes, I think this should be
CPUFREQ_RELATION_L; otherwise you may not run fast enough to meet demand.

> +	return policy_dbs->dbs_data->sampling_rate;
> +}
> +
> +static struct dbs_governor schedutil_gov = {
> +	.gov = {
> +		.name = "schedutil",
> +		.governor = cpufreq_governor_dbs,
> +		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
> +		.owner = THIS_MODULE,
> +	},
> +	.kobj_type = { .default_attrs = sugov_attributes },
> +	.gov_dbs_timer = sugov_set_freq,
> +	.alloc = sugov_alloc,
> +	.free = sugov_free,
> +	.start = sugov_start,
> +};
> +
> +#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
> +
> +static int __init sugov_init(void)
> +{
> +	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
> +}
> +
> +static void __exit sugov_exit(void)
> +{
> +	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
> +}
> +
> +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
> +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> +MODULE_LICENSE("GPL");
> +
> +module_init(sugov_init);
> +module_exit(sugov_exit);
> Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
> +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
> @@ -299,12 +299,13 @@ static void cs_exit(struct dbs_data *dbs
>  	kfree(dbs_data->tuners);
>  }
>  
> -static void cs_start(struct cpufreq_policy *policy)
> +static bool cs_start(struct cpufreq_policy *policy)
>  {
>  	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
>  
>  	dbs_info->down_skip = 0;
>  	dbs_info->requested_freq = policy->cur;
> +	return true;
>  }
>  
>  static struct dbs_governor cs_dbs_gov = {
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> @@ -472,9 +472,11 @@ static int cpufreq_governor_init(struct
>  	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
>  	mutex_init(&dbs_data->mutex);
>  
> -	ret = gov->init(dbs_data, !policy->governor->initialized);
> -	if (ret)
> -		goto free_policy_dbs_info;
> +	if (gov->init) {
> +		ret = gov->init(dbs_data, !policy->governor->initialized);
> +		if (ret)
> +			goto free_policy_dbs_info;
> +	}
>  
>  	/* policy latency is in ns. Convert it to us first */
>  	latency = policy->cpuinfo.transition_latency / 1000;
> @@ -510,7 +512,10 @@ static int cpufreq_governor_init(struct
>  
>  	if (!have_governor_per_policy())
>  		gov->gdbs_data = NULL;
> -	gov->exit(dbs_data, !policy->governor->initialized);
> +
> +	if (gov->exit)
> +		gov->exit(dbs_data, !policy->governor->initialized);
> +
>  	kfree(dbs_data);
>  
>  free_policy_dbs_info:
> @@ -544,7 +549,9 @@ static int cpufreq_governor_exit(struct
>  		if (!have_governor_per_policy())
>  			gov->gdbs_data = NULL;
>  
> -		gov->exit(dbs_data, policy->governor->initialized == 1);
> +		if (gov->exit)
> +			gov->exit(dbs_data, policy->governor->initialized == 1);
> +
>  		mutex_destroy(&dbs_data->mutex);
>  		kfree(dbs_data);
>  	} else {
> @@ -588,9 +595,9 @@ static int cpufreq_governor_start(struct
>  			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
>  	}
>  
> -	gov->start(policy);
> +	if (gov->start(policy))
> +		gov_set_update_util(policy_dbs, sampling_rate);
>  
> -	gov_set_update_util(policy_dbs, sampling_rate);
>  	return 0;
>  }
>  
> Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
> +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> @@ -402,12 +402,13 @@ static void od_exit(struct dbs_data *dbs
>  	kfree(dbs_data->tuners);
>  }
>  
> -static void od_start(struct cpufreq_policy *policy)
> +static bool od_start(struct cpufreq_policy *policy)
>  {
>  	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
>  
>  	dbs_info->sample_type = OD_NORMAL_SAMPLE;
>  	ondemand_powersave_bias_init(policy);
> +	return true;
>  }
>  
>  static struct od_ops od_ops = {
> Index: linux-pm/drivers/cpufreq/Kconfig
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/Kconfig
> +++ linux-pm/drivers/cpufreq/Kconfig
> @@ -184,6 +184,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
>  
>  	  If in doubt, say N.
>  
> +config CPU_FREQ_GOV_SCHEDUTIL
> +	tristate "'schedutil' cpufreq policy governor"
> +	depends on CPU_FREQ

Do you need to select IRQ_WORK here?

> +	select CPU_FREQ_GOV_COMMON
> +	help
> +	  The frequency selection formula used by this governor is analogous
> +	  to the one used by 'ondemand', but instead of computing CPU load
> +	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
> +	  utilization data provided by the scheduler as input.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called cpufreq_schedutil.
> +
> +	  If in doubt, say N.
> +
>  comment "CPU frequency scaling drivers"
>  
>  config CPUFREQ_DT
> Index: linux-pm/drivers/cpufreq/Makefile
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/Makefile
> +++ linux-pm/drivers/cpufreq/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
>  obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
>  obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
>  obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
>  obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
>  
>  obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
> 

thanks,
Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-02-27  4:33       ` Steve Muckle
@ 2016-02-27 15:24         ` Rafael J. Wysocki
  2016-03-01  4:10           ` Steve Muckle
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-02-27 15:24 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Ingo Molnar

On Friday, February 26, 2016 08:33:19 PM Steve Muckle wrote:
> On 02/25/2016 01:14 PM, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > Add a new cpufreq scaling governor, called "schedutil", that uses
> > scheduler-provided CPU utilization information as input for making
> > its decisions.
> > 
> > Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> > mechanism for registering utilization update callbacks) that
> > introduced cpufreq_update_util() called by the scheduler on
> > utilization changes (from CFS) and RT/DL task status updates.
> > In particular, CPU frequency scaling decisions may be based on
> > the utilization data passed to cpufreq_update_util() by CFS.
> > 
> > The new governor is very simple.  It is almost as simple as it
> > can be and remain reasonably functional.
> > 
> > The frequency selection formula used by it is essentially the same
> > as the one used by the "ondemand" governor, although it doesn't use
> > the additional up_threshold parameter, but instead of computing the
> > load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> > the utilization data provided by CFS as input.  More specifically,
> > it represents "load" as the util/max ratio, where util and max
> > are the utilization and CPU capacity coming from CFS.
> > 
> > All of the computations are carried out in the utilization update
> > handlers provided by the new governor.  One of those handlers is
> > used for cpufreq policies shared between multiple CPUs and the other
> > one is for policies with one CPU only (and therefore it doesn't need
> > to use any extra synchronization means).  The only operation carried
> > out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
> > is a __cpufreq_driver_target() call to trigger a frequency update (to
> > a value already computed beforehand in one of the utilization update
> > handlers).  This means that, at least for some cpufreq drivers that
> > can update CPU frequency by doing simple register writes, it should
> > be possible to set the frequency in the utilization update handlers
> > too in which case all of the governor's activity would take place in
> > the scheduler paths invoking cpufreq_update_util() without the need
> > to run anything in process context.
> > 
> > Currently, the governor treats all of the RT and DL tasks as
> > "unknown utilization" and sets the frequency to the allowed
> > maximum when updated from the RT or DL sched classes.  That
> > heavy-handed approach should be replaced with something more
> > specifically targeted at RT and DL tasks.
> > 
> > To some extent it relies on the common governor code in
> > cpufreq_governor.c and it uses that code in a somewhat unusual
> > way (different from what the "ondemand" and "conservative"
> > governors do), so some small and rather unintrusive changes
> > have to be made in that code and the other governors to support it.
> > 
> > However, after making it possible to set the CPU frequency from
> > the utilization update handlers, that new governor's interactions
> > with the common code might be limited to the initialization, cleanup
> > and handling of sysfs attributes (currently only one attribute,
> > sampling_rate, is supported in addition to the standard policy
> > attributes handled by the cpufreq core).
> 
> We'll still need a slow path for platforms that don't support fast
> transitions, right?

Right.

> Or is this referring to plans to add an RT thread for slowpath changes
> instead of using the dbs stuff? :)

I did not consider that when I was working on the $subject patch. :-)

> > 
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > 
> > There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.
> > 
> > Changes from v2:
> > - Avoid requesting the same frequency that was requested last time for
> >   the given policy.
> > 
> > Changes from v1:
> > - Use policy->min and policy->max as min/max frequency in computations.
> > 
> > ---
> >  drivers/cpufreq/Kconfig                |   15 +
> >  drivers/cpufreq/Makefile               |    1 
> >  drivers/cpufreq/cpufreq_conservative.c |    3 
> >  drivers/cpufreq/cpufreq_governor.c     |   21 +-
> >  drivers/cpufreq/cpufreq_governor.h     |    2 
> >  drivers/cpufreq/cpufreq_ondemand.c     |    3 
> >  drivers/cpufreq/cpufreq_schedutil.c    |  253 +++++++++++++++++++++++++++++++++
> >  7 files changed, 288 insertions(+), 10 deletions(-)
> > 
> > Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
> > +++ linux-pm/drivers/cpufreq/cpufreq_governor.h
> > @@ -164,7 +164,7 @@ struct dbs_governor {
> >  	void (*free)(struct policy_dbs_info *policy_dbs);
> >  	int (*init)(struct dbs_data *dbs_data, bool notify);
> >  	void (*exit)(struct dbs_data *dbs_data, bool notify);
> > -	void (*start)(struct cpufreq_policy *policy);
> > +	bool (*start)(struct cpufreq_policy *policy);
> >  };
> >  
> >  static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
> > Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> > @@ -0,0 +1,253 @@
> > +/*
> > + * CPUFreq governor based on scheduler-provided CPU utilization data.
> > + *
> > + * Copyright (C) 2016, Intel Corporation
> > + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/percpu-defs.h>
> > +#include <linux/slab.h>
> > +
> > +#include "cpufreq_governor.h"
> > +
> > +struct sugov_policy {
> > +	struct policy_dbs_info policy_dbs;
> > +	unsigned int next_freq;
> > +	raw_spinlock_t update_lock;  /* For shared policies */
> > +};
> > +
> > +static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
> > +{
> > +	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
> > +}
> > +
> > +struct sugov_cpu {
> > +	struct update_util_data update_util;
> > +	struct policy_dbs_info *policy_dbs;
> > +	/* The fields below are only needed when sharing a policy. */
> > +	unsigned long util;
> > +	unsigned long max;
> > +	u64 last_update;
> > +};
> > +
> > +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > +
> > +/************************ Governor internals ***********************/
> > +
> > +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
> > +				    unsigned long util, unsigned long max,
> > +				    u64 last_sample_time)
> > +{
> > +	struct cpufreq_policy *policy = policy_dbs->policy;
> > +	unsigned int min_f = policy->min;
> > +	unsigned int max_f = policy->max;
> > +	unsigned int j;
> > +
> > +	if (util > max || min_f >= max_f)
> > +		return max_f;
> > +
> > +	for_each_cpu(j, policy->cpus) {
> > +		struct sugov_cpu *j_sg_cpu;
> > +		unsigned long j_util, j_max;
> > +
> > +		if (j == smp_processor_id())
> > +			continue;
> > +
> > +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> > +		/*
> > +		 * If the CPU was last updated before the previous sampling
> > +		 * time, we don't take it into account as it probably is idle
> > +		 * now.
> > +		 */
> 
> If the sampling rate is less than the tick, it seems possible that a busy CPU
> may not manage to get an update in during the current sampling period.

Right.

sampling_rate is more of a rate limit here, though, because the governor doesn't
really do any sampling (I was talking about that in the intro message at
http://marc.info/?l=linux-pm&m=145609673008122&w=2).

It was easy to re-use the existing show/store for that, so I took the shortcut,
but I'm considering introducing a new attribute with a more suitable name for
that.  That would help to avoid a couple of unclean things in patch [2/2] as
well if I'm not mistaken.

> What about checking to see if there was an update within the last tick
> period, rather than sample period?

If your rate limit is set below the rate of the updates (scheduling rate more
or less), it simply has no effect.  To me, that case doesn't require any
special care.

> There's also NO_HZ_FULL. I don't know that anyone using NO_HZ_FULL would
> want to use a governor other than performance, but I thought I'd mention
> it just in case.

Yup.  That's broken already with anything different from "performance".

I think we should fix it, but it can be taken care of later.  Plus that's one
of the things that need to be fixed in general rather than for a single
governor.

> > +		if (j_sg_cpu->last_update < last_sample_time)
> > +			continue;
> > +
> > +		j_util = j_sg_cpu->util;
> > +		j_max = j_sg_cpu->max;
> > +		if (j_util > j_max)
> > +			return max_f;
> > +
> > +		if (j_util * max > j_max * util) {
> > +			util = j_util;
> > +			max = j_max;
> > +		}
> > +	}
> > +
> > +	return min_f + util * (max_f - min_f) / max;
> > +}
> > +
> > +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
> > +				unsigned int next_freq)
> > +{
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +
> > +	policy_dbs->last_sample_time = time;
> > +	if (sg_policy->next_freq != next_freq) {
> > +		sg_policy->next_freq = next_freq;
> > +		policy_dbs->work_in_progress = true;
> > +		irq_work_queue(&policy_dbs->irq_work);
> > +	}
> > +}
> > +
> > +static void sugov_update_shared(struct update_util_data *data, u64 time,
> > +				unsigned long util, unsigned long max)
> > +{
> > +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> > +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +	unsigned int next_f;
> > +	u64 delta_ns, lst;
> > +
> > +	raw_spin_lock(&sg_policy->update_lock);
> > +
> > +	sg_cpu->util = util;
> > +	sg_cpu->max = max;
> > +	sg_cpu->last_update = time;
> > +
> > +	if (policy_dbs->work_in_progress)
> > +		goto out;
> > +
> > +	/*
> > +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> > +	 */
> > +	lst = policy_dbs->last_sample_time;
> > +	delta_ns = time - lst;
> > +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> > +		goto out;
> > +
> > +	next_f = sugov_next_freq(policy_dbs, util, max, lst);
> > +
> > +	sugov_update_commit(policy_dbs, time, next_f);
> > +
> > + out:
> > +	raw_spin_unlock(&sg_policy->update_lock);
> > +}
> > +
> > +static void sugov_update_single(struct update_util_data *data, u64 time,
> > +				unsigned long util, unsigned long max)
> > +{
> > +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> > +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> > +	unsigned int min_f, max_f, next_f;
> > +	u64 delta_ns;
> > +
> > +	if (policy_dbs->work_in_progress)
> > +		return;
> > +
> > +	/*
> > +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> > +	 */
> > +	delta_ns = time - policy_dbs->last_sample_time;
> > +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> > +		return;
> > +
> > +	min_f = policy_dbs->policy->min;
> > +	max_f = policy_dbs->policy->max;
> > +	next_f = util > max || min_f >= max_f ? max_f :
> > +			min_f + util * (max_f - min_f) / max;
> > +
> > +	sugov_update_commit(policy_dbs, time, next_f);
> > +}
> > +
> > +/************************** sysfs interface ************************/
> > +
> > +gov_show_one_common(sampling_rate);
> > +
> > +gov_attr_rw(sampling_rate);
> > +
> > +static struct attribute *sugov_attributes[] = {
> > +	&sampling_rate.attr,
> > +	NULL
> > +};
> > +
> > +/********************** dbs_governor interface *********************/
> > +
> > +static struct policy_dbs_info *sugov_alloc(void)
> > +{
> > +	struct sugov_policy *sg_policy;
> > +
> > +	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> > +	if (!sg_policy)
> > +		return NULL;
> > +
> > +	raw_spin_lock_init(&sg_policy->update_lock);
> > +	return &sg_policy->policy_dbs;
> > +}
> > +
> > +static void sugov_free(struct policy_dbs_info *policy_dbs)
> > +{
> > +	kfree(to_sg_policy(policy_dbs));
> 
> Is it possible that a CPU could still be in the sugov_update_* path
> above when this runs?

No, it isn't.

> I see that cpufreq_governor_stop will call
> gov_cancel_work which clears the update_util hook, but I don't see
> anything that synchronizes with all CPUs having exited the update path.

gov_cancel_work() calls gov_clear_update_util() which clears the update_util
pointers for all policy CPUs and invokes synchronize_rcu() (or synchronize_sched()
with my latest patch applied: https://patchwork.kernel.org/patch/8443191/).

This guarantees that sugov_update_* won't be running on any of those CPUs
when that has returned.

> > +}
> > +
> > +static bool sugov_start(struct cpufreq_policy *policy)
> > +{
> > +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +	unsigned int cpu;
> > +
> > +	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
> > +	policy_dbs->last_sample_time = 0;
> > +	sg_policy->next_freq = UINT_MAX;
> > +
> > +	for_each_cpu(cpu, policy->cpus) {
> > +		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> > +
> > +		sg_cpu->policy_dbs = policy_dbs;
> > +		if (policy_dbs->is_shared) {
> > +			sg_cpu->util = ULONG_MAX;
> > +			sg_cpu->max = 0;
> > +			sg_cpu->last_update = 0;
> > +			sg_cpu->update_util.func = sugov_update_shared;
> > +		} else {
> > +			sg_cpu->update_util.func = sugov_update_single;
> > +		}
> > +		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
> > +	}
> > +	return false;
> > +}
> > +
> > +static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
> > +{
> > +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +
> > +	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
> 
> Similar to the concern raised in the acpi changes, I think this should be
> CPUFREQ_RELATION_L; otherwise you may not run fast enough to meet demand.

"ondemand" uses CPUFREQ_RELATION_C when powersave_bias is unset and that's why
I used it here.

Like I said in my reply to Peter in that thread, using RELATION_L here is likely
to make us avoid the min frequency almost entirely even if the system is almost
completely idle.  I don't think that would be OK really.

That said my opinion about this particular item isn't really strong.

> > +	return policy_dbs->dbs_data->sampling_rate;
> > +}
> > +
> > +static struct dbs_governor schedutil_gov = {
> > +	.gov = {
> > +		.name = "schedutil",
> > +		.governor = cpufreq_governor_dbs,
> > +		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
> > +		.owner = THIS_MODULE,
> > +	},
> > +	.kobj_type = { .default_attrs = sugov_attributes },
> > +	.gov_dbs_timer = sugov_set_freq,
> > +	.alloc = sugov_alloc,
> > +	.free = sugov_free,
> > +	.start = sugov_start,
> > +};
> > +
> > +#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
> > +
> > +static int __init sugov_init(void)
> > +{
> > +	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
> > +}
> > +
> > +static void __exit sugov_exit(void)
> > +{
> > +	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
> > +}
> > +
> > +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
> > +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> > +MODULE_LICENSE("GPL");
> > +
> > +module_init(sugov_init);
> > +module_exit(sugov_exit);
> > Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
> > +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
> > @@ -299,12 +299,13 @@ static void cs_exit(struct dbs_data *dbs
> >  	kfree(dbs_data->tuners);
> >  }
> >  
> > -static void cs_start(struct cpufreq_policy *policy)
> > +static bool cs_start(struct cpufreq_policy *policy)
> >  {
> >  	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
> >  
> >  	dbs_info->down_skip = 0;
> >  	dbs_info->requested_freq = policy->cur;
> > +	return true;
> >  }
> >  
> >  static struct dbs_governor cs_dbs_gov = {
> > Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> > +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> > @@ -472,9 +472,11 @@ static int cpufreq_governor_init(struct
> >  	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
> >  	mutex_init(&dbs_data->mutex);
> >  
> > -	ret = gov->init(dbs_data, !policy->governor->initialized);
> > -	if (ret)
> > -		goto free_policy_dbs_info;
> > +	if (gov->init) {
> > +		ret = gov->init(dbs_data, !policy->governor->initialized);
> > +		if (ret)
> > +			goto free_policy_dbs_info;
> > +	}
> >  
> >  	/* policy latency is in ns. Convert it to us first */
> >  	latency = policy->cpuinfo.transition_latency / 1000;
> > @@ -510,7 +512,10 @@ static int cpufreq_governor_init(struct
> >  
> >  	if (!have_governor_per_policy())
> >  		gov->gdbs_data = NULL;
> > -	gov->exit(dbs_data, !policy->governor->initialized);
> > +
> > +	if (gov->exit)
> > +		gov->exit(dbs_data, !policy->governor->initialized);
> > +
> >  	kfree(dbs_data);
> >  
> >  free_policy_dbs_info:
> > @@ -544,7 +549,9 @@ static int cpufreq_governor_exit(struct
> >  		if (!have_governor_per_policy())
> >  			gov->gdbs_data = NULL;
> >  
> > -		gov->exit(dbs_data, policy->governor->initialized == 1);
> > +		if (gov->exit)
> > +			gov->exit(dbs_data, policy->governor->initialized == 1);
> > +
> >  		mutex_destroy(&dbs_data->mutex);
> >  		kfree(dbs_data);
> >  	} else {
> > @@ -588,9 +595,9 @@ static int cpufreq_governor_start(struct
> >  			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
> >  	}
> >  
> > -	gov->start(policy);
> > +	if (gov->start(policy))
> > +		gov_set_update_util(policy_dbs, sampling_rate);
> >  
> > -	gov_set_update_util(policy_dbs, sampling_rate);
> >  	return 0;
> >  }
> >  
> > Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
> > +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> > @@ -402,12 +402,13 @@ static void od_exit(struct dbs_data *dbs
> >  	kfree(dbs_data->tuners);
> >  }
> >  
> > -static void od_start(struct cpufreq_policy *policy)
> > +static bool od_start(struct cpufreq_policy *policy)
> >  {
> >  	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
> >  
> >  	dbs_info->sample_type = OD_NORMAL_SAMPLE;
> >  	ondemand_powersave_bias_init(policy);
> > +	return true;
> >  }
> >  
> >  static struct od_ops od_ops = {
> > Index: linux-pm/drivers/cpufreq/Kconfig
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/Kconfig
> > +++ linux-pm/drivers/cpufreq/Kconfig
> > @@ -184,6 +184,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
> >  
> >  	  If in doubt, say N.
> >  
> > +config CPU_FREQ_GOV_SCHEDUTIL
> > +	tristate "'schedutil' cpufreq policy governor"
> > +	depends on CPU_FREQ
> 
> do you need to select IRQ_WORK here?

No, I don't.

It already is selected by CPU_FREQ in linux-next, but it really should be
selected by CPU_FREQ_GOV_COMMON, so let me cut a patch to make that change.

> > +	select CPU_FREQ_GOV_COMMON
> > +	help
> > +	  The frequency selection formula used by this governor is analogous
> > +	  to the one used by 'ondemand', but instead of computing CPU load
> > +	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
> > +	  utilization data provided by the scheduler as input.
> > +
> > +	  To compile this driver as a module, choose M here: the
> > +	  module will be called cpufreq_schedutil.
> > +
> > +	  If in doubt, say N.
> > +
> >  comment "CPU frequency scaling drivers"
> >  
> >  config CPUFREQ_DT
> > Index: linux-pm/drivers/cpufreq/Makefile
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/Makefile
> > +++ linux-pm/drivers/cpufreq/Makefile
> > @@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
> >  obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
> >  obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
> >  obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
> > +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
> >  obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
> >  
> >  obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
> > 

Thanks for your comments,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-02-27 15:24         ` Rafael J. Wysocki
@ 2016-03-01  4:10           ` Steve Muckle
  2016-03-01 20:20             ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Muckle @ 2016-03-01  4:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Ingo Molnar

On 02/27/2016 07:24 AM, Rafael J. Wysocki wrote:
>>> +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
>>> +				    unsigned long util, unsigned long max,
>>> +				    u64 last_sample_time)
>>> +{
>>> +	struct cpufreq_policy *policy = policy_dbs->policy;
>>> +	unsigned int min_f = policy->min;
>>> +	unsigned int max_f = policy->max;
>>> +	unsigned int j;
>>> +
>>> +	if (util > max || min_f >= max_f)
>>> +		return max_f;
>>> +
>>> +	for_each_cpu(j, policy->cpus) {
>>> +		struct sugov_cpu *j_sg_cpu;
>>> +		unsigned long j_util, j_max;
>>> +
>>> +		if (j == smp_processor_id())
>>> +			continue;
>>> +
>>> +		j_sg_cpu = &per_cpu(sugov_cpu, j);
>>> +		/*
>>> +		 * If the CPU was last updated before the previous sampling
>>> +		 * time, we don't take it into account as it probably is idle
>>> +		 * now.
>>> +		 */
>>
>> If the sampling rate is less than the tick, it seems possible a busy CPU
>> may not manage to get an update in within the current sampling period.
> 
> Right.
> 
> sampling_rate is more of a rate limit here, though, because the governor doesn't
> really do any sampling (I was talking about that in the intro message at
> http://marc.info/?l=linux-pm&m=145609673008122&w=2).
> 
> It was easy to re-use the existing show/store for that, so I took the shortcut,
> but I'm considering introducing a new attribute with a more suitable name for
> that.  That would help to avoid a couple of unclean things in patch [2/2] as
> well if I'm not mistaken.
> 
>> What about checking to see if there was an update within the last tick
>> period, rather than sample period?
> 
> If your rate limit is set below the rate of the updates (scheduling rate more
> or less), it simply has no effect.  To me, that case doesn't require any
> special care.

I'm specifically worried about the check below where we omit a CPU's
capacity request if its last update came before the last sample time.

Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
delay here is 4ms. At t=0ms CPU0 starts running a CPU hog and reports
being fully busy. At t=5ms CPU1 starts running a task - CPU0's vote is
kept and factored in and last_sample_time is updated to t=5ms. At t=9ms
CPU1 stops running the task and updates again, but this time CPU0's vote
is discarded because its last update is t=0ms, and last_sample_time is
5ms. But CPU0 is fully busy and the freq is erroneously dropped.

> 
>>> +		if (j_sg_cpu->last_update < last_sample_time)
>>> +			continue;
>>> +
>>> +		j_util = j_sg_cpu->util;
>>> +		j_max = j_sg_cpu->max;
>>> +		if (j_util > j_max)
>>> +			return max_f;
>>> +
>>> +		if (j_util * max > j_max * util) {
>>> +			util = j_util;
>>> +			max = j_max;
>>> +		}
>>> +	}
>>> +
>>> +	return min_f + util * (max_f - min_f) / max;
>>> +}
...
>>> +static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
>>> +{
>>> +	struct policy_dbs_info *policy_dbs = policy->governor_data;
>>> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
>>> +
>>> +	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
>>
>> Similar to the concern raised in the acpi changes I think this should be
>> CPUFREQ_RELATION_L, otherwise you may not run fast enough to meet demand.
> 
> "ondemand" uses CPUFREQ_RELATION_C when powersave_bias is unset and that's why
> I used it here.
> 
> Like I said in my reply to Peter in that thread, using RELATION_L here is likely
> to make us avoid the min frequency almost entirely even if the system is almost
> completely idle.  I don't think that would be OK really.
> 
> That said my opinion about this particular item isn't really strong.

I think the calculation for required CPU bandwidth needs tweaking.

Aside from always wanting something past fmin, currently the amount of
extra CPU capacity given for a particular % utilization depends on how
high the platform's fmin happens to be, even if the fmax speeds are the
same. For example given two platforms with the following available
frequencies (MHz):

platform A: 100, 300, 500, 700, 900, 1100
platform B: 500, 700, 900, 1100

A 50% utilization load on platform A will want 600 MHz (rounding up to
700 MHz perhaps) whereas platform B will want 800 MHz (again likely
rounding up to 900 MHz), even though the load consumes 550 MHz on both
platforms.

One possibility would be something like we had in schedfreq, getting the
absolute CPU bw requirement (util/max) * fmax and then adding some %
margin, which I think is more consistent. It is true that it means
figuring out what the right margin is and now there's a magic number
(and potentially a tunable), but it would be more consistent.

thanks,
Steve


* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-02-26  2:36         ` Rafael J. Wysocki
@ 2016-03-01 14:56           ` Juri Lelli
  2016-03-01 20:26             ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Lelli @ 2016-03-01 14:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Steve Muckle,
	Ingo Molnar

On 26/02/16 03:36, Rafael J. Wysocki wrote:
> On Thursday, February 25, 2016 11:01:20 AM Juri Lelli wrote:

[...]

> > 
> > That is right. But can't a higher priority class eat all the needed
> > capacity? I mean, suppose that both CFS and DL need 30% of CPU capacity
> > on the same CPU. DL wins and gets its 30% of capacity. When CFS gets to
> > run it's too late for requesting anything more (w.r.t. the same time
> > window). If we somehow aggregate requests instead, we could request 60%
> > and both classes can have their capacity to run. It seems to me that
> > this is what governors were already doing by using the 1 - idle metric.
> 
> That's interesting, because it is about a few different things at a time. :-)
> 
> So first of all the "old" governors only collect information about what
> happened in the past and make decisions on that basis (kind of in the hope
> that what happened once will happen again), while the idea behind what
> you're describing seems to be to attempt to project future needs for
> capacity and use that to make decisions (just for the very near future,
> but that should be sufficient).  If successful, that would be the most
> suitable approach IMO.
> 

Right, this is a key difference.

> Of course, the $subject patch is not aspiring to anything of that kind.
> It only uses information about current needs that's already available to
> it in a very straightforward way.
> 

But, using utilization of CFS tasks (based on PELT) has already some
notion of "future needs" (even if it is true that tasks might have
phases). And this will be true for DL as well, once we will have a
corresponding utilization signal that we can consume. I think you are
already consuming information about the future in some sense. :-)

> But there's more to it.  In the sampling, or rate-limiting if you will,
> situation you really have a window in which many things can happen and
> making a good decision at the beginning of it is important.  However, if
> you just can handle *every* request and really switch frequencies on the
> fly, then each of them may come with a "currently needed capacity" number
> and you can just give it what it asks for every time.
> 

True. Rate-limiting poses interesting problems.

> My point is that there are quite a few things to consider here and I'm
> expecting a learning process to happen before we are happy with what we
> have.  So my approach would be (and is) to start very simple and then
> add more complexity over time as needed instead of just trying to address
> every issue I can think about from the outset.
> 

I perfectly understand that, and I agree that there is value in starting
simple. I simply fear that aggregation of utilization signals will be one
of the few things that will pop out fairly soon. :-)

Best,

- Juri


* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-03-01  4:10           ` Steve Muckle
@ 2016-03-01 20:20             ` Rafael J. Wysocki
  2016-03-03  3:20               ` Steve Muckle
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-03-01 20:20 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli,
	Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Ingo Molnar

On Tue, Mar 1, 2016 at 5:10 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/27/2016 07:24 AM, Rafael J. Wysocki wrote:
>>>> +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
>>>> +                               unsigned long util, unsigned long max,
>>>> +                               u64 last_sample_time)
>>>> +{
>>>> +   struct cpufreq_policy *policy = policy_dbs->policy;
>>>> +   unsigned int min_f = policy->min;
>>>> +   unsigned int max_f = policy->max;
>>>> +   unsigned int j;
>>>> +
>>>> +   if (util > max || min_f >= max_f)
>>>> +           return max_f;
>>>> +
>>>> +   for_each_cpu(j, policy->cpus) {
>>>> +           struct sugov_cpu *j_sg_cpu;
>>>> +           unsigned long j_util, j_max;
>>>> +
>>>> +           if (j == smp_processor_id())
>>>> +                   continue;
>>>> +
>>>> +           j_sg_cpu = &per_cpu(sugov_cpu, j);
>>>> +           /*
>>>> +            * If the CPU was last updated before the previous sampling
>>>> +            * time, we don't take it into account as it probably is idle
>>>> +            * now.
>>>> +            */
>>>
>>> If the sampling rate is less than the tick, it seems possible a busy CPU
>>> may not manage to get an update in within the current sampling period.
>>
>> Right.
>>
>> sampling_rate is more of a rate limit here, though, because the governor doesn't
>> really do any sampling (I was talking about that in the intro message at
>> http://marc.info/?l=linux-pm&m=145609673008122&w=2).
>>
>> It was easy to re-use the existing show/store for that, so I took the shortcut,
>> but I'm considering introducing a new attribute with a more suitable name for
>> that.  That would help to avoid a couple of unclean things in patch [2/2] as
>> well if I'm not mistaken.
>>
>>> What about checking to see if there was an update within the last tick
>>> period, rather than sample period?
>>
>> If your rate limit is set below the rate of the updates (scheduling rate more
>> or less), it simply has no effect.  To me, that case doesn't require any
>> special care.
>
> I'm specifically worried about the check below where we omit a CPU's
> capacity request if its last update came before the last sample time.
>
> Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
> delay here is 4ms.

Yes, that's the case I clearly didn't take into consideration. :-)

My assumption was that the sample delay would always be greater than
the typical update rate which of course need not be the case.

The reason I added the check at all was that the numbers from the
other CPUs may become stale if those CPUs are idle for too long, so at
one point the contributions from them need to be discarded.  Question
is when that point is and since sample delay may be arbitrary, that
mechanism has to be more complex.

> At t=0ms CPU0 starts running a CPU hog and reports
> being fully busy. At t=5ms CPU1 starts running a task - CPU0's vote is
> kept and factored in and last_sample_time is updated to t=5ms. At t=9ms
> CPU1 stops running the task and updates again, but this time CPU0's vote
> is discarded because its last update is t=0ms, and last_sample_time is
> 5ms. But CPU0 is fully busy and the freq is erroneously dropped.
>

[cut]

>>
>> Like I said in my reply to Peter in that thread, using RELATION_L here is likely
>> to make us avoid the min frequency almost entirely even if the system is almost
>> completely idle.  I don't think that would be OK really.
>>
>> That said my opinion about this particular item isn't really strong.
>
> I think the calculation for required CPU bandwidth needs tweaking.

The reason why I used that particular formula was that ondemand used
it.  Of course, the input to it is different in ondemand, but the idea
here is to avoid departing from it too much.

> Aside from always wanting something past fmin, currently the amount of
> extra CPU capacity given for a particular % utilization depends on how
> high the platform's fmin happens to be, even if the fmax speeds are the
> same. For example given two platforms with the following available
> frequencies (MHz):
>
> platform A: 100, 300, 500, 700, 900, 1100
> platform B: 500, 700, 900, 1100

The frequencies may not determine raw performance, though, so 500 MHz
in platform A may correspond to 700 MHz in platform B.  You never
know.

>
> A 50% utilization load on platform A will want 600 MHz (rounding up to
> 700 MHz perhaps) whereas platform B will want 800 MHz (again likely
> rounding up to 900 MHz), even though the load consumes 550 MHz on both
> platforms.
>
> One possibility would be something like we had in schedfreq, getting the
> absolute CPU bw requirement (util/max) * fmax and then adding some %
> margin, which I think is more consistent. It is true that it means
> figuring out what the right margin is and now there's a magic number
> (and potentially a tunable), but it would be more consistent.
>

What the picture is missing is the information on how much more
performance you get by running in a higher P-state (or OPP if you
will).  We don't have that information, however, and relying on
frequency values here generally doesn't help.

Moreover, since 0 utilization gets you to run in f_min no matter what,
if you treat f_max as an absolute, you're going to underutilize the
P-states in the upper half of the available range.

Thanks,
Rafael


* Re: [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler
  2016-03-01 14:56           ` Juri Lelli
@ 2016-03-01 20:26             ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-03-01 20:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Steve Muckle, Ingo Molnar

On Tue, Mar 1, 2016 at 3:56 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> On 26/02/16 03:36, Rafael J. Wysocki wrote:
>> On Thursday, February 25, 2016 11:01:20 AM Juri Lelli wrote:
>
> [...]
>
>> >
>> > That is right. But can't a higher priority class eat all the needed
>> > capacity? I mean, suppose that both CFS and DL need 30% of CPU capacity
>> > on the same CPU. DL wins and gets its 30% of capacity. When CFS gets to
>> > run it's too late for requesting anything more (w.r.t. the same time
>> > window). If we somehow aggregate requests instead, we could request 60%
>> > and both classes can have their capacity to run. It seems to me that
>> > this is what governors were already doing by using the 1 - idle metric.
>>
>> That's interesting, because it is about a few different things at a time. :-)
>>
>> So first of all the "old" governors only collect information about what
>> happened in the past and make decisions on that basis (kind of in the hope
>> that what happened once will happen again), while the idea behind what
>> you're describing seems to be to attempt to project future needs for
>> capacity and use that to make decisions (just for the very near future,
>> but that should be sufficient).  If successful, that would be the most
>> suitable approach IMO.
>>
>
> Right, this is a key difference.
>
>> Of course, the $subject patch is not aspiring to anything of that kind.
>> It only uses information about current needs that's already available to
>> it in a very straightforward way.
>>
>
> But, using utilization of CFS tasks (based on PELT) has already some
> notion of "future needs" (even if it is true that tasks might have
> phases). And this will be true for DL as well, once we will have a
> corresponding utilization signal that we can consume. I think you are
> already consuming information about the future in some sense. :-)

That's because the already available numbers include that information.
I don't do any projections myself.

>> But there's more to it.  In the sampling, or rate-limiting if you will,
>> situation you really have a window in which many things can happen and
>> making a good decision at the beginning of it is important.  However, if
>> you just can handle *every* request and really switch frequencies on the
>> fly, then each of them may come with a "currently needed capacity" number
>> and you can just give it what it asks for every time.
>>
>
> True. Rate-limiting poses interesting problems.
>
>> My point is that there are quite a few things to consider here and I'm
>> expecting a learning process to happen before we are happy with what we
>> have.  So my approach would be (and is) to start very simple and then
>> add more complexity over time as needed instead of just trying to address
>> every issue I can think about from the outset.
>>
>
> I perfectly understand that, and I agree that there is value in starting
> simple. I simply fear that aggregation of utilization signals will be one
> of the few things that will pop out fairly soon. :-)

That's OK.  If it is demonstrably better than the super-simple initial
approach, there won't be any reason to reject it.

Thanks,
Rafael


* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-03-01 20:20             ` Rafael J. Wysocki
@ 2016-03-03  3:20               ` Steve Muckle
  2016-03-03  3:35                 ` Steve Muckle
  2016-03-03 19:20                 ` Rafael J. Wysocki
  0 siblings, 2 replies; 32+ messages in thread
From: Steve Muckle @ 2016-03-03  3:20 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli,
	Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Ingo Molnar

On 03/01/2016 12:20 PM, Rafael J. Wysocki wrote:
>> I'm specifically worried about the check below where we omit a CPU's
>> capacity request if its last update came before the last sample time.
>>
>> Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
>> delay here is 4ms.
> 
> Yes, that's the case I clearly didn't take into consideration. :-)
> 
> My assumption was that the sample delay would always be greater than
> the typical update rate which of course need not be the case.
> 
> The reason I added the check at all was that the numbers from the
> other CPUs may become stale if those CPUs are idle for too long, so at
> one point the contributions from them need to be discarded.  Question
> is when that point is and since sample delay may be arbitrary, that
> mechanism has to be more complex.

Yeah this has been an open issue on our end as well. Sampling-based
governors of course solved this primarily via their fundamental nature
and sampling rate. The interactive governor also has a separate tunable
IIRC which specified how long a CPU may have its sampling timer deferred
due to idle when running @ > fmin (the "slack timer").

Decoupling the CPU update staleness limit from the freq change rate
limit via a separate tunable would be valuable IMO. Would you be
amenable to a patch that did that?

>>> Like I said in my reply to Peter in that thread, using RELATION_L here is likely
>>> to make us avoid the min frequency almost entirely even if the system is almost
>>> completely idle.  I don't think that would be OK really.
>>>
>>> That said my opinion about this particular item isn't really strong.
>>
>> I think the calculation for required CPU bandwidth needs tweaking.
> 
> The reason why I used that particular formula was that ondemand used
> it.  Of course, the input to it is different in ondemand, but the idea
> here is to avoid departing from it too much.
> 
>> Aside from always wanting something past fmin, currently the amount of
>> extra CPU capacity given for a particular % utilization depends on how
>> high the platform's fmin happens to be, even if the fmax speeds are the
>> same. For example given two platforms with the following available
>> frequencies (MHz):
>>
>> platform A: 100, 300, 500, 700, 900, 1100
>> platform B: 500, 700, 900, 1100
> 
> The frequencies may not determine raw performance, though, so 500 MHz
> in platform A may correspond to 700 MHz in platform B.  You never
> know.

My example here was solely intended to illustrate that the current
algorithm itself introduces an inconsistency in policy when other things
are equal. Depending on the fmin value, this ondemand-style calculation
will give a more or less generous amount of CPU bandwidth headroom to a
platform with a higher fmin.

It'd be good to be able to express the desired amount of CPU bandwidth
headroom in such a way that it doesn't depend on the platform's fmin
value, since CPU headroom is a critical factor in tuning a platform's
governor for optimal power and performance.

> 
>>
>> A 50% utilization load on platform A will want 600 MHz (rounding up to
>> 700 MHz perhaps) whereas platform B will want 800 MHz (again likely
>> rounding up to 900 MHz), even though the load consumes 550 MHz on both
>> platforms.
>>
>> One possibility would be something like we had in schedfreq, getting the
>> absolute CPU bw requirement (util/max) * fmax and then adding some %
>> margin, which I think is more consistent. It is true that it means
>> figuring out what the right margin is and now there's a magic number
>> (and potentially a tunable), but it would be more consistent.
>>
> 
> What the picture is missing is the information on how much more
> performance you get by running in a higher P-state (or OPP if you
> will).  We don't have that information, however, and relying on
> frequency values here generally doesn't help.

Why does the frequency value not help? It is true there may be issues of
a workload being memory bound and not responding quite linearly to
increasing frequency, but that would pose a problem for the current
algorithm also. Surely it's better to attempt a consistent policy which
doesn't vary based on a platform's fmin value?

> Moreover, since 0 utilization gets you to run in f_min no matter what,
> if you treat f_max as an absolute, you're going to underutilize the
> P-states in the upper half of the available range.

Sorry I didn't follow. What do you mean by underutilize the upper half
of the range? I don't see how using RELATION_L with (util/max) * fmax *
(headroom) wouldn't be correct in that regard.

thanks,
Steve


* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-03-03  3:20               ` Steve Muckle
@ 2016-03-03  3:35                 ` Steve Muckle
  2016-03-03 19:20                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 32+ messages in thread
From: Steve Muckle @ 2016-03-03  3:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli,
	Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Ingo Molnar

On 03/02/2016 07:20 PM, Steve Muckle wrote:
> Why does the frequency value not help? It is true there may be issues of
> a workload being memory bound and not responding quite linearly to
> increasing frequency, but that would pose a problem for the current
> algorithm also. Surely it's better to attempt a consistent policy which
> doesn't vary based on a platform's fmin value?

FWIW I'm not trying to hold up this series - rather just discuss
possibilities and differences with the now deprecated solution that may
be able to be integrated here sometime in the near future.


* Re: [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data
  2016-02-21 23:16 [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
  2016-02-21 23:18 ` [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
  2016-02-24  1:20 ` [RFC/RFT][PATCH v2 0/2] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
@ 2016-03-03 14:27 ` Ingo Molnar
  2016-03-03 17:15   ` Rafael J. Wysocki
  2 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2016-03-03 14:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Juri Lelli, Linux Kernel Mailing List,
	Viresh Kumar, Srinivas Pandruvada, Peter Zijlstra, Steve Muckle


So I wanted to give you some feedback for this, from the scheduler maintainer's 
POV.

Looks like there are two cpufreq modernization efforts, one is this series, the 
other is Steve Muckle's:

  [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection

What I'd like to see from a scheduler metrics usage POV is a single central place, 
kernel/sched/cpufreq.c, where all the high level ('governor') decisions are made.  
This is the approach Steve's series takes.

That is a central point that has ready access to the scheduler internal 
utilization metrics.

drivers/cpufreq/ would contain legacy governors plus low level drivers that do 
frequency switching with a well-defined interface.

Could you guys work out a single series that implements the sum of the two series? 
Looks like we are 90% 'there' already.

Thanks,

	Ingo


* Re: [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data
  2016-03-03 14:27 ` [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Ingo Molnar
@ 2016-03-03 17:15   ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-03-03 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli,
	Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Steve Muckle, Thomas Gleixner

On Thu, Mar 3, 2016 at 3:27 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So I wanted to give you some feedback for this, from the scheduler maintainer's
> POV.
>
> Looks like there are two cpufreq modernization efforts, one is this series, the
> other is Steve Muckle's:
>
>   [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection
>
> What I'd like to see from a scheduler metrics usage POV is a single central place,
> kernel/sched/cpufreq.c, where all the high level ('governor') decisions are made.
> This is the approach Steve's series takes.

The difference between this series and Steve's in that respect
is only the place where the new governor goes.  I can put it into
kernel/sched/ if you want me to, but it still will depend on some
things under drivers/cpufreq/.

> That is a central point that has ready access to the scheduler internal
> utilization metrics.
>
> drivers/cpufreq/ would contain legacy governors plus low level drivers that do
> frequency switching with a well-defined interface.
>
> Could you guys work out a single series that implements the sum of the two series?
> Looks like we are 90% 'there' already.

I'd like to have a clear picture of what you want here, so let me use
the opportunity to ask you about things.

I've CCed you on many occasions during this discussion and you have
been silent till now, so I have assumed that you have no objections.
From what you're saying now, it looks like that may not be the case,
though.

I have a bunch of changes queued up for the next cycle that depend on
things in cpufreq in general to be called from the scheduler on a
regular basis instead of using timers.  There are two reasons for
that: first, having to set up a timer for every CPU every 10 ms or so
is quite a bit of overhead and Thomas doesn't like that from the timer
wheel management perspective and, second, getting rid of those timers
allows quite some irritating bugs in cpufreq to be fixed.  That's why
there is a metric ton of cpufreq cleanups and fixes on top of that in
my tree.

However, that requires an interface for cpufreq governors to provide
callbacks to be invoked from the scheduler.  Peter suggested to me how
that could be done and those callbacks get the scheduler utilization
numbers as arguments.  From what you're saying now it seems to me that
you may not agree with that approach.  It looks like you would prefer
it if the utilization numbers were not passed to those callbacks
unless they have been provided by the new "scheduler" governor which
then would reside under kernel/sched/, so there's a clear interface
separation between the "old style" cpufreq governors and the
scheduler.

Am I reading this correctly?

Thanks,
Rafael


* Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler
  2016-03-03  3:20               ` Steve Muckle
  2016-03-03  3:35                 ` Steve Muckle
@ 2016-03-03 19:20                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2016-03-03 19:20 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli,
	Linux Kernel Mailing List, Viresh Kumar, Srinivas Pandruvada,
	Peter Zijlstra, Ingo Molnar

On Thu, Mar 3, 2016 at 4:20 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 03/01/2016 12:20 PM, Rafael J. Wysocki wrote:
>>> I'm specifically worried about the check below where we omit a CPU's
>>> capacity request if its last update came before the last sample time.
>>>
>>> Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
>>> delay here is 4ms.
>>
>> Yes, that's the case I clearly didn't take into consideration. :-)
>>
>> My assumption was that the sample delay would always be greater than
>> the typical update rate, which of course need not be the case.
>>
>> The reason I added the check at all was that the numbers from the
>> other CPUs may become stale if those CPUs are idle for too long, so at
>> one point the contributions from them need to be discarded.  The
>> question is when that point is, and since the sample delay may be
>> arbitrary, that mechanism has to be more complex.
>
> Yeah this has been an open issue on our end as well. Sampling-based
> governors of course solved this primarily via their fundamental nature
> and sampling rate. The interactive governor also has a separate tunable
> IIRC which specified how long a CPU may have its sampling timer deferred
> due to idle when running @ > fmin (the "slack timer").
>
> Decoupling the CPU update staleness limit from the freq change rate
> limit via a separate tunable would be valuable IMO. Would you be
> amenable to a patch that did that?

Yes, I would.

It still would be better, though, if that didn't have to be a tunable.

What do you think about my idea to use NSEC_PER_SEC / HZ as the
staleness limit (like in https://patchwork.kernel.org/patch/8477261/)?

[cut]

>> Moreover, since 0 utilization gets you to run in f_min no matter what,
>> if you treat f_max as an absolute, you're going to underutilize the
>> P-states in the upper half of the available range.
>
> Sorry I didn't follow. What do you mean by underutilize the upper half
> of the range? I don't see how using RELATION_L with (util/max) * fmax *
> (headroom) wouldn't be correct in that regard.

Suppose all of the util values from 0 to max are equally probable (or
equally frequent) and the available frequencies are close enough to
each other that it doesn't really matter whether _C or _L is used.

Say f_min is 400 and f_max is 1000.

Then, if you take next_freq = f_max * util / max, 50% of requests will
fall into the 400-500 section of the available frequency range.  Of
course, 40% of them will fall to f_min, but that means that the other
available states will be used less frequently, on average.
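
The arithmetic can be checked with a small sketch (plain C, helper names
made up; util is taken as uniform over 0..max, f_min = 400, f_max = 1000
as above):

```c
/* next_freq = f_max * util / max, clamped at f_min from below. */
static int next_freq(int util, int max, int f_min, int f_max)
{
	int freq = f_max * util / max;

	return freq < f_min ? f_min : freq;
}

/* Percentage of uniformly distributed util values whose request lands
 * at or below 'bound'. */
static int percent_at_or_below(int bound, int max, int f_min, int f_max)
{
	int hits = 0;

	for (int util = 0; util <= max; util++)
		if (next_freq(util, max, f_min, f_max) <= bound)
			hits++;
	return 100 * hits / (max + 1);
}
```

Counting this way, about 50% of requests land at or below 500 and about
40% clamp to f_min = 400, so only half of the requests are spread over
the 500-1000 part of the range.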

I would prefer that to be more balanced.

Thanks,
Rafael


end of thread, other threads:[~2016-03-03 19:20 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-21 23:16 [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
2016-02-21 23:18 ` [RFC/RFT][PATCH 1/1] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
2016-02-22 14:16   ` Juri Lelli
2016-02-22 23:02     ` Rafael J. Wysocki
2016-02-23  7:20       ` Steve Muckle
2016-02-24  1:38         ` Rafael J. Wysocki
2016-02-25 11:01       ` Juri Lelli
2016-02-26  2:36         ` Rafael J. Wysocki
2016-03-01 14:56           ` Juri Lelli
2016-03-01 20:26             ` Rafael J. Wysocki
2016-02-24  1:20 ` [RFC/RFT][PATCH v2 0/2] cpufreq: New governor based on scheduler-provided utilization data Rafael J. Wysocki
2016-02-24  1:22   ` [RFC/RFT][PATCH v2 1/2] cpufreq: New governor using utilization data from the scheduler Rafael J. Wysocki
2016-02-25 21:14     ` [RFC/RFT][PATCH v4 " Rafael J. Wysocki
2016-02-27  0:21       ` Rafael J. Wysocki
2016-02-27  4:33       ` Steve Muckle
2016-02-27 15:24         ` Rafael J. Wysocki
2016-03-01  4:10           ` Steve Muckle
2016-03-01 20:20             ` Rafael J. Wysocki
2016-03-03  3:20               ` Steve Muckle
2016-03-03  3:35                 ` Steve Muckle
2016-03-03 19:20                 ` Rafael J. Wysocki
2016-02-24  1:28   ` [RFC/RFT][PATCH v2 2/2] cpufreq: schedutil: Switching frequencies from interrupt context Rafael J. Wysocki
2016-02-24 23:30     ` [RFC/RFT][PATCH v3 " Rafael J. Wysocki
2016-02-25  9:08       ` Peter Zijlstra
2016-02-25  9:12         ` Peter Zijlstra
2016-02-25 11:11           ` Rafael J. Wysocki
2016-02-25 11:10         ` Rafael J. Wysocki
2016-02-25 11:52           ` Peter Zijlstra
2016-02-25 20:54             ` Rafael J. Wysocki
2016-02-25 21:20     ` [RFC/RFT][PATCH v4 " Rafael J. Wysocki
2016-03-03 14:27 ` [RFC/RFT][PATCH 0/1] cpufreq: New governor based on scheduler-provided utilization data Ingo Molnar
2016-03-03 17:15   ` Rafael J. Wysocki
