* [RFCv4 0/6] Add utilization clamping to the CPU controller
@ 2017-08-24 18:08 Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 1/6] sched/core: add utilization clamping to " Patrick Bellasi
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Patrick Bellasi @ 2017-08-24 18:08 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
	Andres Oportus, Joel Fernandes, Viresh Kumar

Was:
 - RFCv3: Add capacity capping support to the CPU controller
 - RFCv2: SchedTune: central, scheduler-driven, power-performance control

This is a respin of the series implementing support for per-task
boosting and capping of the CPU frequency. This new version addresses most
of the comments collected since the last posting on LKML [1] and from the
discussions at the OSPM Summit [2].

Hereafter is a short description of the main changes since the previous
posting [1].

.:: Concept: "capacity clamping" replaced by "utilization clamping"

The previous implementation was expressed in terms of "capacity
clamping", which generated some confusion, mainly because in mainline the
capacity is currently defined as a "constant property" of a CPU. Here is
an email [3] which summarizes the confusion generated by the previous
proposal.

As Peter pointed out, the goal of this proposal is to "affect" the
util_avg metric, i.e. the CPU utilization, and the way that signal is
used, for example, by schedutil. Thus, both from a conceptual and an
implementation standpoint, it actually makes a lot more sense
to talk about "utilization clamping".

In this new proposal, the pair of new attributes added to the CPU
controller allows defining the minimum and maximum utilization which
should be considered for the set of tasks in a group.
These utilization clamp values can be used, for example, to either
"boost" or "cap" the actual frequency selected by schedutil when one of
these tasks is RUNNABLE on a CPU.
A proper aggregation mechanism is also provided to handle the cases
where tasks with different utilization clamp values are co-scheduled on
the same CPU.
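
To make the intended semantics more concrete, here is a minimal and
purely illustrative sketch (this is not code from this series; the real
schedutil integration is in the last two patches) of how a clamped
utilization could drive frequency selection:

   /*
    * Illustrative only: the utilization seen by the frequency selection
    * path is clamped between the aggregated util_min and util_max of
    * the tasks RUNNABLE on a CPU.
    */
   unsigned long uclamp_util_example(unsigned long util,
                                     unsigned long util_min,
                                     unsigned long util_max)
   {
           if (util < util_min)
                   util = util_min;   /* boost small tasks */
           if (util > util_max)
                   util = util_max;   /* cap background tasks */
           return util;               /* freq ~= max_freq * util / SCHED_CAPACITY_SCALE */
   }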

.:: Implementation: rb-trees replaced by reference counting

The previous implementation used a couple of rb-trees to aggregate the
different clamp values of tasks co-scheduled on the same CPU. Although
a simple solution from the coding standpoint, Peter pointed out that it
added non-negligible overhead to the fast path (i.e. task
enqueue/dequeue), especially on highly loaded systems.

This new implementation is based on a much more lightweight mechanism
using reference counting. The new solution just requires
{in,de}crementing an integer counter each time a task is {en,de}queued.
The most expensive operation is now a sequential scan of a small,
per-CPU array of integers, which is also defined to easily fit into a
single cache line.
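
As a rough illustration of this mechanism, here is a self-contained
userspace sketch; names, sizes and values below are invented for the
example, the actual data structures are introduced by patches 2 and 3:

   #include <stdio.h>

   #define GROUPS_COUNT 4          /* few distinct clamp values, fits a cache line */

   struct clamp_group {
           int value;              /* clamp value tracked by this slot */
           int tasks;              /* RUNNABLE tasks currently using it */
   };

   static struct clamp_group cpu_clamps[GROUPS_COUNT];
   static int cpu_clamp_value;     /* current aggregated clamp for this CPU */

   /* O(GROUPS_COUNT) rescan, needed only when the max group becomes empty */
   static void cpu_clamp_update(void)
   {
           int max = 0;

           for (int g = 0; g < GROUPS_COUNT; g++)
                   if (cpu_clamps[g].tasks > 0 && cpu_clamps[g].value > max)
                           max = cpu_clamps[g].value;
           cpu_clamp_value = max;
   }

   static void clamp_enqueue(int group_id)
   {
           cpu_clamps[group_id].tasks++;
           if (cpu_clamps[group_id].value > cpu_clamp_value)
                   cpu_clamp_value = cpu_clamps[group_id].value;
   }

   static void clamp_dequeue(int group_id)
   {
           if (--cpu_clamps[group_id].tasks == 0 &&
               cpu_clamps[group_id].value >= cpu_clamp_value)
                   cpu_clamp_update();
   }

   int main(void)
   {
           cpu_clamps[0].value = 200;  /* e.g. a "background" clamp group */
           cpu_clamps[1].value = 600;  /* e.g. an "interactive" clamp group */

           clamp_enqueue(0);
           clamp_enqueue(1);
           printf("aggregated clamp: %d\n", cpu_clamp_value);  /* 600 */
           clamp_dequeue(1);
           printf("aggregated clamp: %d\n", cpu_clamp_value);  /* 200 */
           return 0;
   }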

Scheduler performance overheads have been measured using the performance
governor to run 20 iterations of:

   perf bench sched messaging --pipe --thread --group 2 --loop 5000

on a Juno R2 board (4xA53, 2xA72).
With this new implementation we cannot measure any noticeable impact when
comparing with the same benchmark running on tip/sched/core
(as in 9c8783201). For the record, the previous implementation showed a
~1.5% overhead using the same test.

.:: Other comments: use-cases description

People had concerns about use-cases; in a previous posting [4] I
summarized the main use cases we are targeting with this proposal. Further
discussion went on at OSPM, outside of the official tracks, and I got the
feeling that people (at least Peter and Rafael) seem to recognize the
interest in having support for both boosting and capping of CPU
frequencies, based on the currently active tasks.

The main use cases discussed were (refer to [4] for further details):

 - boosting: better interactive response for small tasks which
   affect the user experience. Consider for example the case of a
   small control thread for an external accelerator (e.g. GPU, DSP, other
   devices). In this case the scheduler does not have a complete view of
   the task's bandwidth requirements and, since it is a small task,
   schedutil will keep selecting a lower frequency, thus affecting the
   overall time required to complete its activations.

 - clamping: increase energy efficiency for background tasks not directly
   affecting the user experience. Since running at a lower frequency is in
   general more energy efficient, when the completion time is not a main
   goal, clamping the maximum frequency to use for certain (maybe big)
   tasks can have positive effects, both on power dissipation and energy
   consumption.
   Moreover, this support also allows making RT tasks more energy
   friendly on mobile systems, whenever running them at the maximum
   frequency is not strictly required.

.:: Other comments: usage of CGroups as a main interface

The current implementation is based on CGroups but does not strictly
depend on that API. We do not propose a different main interface just
because, so far, all the use-cases we have on hand can take advantage
of a CGroups API (notably the Android run-time).

In case there should be the need for a different API, the current
implementation can be easily extended to hook its internals to a
different API. However, we believe it's not worth adding the maintenance
burden for an additional API until there is a real demand.

.:: Patches organization

The first three patches of this series introduce util_{min,max} tracking
in the core scheduler, as an extension of the CPU controller.
The fourth patch is dedicated to the synchronization between the cgroup
interface (slow-path) and the core scheduler (fast-path).
The last two patches integrate the utilization clamping support with
schedutil for FAIR tasks and RT/DL tasks too.

A detailed validation and analysis of the proposed features is available
in this notebook:
   https://gist.github.com/7f9170e613dea25fe248e14157e6cb23

Cheers Patrick

.:: References
[1] https://lkml.org/lkml/2017/2/28/355
[2] slides: http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
    video:  http://youtu.be/6MC1jbYbQTo
[3] https://lkml.org/lkml/2017/4/11/670
[4] https://lkml.org/lkml/2017/3/20/688

Patrick Bellasi (6):
  sched/core: add utilization clamping to CPU controller
  sched/core: map cpu's task groups to clamp groups
  sched/core: reference count active tasks's clamp groups
  sched/core: sync task_group's with CPU's clamp groups
  cpufreq: schedutil: add util clamp for FAIR tasks
  cpufreq: schedutil: add util clamp for RT/DL tasks

 include/linux/sched.h            |  12 +
 init/Kconfig                     |  36 ++
 kernel/sched/core.c              | 706 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c |  49 ++-
 kernel/sched/sched.h             | 199 +++++++++++
 5 files changed, 998 insertions(+), 4 deletions(-)

-- 
2.14.1


* [RFCv4 1/6] sched/core: add utilization clamping to CPU controller
  2017-08-24 18:08 [RFCv4 0/6] Add utilization clamping to the CPU controller Patrick Bellasi
@ 2017-08-24 18:08 ` Patrick Bellasi
  2017-08-28 18:23   ` Tejun Heo
  2017-08-24 18:08 ` [RFCv4 2/6] sched/core: map cpu's task groups to clamp groups Patrick Bellasi
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 9+ messages in thread
From: Patrick Bellasi @ 2017-08-24 18:08 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
	Andres Oportus, Joel Fernandes, Viresh Kumar

The cgroup's CPU controller allows assigning a specified (maximum)
bandwidth to the tasks of a group. However, this bandwidth is defined and
enforced only on a temporal basis, without considering the actual
frequency a CPU is running at. Thus, the amount of computation completed
by a task within an allocated bandwidth can be very different depending
on the actual frequency at which the CPU is running that task.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on the actual task utilization.
Thus, it is now possible to extend the CPU controller to specify the
minimum (or maximum) utilization which a task is allowed to
generate. By adding new constraints on the minimum and maximum utilization
allowed for tasks in a CPU control group it will be possible to better
control the actual amount of CPU bandwidth consumed by these tasks.

The ultimate goal of this new pair of constraints is to enable:

- boosting: by selecting a higher execution frequency for small tasks
            which affect the interactive user experience

- capping: by enforcing a lower execution frequency (which usually improves
	   energy efficiency) for big tasks which are mainly related to
	   background activities without a direct impact on the user
	   experience.

This patch extends the CPU controller by adding a couple of new attributes,
util_min and util_max, which can be used to enforce frequency boosting and
capping. Specifically:

- util_min: defines the minimum CPU utilization which should be considered,
	    e.g. when schedutil selects the frequency for a CPU while a
	    task in this group is RUNNABLE,
	    i.e. the task will run at least at the minimum frequency which
	         corresponds to the util_min utilization

- util_max: defines the maximum CPU utilization which should be considered,
	    e.g. when schedutil selects the frequency for a CPU while a
	    task in this group is RUNNABLE,
	    i.e. the task will run up to the maximum frequency which
	         corresponds to the util_max utilization

These attributes:
a) are tunable at all hierarchy levels, i.e. at the root group level too, thus
   allowing the definition of minimum and maximum frequency constraints for all
   otherwise non-classified tasks (e.g. autogroups)
b) allow the creation of subgroups of tasks which do not violate the
   utilization constraints defined by the parent group.

Tasks in a subgroup can only be further boosted and/or capped, which
matches the "limits" schema proposed by the "Resource Distribution
Model (RDM)" of the CGroups v2 documentation:
   Documentation/cgroup-v2.txt

This patch provides the basic support to expose the two new attributes and
to validate their run-time update based on the "limits" of the
aforementioned RDM schema.
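
As a purely illustrative example of the "limits" restriction (the values
below are invented): if a group is configured with util_min=200 and
util_max=800, a subgroup may further restrict itself to util_min=300
(more boosted) or util_max=600 (more capped), while writes of
util_min=100 or util_max=900 are rejected with -EINVAL, since they would
relax the constraints inherited from the parent.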

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 include/linux/sched.h |   7 ++
 init/Kconfig          |  17 +++++
 kernel/sched/core.c   | 180 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  22 ++++++
 4 files changed, 226 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c28b182c9833..265ac0898f9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -241,6 +241,13 @@ struct vtime {
 	u64			gtime;
 };
 
+enum uclamp_id {
+	UCLAMP_MIN = 0, /* Minimum utilization */
+	UCLAMP_MAX,     /* Maximum utilization */
+	/* Utilization clamping constraints count */
+	UCLAMP_CNT
+};
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..db736529f08b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -754,6 +754,23 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config UTIL_CLAMP
+	bool "Utilization clamping per group of tasks"
+	depends on CPU_FREQ_GOV_SCHEDUTIL
+	depends on CGROUP_SCHED
+	default n
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+	  When this option is enabled, the user can specify a min and max
+	  CPU bandwidth which is allowed for each single task in a group.
+	  The max bandwidth allows clamping the maximum frequency a task
+	  can use, while the min bandwidth allows defining the minimum
+	  frequency a task will always use.
+
+	  If in doubt, say N.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9f9948e2470..20b5a11d64ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -751,6 +751,48 @@ static void set_load_weight(struct task_struct *p)
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
+#ifdef CONFIG_UTIL_CLAMP
+/**
+ * uclamp_mutex: serialize updates of TG's utilization clamp values
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
+/**
+ * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
+ * @tg: the newly created task group
+ * @parent: its parent task group
+ *
+ * A newly created task group inherits its utilization clamp values, for all
+ * clamp indexes, from its parent task group.
+ */
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+					    struct task_group *parent)
+{
+	int clamp_id;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
+}
+
+/**
+ * init_uclamp: initialize data structures required for utilization clamping
+ */
+static inline void init_uclamp(void)
+{
+	int clamp_id;
+
+	mutex_init(&uclamp_mutex);
+
+	/* Initialize root TG's to default (none) clamp values */
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		root_task_group.uclamp[clamp_id] = uclamp_none(clamp_id);
+}
+#else
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+					    struct task_group *parent) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UTIL_CLAMP */
+
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (!(flags & ENQUEUE_NOCLOCK))
@@ -5907,6 +5949,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	init_uclamp();
+
 	scheduler_running = 1;
 }
 
@@ -6099,6 +6143,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	alloc_uclamp_sched_group(tg, parent);
+
 	return tg;
 
 err:
@@ -6319,6 +6365,128 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_UTIL_CLAMP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 min_value)
+{
+	struct cgroup_subsys_state *pos;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	if (min_value > SCHED_CAPACITY_SCALE)
+		return ret;
+
+	mutex_lock(&uclamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->uclamp[UCLAMP_MIN] == min_value) {
+		ret = 0;
+		goto out;
+	}
+
+	/* Ensure to not exceed the maximum clamp value */
+	if (tg->uclamp[UCLAMP_MAX] < min_value)
+		goto out;
+
+	/* Ensure min clamp fits within parent's clamp value */
+	if (tg->parent &&
+	    tg->parent->uclamp[UCLAMP_MIN] > min_value)
+		goto out;
+
+	/* Ensure each child is a restriction of this TG */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->uclamp[UCLAMP_MIN] < min_value)
+			goto out;
+	}
+
+	/* Update TG's utilization clamp */
+	tg->uclamp[UCLAMP_MIN] = min_value;
+	ret = 0;
+
+out:
+	rcu_read_unlock();
+	mutex_unlock(&uclamp_mutex);
+
+	return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 max_value)
+{
+	struct cgroup_subsys_state *pos;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	if (max_value > SCHED_CAPACITY_SCALE)
+		return ret;
+
+	mutex_lock(&uclamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->uclamp[UCLAMP_MAX] == max_value) {
+		ret = 0;
+		goto out;
+	}
+
+	/* Ensure to not go below the minimum clamp value */
+	if (tg->uclamp[UCLAMP_MIN] > max_value)
+		goto out;
+
+	/* Ensure max clamp fits within parent's clamp value */
+	if (tg->parent &&
+	    tg->parent->uclamp[UCLAMP_MAX] < max_value)
+		goto out;
+
+	/* Ensure each child is a restriction of this TG */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->uclamp[UCLAMP_MAX] > max_value)
+			goto out;
+	}
+
+	/* Update TG's utilization clamp */
+	tg->uclamp[UCLAMP_MAX] = max_value;
+	ret = 0;
+
+out:
+	rcu_read_unlock();
+	mutex_unlock(&uclamp_mutex);
+
+	return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+				  enum uclamp_id clamp_id)
+{
+	struct task_group *tg;
+	u64 util_clamp;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	util_clamp = tg->uclamp[clamp_id];
+	rcu_read_unlock();
+
+	return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UTIL_CLAMP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -6641,6 +6809,18 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_UTIL_CLAMP
+	{
+		.name = "util_min",
+		.read_u64 = cpu_util_min_read_u64,
+		.write_u64 = cpu_util_min_write_u64,
+	},
+	{
+		.name = "util_max",
+		.read_u64 = cpu_util_max_read_u64,
+		.write_u64 = cpu_util_max_write_u64,
+	},
 #endif
 	{ }	/* Terminate */
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3086d1..982340b8870b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -330,6 +330,10 @@ struct task_group {
 #endif
 
 	struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UTIL_CLAMP
+	unsigned int uclamp[UCLAMP_CNT];
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -365,6 +369,24 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
 
 extern int tg_nop(struct task_group *tg, void *data);
 
+#ifdef CONFIG_UTIL_CLAMP
+/**
+ * uclamp_none: default value for a clamp
+ *
+ * This returns the default value for each clamp
+ * - 0 for a min utilization clamp
+ * - SCHED_CAPACITY_SCALE for a max utilization clamp
+ *
+ * Return: the default value for a given utilization clamp
+ */
+static inline unsigned int uclamp_none(int clamp_id)
+{
+	if (clamp_id == UCLAMP_MIN)
+		return 0;
+	return SCHED_CAPACITY_SCALE;
+}
+#endif /* CONFIG_UTIL_CLAMP */
+
 extern void free_fair_sched_group(struct task_group *tg);
 extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
 extern void online_fair_sched_group(struct task_group *tg);
-- 
2.14.1


* [RFCv4 2/6] sched/core: map cpu's task groups to clamp groups
  2017-08-24 18:08 [RFCv4 0/6] Add utilization clamping to the CPU controller Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 1/6] sched/core: add utilization clamping to " Patrick Bellasi
@ 2017-08-24 18:08 ` Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 3/6] sched/core: reference count active tasks's " Patrick Bellasi
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Patrick Bellasi @ 2017-08-24 18:08 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
	Andres Oportus, Joel Fernandes, Viresh Kumar

To properly support per-task utilization clamping, each CPU needs to know
which clamp values are required by tasks currently RUNNABLE on that CPU.
However, multiple task groups can define the same clamp value for a given
clamp index (i.e. util_{min,max}).
Thus, a mechanism is required to map clamp values to a properly defined
data structure which is suitable for fast and efficient aggregation of
clamp values coming from tasks belonging to different task_groups.

Such a data structure can be an array of reference counters, where each
slot is used to account for how many tasks requiring a certain clamp value
are currently enqueued. Each clamp value can thus be mapped into a
"clamp group index" which identifies its position within the reference
counters array.

                                 :
                                 :
             SLOW PATH           :             FAST PATH
         task_group::write       :     sched/core::enqueue/dequeue
                                 :         cpufreq_schedutil
                                 :
  +----------------+    +--------------------+     +-------------------+
  |   TASK GROUP   |    |     CLAMP GROUP    |     |    CPU CLAMPS     |
  +----------------+    +--------------------+     +-------------------+
  |                |    |   clamp_{min,max}  |     |  clamp_{min,max}  |
  | util_{min,max} |    |      tg_count      |     |    tasks count    |
  +----------------+    +--------------------+     +-------------------+
                                 :
           +------------------>  :  +------------------->
  map(clamp_value, clamp_group)  :  ref_count(clamp_group)
                                 :
                                 :
                                 :

This patch introduces the support to map task groups on "clamp groups".
Specifically it introduces the required functions to translate a clamp
value into a clamp group index.

Only a limited number of (different) clamp values are supported since:
1. there are usually only a few classes of workloads for which it makes
   sense to boost/cap to different frequencies,
   e.g. background vs foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of
   the per-CPU clamp values in the fast path

The number of possible different clamp values is currently defined at
compile time. It is worth noticing that this does not impose a limitation
on the number of task groups that can be generated. Indeed, each new task
group always maps to the clamp groups of its parent.
Instead, changing the clamp value for a TG can result in a -ENOSPC error
if this would exceed the maximum number of different clamp values
supported.
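
As a purely illustrative example: with the default
CONFIG_UCLAMP_GROUPS_COUNT=3 (plus the slot reserved for the default
clamp value), roughly three distinct non-default values can be tracked
concurrently for each clamp index. If three task groups already use, say,
util_min values of 100, 200 and 300, writing a fourth distinct value to
another group returns -ENOSPC until one of those values is released.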

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 init/Kconfig         |  19 +++
 kernel/sched/core.c  | 348 +++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  21 +++-
 3 files changed, 363 insertions(+), 25 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index db736529f08b..5f0c246f2a3a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -771,6 +771,25 @@ config UTIL_CLAMP
 
 	  If in doubt, say N.
 
+config UCLAMP_GROUPS_COUNT
+	int "Number of utilization clamp values supported"
+	range 1 127
+	depends on UTIL_CLAMP
+	default 3
+	help
+	  This defines the maximum number of different utilization clamp
+	  values which can be concurrently enforced for each utilization
+	  clamp index (i.e. minimum and maximum).
+
+	  Only a limited number of clamp values are supported because:
+	    1. there are usually only a few classes of workloads for which it
+	       makes sense to boost/cap to different frequencies
+	       e.g. background vs foreground, interactive vs low-priority
+	    2. it allows a simpler and more memory/time efficient tracking of
+	       the per-CPU clamp values.
+
+	  If in doubt, use the default value.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20b5a11d64ab..0d39766f2b03 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -757,6 +757,232 @@ static void set_load_weight(struct task_struct *p)
  */
 static DEFINE_MUTEX(uclamp_mutex);
 
+/**
+ * uclamp_map: a clamp group representing a clamp value
+ *
+ * Since only a limited number of different clamp values are supported, this
+ * map allows tracking how many TG's use the same clamp value and it defines
+ * the clamp group index used by the per-CPU accounting in the fast-path
+ * (i.e. tasks enqueuing/dequeuing)
+ *
+ * Since we support both max and min utilization clamp value, a matrix is used
+ * to map clamp values to group indexes:
+ * - rows map the different clamp indexes
+ *   i.e. minimum or maximum utilization
+ * - columns map the different clamp groups
+ *   i.e. TG's with similar clamp value for a given clamp index
+ *
+ * NOTE: clamp group 0 is reserved for the tracking of non clamped tasks.
+ * Thus we allocate one more slot than the value of
+ * CONFIG_UCLAMP_GROUPS_COUNT.
+ */
+struct uclamp_map {
+	int value;
+	int tg_count;
+	raw_spinlock_t tg_lock;
+};
+
+/**
+ * uclamp_maps: map TGs into per-CPU utilization clamp values
+ *
+ * This is a matrix where:
+ * - rows are indexed by clamp_id, and collects the clamp group for a given
+ *   clamp index (i.e. minimum or maximum utilization)
+ * - cols are indexed by group_id, and represents an actual clamp group (i.e.
+ *   a uclamp_map instance)
+ *
+ * Here is the map layout and, right below, how entries are accessed by the
+ * following code.
+ *
+ *                          uclamp_maps is a matrix of
+ *          +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
+ *          |                                |
+ *          |                /---------------+---------------\
+ *          |               +------------+       +------------+
+ *          |  / UCLAMP_MIN | value      |       | value      |
+ *          |  |            | tg_count   |...... | tg_count   |
+ *          |  |            +------------+       +------------+
+ *          +--+            +------------+       +------------+
+ *             |            | value      |       | value      |
+ *             \ UCLAMP_MAX | tg_count   |...... | tg_count   |
+ *                          +-----^------+       +----^-------+
+ *                                |                   |
+ *                      uc_map =  +                   |
+ *                     &uclamp_maps[clamp_id][0]      +
+ *                                                clamp_value =
+ *                                       uc_map[group_id].value
+ */
+static struct uclamp_map uclamp_maps[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
+
+/**
+ * uclamp_group_available: check if a clamp group is available
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index in the given clamp_id
+ *
+ * A clamp group is not free if there is at least one TG using a clamp value
+ * mapped on the specified clamp_id. These TG's are reference counted by the
+ * tg_count of a uclamp_map entry.
+ *
+ * Return: true if there are no TG's mapped on the specified clamp
+ *         index and group
+ */
+static inline bool uclamp_group_available(int clamp_id, int group_id)
+{
+	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+	return (uc_map[group_id].value == UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_init: map a clamp value on a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index to map a given clamp_value
+ * @clamp_value: the utilization clamp value to map
+ *
+ * Each different clamp value, for a given clamp index (i.e. min/max
+ * utilization clamp), is mapped to a clamp group whose index is used by the
+ * fast-path code to keep track of active tasks requiring a certain clamp
+ * value.
+ *
+ * This function initializes a clamp group to track tasks from the fast-path.
+ */
+static inline void uclamp_group_init(int clamp_id, int group_id,
+				     unsigned int clamp_value)
+{
+	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+	uc_map[group_id].value = clamp_value;
+	uc_map[group_id].tg_count = 0;
+}
+
+/**
+ * uclamp_group_reset: reset a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @group_id: the group index to release
+ *
+ * A clamp group can be reset every time there are no more TGs using the
+ * clamp value it maps for a given clamp index.
+ */
+static inline void uclamp_group_reset(int clamp_id, int group_id)
+{
+	uclamp_group_init(clamp_id, group_id, UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_find: find the group index of a utilization clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @clamp_value: the utilization clamping value lookup for
+ *
+ * Verify if a group has been assigned to a certain clamp value and return
+ * its index to be used for accounting.
+ *
+ * Since only a limited number of utilization clamp groups are allowed, if no
+ * groups have been assigned for the specified value, a new group is assigned
+ * if possible. Otherwise an error is returned, meaning that a different clamp
+ * value is not (currently) supported.
+ */
+static int
+uclamp_group_find(int clamp_id, unsigned int clamp_value)
+{
+	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+	int free_group_id = UCLAMP_NONE;
+	unsigned int group_id = 0;
+
+	for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+		/* Keep track of first free clamp group */
+		if (uclamp_group_available(clamp_id, group_id)) {
+			if (free_group_id == UCLAMP_NONE)
+				free_group_id = group_id;
+			continue;
+		}
+		/* Return index of first group with same clamp value */
+		if (uc_map[group_id].value == clamp_value)
+			return group_id;
+	}
+	/* Default to first free clamp group */
+	if (group_id > CONFIG_UCLAMP_GROUPS_COUNT)
+		group_id = free_group_id;
+	/* All clamp groups already track different clamp values */
+	if (group_id == UCLAMP_NONE)
+		return -ENOSPC;
+	return group_id;
+}
+
+/**
+ * uclamp_group_put: decrease the reference count for a clamp group
+ * @clamp_id: the clamp index which was affected by a task group
+ * @uc_tg: the utilization clamp data for that task group
+ *
+ * When the clamp value for a task group is changed we decrease the reference
+ * count for the clamp group mapping its current clamp value. A clamp group is
+ * released when there are no more task groups referencing its clamp value.
+ */
+static inline void uclamp_group_put(int clamp_id, int group_id)
+{
+	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+	unsigned long flags;
+
+	/* Ignore TG's not yet attached */
+	if (group_id == UCLAMP_NONE)
+		return;
+
+	/* Remove TG from this clamp group */
+	raw_spin_lock_irqsave(&uc_map[group_id].tg_lock, flags);
+	uc_map[group_id].tg_count -= 1;
+	if (uc_map[group_id].tg_count == 0)
+		uclamp_group_reset(clamp_id, group_id);
+	raw_spin_unlock_irqrestore(&uc_map[group_id].tg_lock, flags);
+}
+
+/**
+ * uclamp_group_get: increase the reference count for a clamp group
+ * @css: reference to the task group to account
+ * @clamp_id: the clamp index affected by the task group
+ * @uc_tg: the utilization clamp data for the task group
+ * @clamp_value: the new clamp value for the task group
+ *
+ * Each time a task group changes the utilization clamp value, for a specified
+ * clamp index, we need to find an available clamp group which can be used
+ * to track its new clamp value. The corresponding clamp group index will be
+ * used by tasks in this task group to reference count the clamp value on CPUs
+ * where they are enqueued.
+ *
+ * Return: -ENOSPC if there are no available clamp groups, 0 on success.
+ */
+static inline int uclamp_group_get(struct cgroup_subsys_state *css,
+				   int clamp_id, struct uclamp_tg *uc_tg,
+				   unsigned int clamp_value)
+{
+	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+	int prev_group_id = uc_tg->group_id;
+	int next_group_id = UCLAMP_NONE;
+	unsigned long flags;
+
+	/* Lookup for a usable utilization clamp group */
+	next_group_id = uclamp_group_find(clamp_id, clamp_value);
+	if (next_group_id < 0) {
+		pr_err("Cannot allocate more than %d utilization clamp groups\n",
+		       CONFIG_UCLAMP_GROUPS_COUNT);
+		return -ENOSPC;
+	}
+
+	/* Allocate new clamp group for this clamp value */
+	if (uclamp_group_available(clamp_id, next_group_id))
+		uclamp_group_init(clamp_id, next_group_id, clamp_value);
+
+	/* Update TG's clamp values and attach it to new clamp group */
+	raw_spin_lock_irqsave(&uc_map[next_group_id].tg_lock, flags);
+	uc_tg->value = clamp_value;
+	uc_tg->group_id = next_group_id;
+	uc_map[next_group_id].tg_count += 1;
+	raw_spin_unlock_irqrestore(&uc_map[next_group_id].tg_lock, flags);
+
+	/* Release the previous clamp group */
+	uclamp_group_put(clamp_id, prev_group_id);
+
+	return 0;
+}
+
 /**
  * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
  * @tg: the newly created task group
@@ -764,14 +990,52 @@ static DEFINE_MUTEX(uclamp_mutex);
  *
  * A newly created task group inherits its utilization clamp values, for all
  * clamp indexes, from its parent task group.
+ * This ensures that its values are properly initialized and that the task
+ * group is accounted in the same parent's group index.
+ *
+ * Return: 0 on error, !0 otherwise
+ */
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+					   struct task_group *parent)
+{
+	struct uclamp_tg *uc_tg;
+	int clamp_id;
+	int ret = 1;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		uc_tg = &tg->uclamp[clamp_id];
+
+		uc_tg->value = parent->uclamp[clamp_id].value;
+		uc_tg->group_id = UCLAMP_NONE;
+
+		if (uclamp_group_get(NULL, clamp_id, uc_tg,
+				     parent->uclamp[clamp_id].value)) {
+			ret = 0;
+			goto out;
+		}
+	}
+
+out:
+	return ret;
+}
+
+/**
+ * release_uclamp_sched_group: release utilization clamp references of a TG
+ * @tg: the task group being removed
+ *
+ * An empty task group can be removed only when it has no more tasks or child
+ * groups. This means that we can also safely release all the reference
+ * counting to clamp groups.
  */
-static inline void alloc_uclamp_sched_group(struct task_group *tg,
-					    struct task_group *parent)
+static inline void free_uclamp_sched_group(struct task_group *tg)
 {
+	struct uclamp_tg *uc_tg;
 	int clamp_id;
 
-	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
-		tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		uc_tg = &tg->uclamp[clamp_id];
+		uclamp_group_put(clamp_id, uc_tg->group_id);
+	}
 }
 
 /**
@@ -779,17 +1043,49 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
  */
 static inline void init_uclamp(void)
 {
+	struct uclamp_map *uc_map;
+	struct uclamp_tg *uc_tg;
+	int group_id;
 	int clamp_id;
 
 	mutex_init(&uclamp_mutex);
 
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		uc_map = &uclamp_maps[clamp_id][0];
+		/* Init TG's clamp map */
+		group_id = 0;
+		for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+			uc_map[group_id].value = UCLAMP_NONE;
+			raw_spin_lock_init(&uc_map[group_id].tg_lock);
+		}
+	}
+
+	/* Root TG's are initialized to the first clamp group */
+	group_id = 0;
+
 	/* Initialize root TG's to default (none) clamp values */
-	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
-		root_task_group.uclamp[clamp_id] = uclamp_none(clamp_id);
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		uc_map = &uclamp_maps[clamp_id][0];
+
+		/* Map root TG's clamp value */
+		uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
+
+		/* Init root TG's clamp group */
+		uc_tg = &root_task_group.uclamp[clamp_id];
+		uc_tg->value = uclamp_none(clamp_id);
+		uc_tg->group_id = group_id;
+
+		/* Attach root TG's clamp group */
+		uc_map[group_id].tg_count = 1;
+	}
 }
 #else
-static inline void alloc_uclamp_sched_group(struct task_group *tg,
-					    struct task_group *parent) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+					   struct task_group *parent)
+{
+	return 1;
+}
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
 static inline void init_uclamp(void) { }
 #endif /* CONFIG_UTIL_CLAMP */
 
@@ -6122,6 +6418,7 @@ static DEFINE_SPINLOCK(task_group_lock);
 
 static void sched_free_group(struct task_group *tg)
 {
+	free_uclamp_sched_group(tg);
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
@@ -6143,7 +6440,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
-	alloc_uclamp_sched_group(tg, parent);
+	if (!alloc_uclamp_sched_group(tg, parent))
+		goto err;
 
 	return tg;
 
@@ -6370,6 +6668,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
 				  struct cftype *cftype, u64 min_value)
 {
 	struct cgroup_subsys_state *pos;
+	struct uclamp_tg *uc_tg;
 	struct task_group *tg;
 	int ret = -EINVAL;
 
@@ -6382,29 +6681,29 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
 	tg = css_tg(css);
 
 	/* Already at the required value */
-	if (tg->uclamp[UCLAMP_MIN] == min_value) {
+	if (tg->uclamp[UCLAMP_MIN].value == min_value) {
 		ret = 0;
 		goto out;
 	}
 
 	/* Ensure to not exceed the maximum clamp value */
-	if (tg->uclamp[UCLAMP_MAX] < min_value)
+	if (tg->uclamp[UCLAMP_MAX].value < min_value)
 		goto out;
 
 	/* Ensure min clamp fits within parent's clamp value */
 	if (tg->parent &&
-	    tg->parent->uclamp[UCLAMP_MIN] > min_value)
+	    tg->parent->uclamp[UCLAMP_MIN].value > min_value)
 		goto out;
 
 	/* Ensure each child is a restriction of this TG */
 	css_for_each_child(pos, css) {
-		if (css_tg(pos)->uclamp[UCLAMP_MIN] < min_value)
+		if (css_tg(pos)->uclamp[UCLAMP_MIN].value < min_value)
 			goto out;
 	}
 
-	/* Update TG's utilization clamp */
-	tg->uclamp[UCLAMP_MIN] = min_value;
-	ret = 0;
+	/* Update TG's reference count */
+	uc_tg = &tg->uclamp[UCLAMP_MIN];
+	ret = uclamp_group_get(css, UCLAMP_MIN, uc_tg, min_value);
 
 out:
 	rcu_read_unlock();
@@ -6417,6 +6716,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
 				  struct cftype *cftype, u64 max_value)
 {
 	struct cgroup_subsys_state *pos;
+	struct uclamp_tg *uc_tg;
 	struct task_group *tg;
 	int ret = -EINVAL;
 
@@ -6429,29 +6729,29 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
 	tg = css_tg(css);
 
 	/* Already at the required value */
-	if (tg->uclamp[UCLAMP_MAX] == max_value) {
+	if (tg->uclamp[UCLAMP_MAX].value == max_value) {
 		ret = 0;
 		goto out;
 	}
 
 	/* Ensure to not go below the minimum clamp value */
-	if (tg->uclamp[UCLAMP_MIN] > max_value)
+	if (tg->uclamp[UCLAMP_MIN].value > max_value)
 		goto out;
 
 	/* Ensure max clamp fits within parent's clamp value */
 	if (tg->parent &&
-	    tg->parent->uclamp[UCLAMP_MAX] < max_value)
+	    tg->parent->uclamp[UCLAMP_MAX].value < max_value)
 		goto out;
 
 	/* Ensure each child is a restriction of this TG */
 	css_for_each_child(pos, css) {
-		if (css_tg(pos)->uclamp[UCLAMP_MAX] > max_value)
+		if (css_tg(pos)->uclamp[UCLAMP_MAX].value > max_value)
 			goto out;
 	}
 
-	/* Update TG's utilization clamp */
-	tg->uclamp[UCLAMP_MAX] = max_value;
-	ret = 0;
+	/* Update TG's reference count */
+	uc_tg = &tg->uclamp[UCLAMP_MAX];
+	ret = uclamp_group_get(css, UCLAMP_MAX, uc_tg, max_value);
 
 out:
 	rcu_read_unlock();
@@ -6468,7 +6768,7 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
 
 	rcu_read_lock();
 	tg = css_tg(css);
-	util_clamp = tg->uclamp[clamp_id];
+	util_clamp = tg->uclamp[clamp_id].value;
 	rcu_read_unlock();
 
 	return util_clamp;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 982340b8870b..869344de0396 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -290,6 +290,24 @@ struct cfs_bandwidth {
 #endif
 };
 
+/**
+ * Utilization's clamp group
+ *
+ * A utilization clamp group maps a "clamp value" (value), i.e.
+ * util_{min,max}, to a "clamp group index" (group_id).
+ *
+ * Thus, the same "group_id" is used by all the TG's which enforce the same
+ * clamp "value" for a given clamp index.
+ */
+struct uclamp_tg {
+	/* Utilization constraint for tasks in this group */
+	unsigned int value;
+	/* Utilization clamp group for this constraint */
+	unsigned int group_id;
+	/* No utilization clamp group assigned */
+#define UCLAMP_NONE	-1
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -332,8 +350,9 @@ struct task_group {
 	struct cfs_bandwidth cfs_bandwidth;
 
 #ifdef CONFIG_UTIL_CLAMP
-	unsigned int uclamp[UCLAMP_CNT];
+	struct uclamp_tg uclamp[UCLAMP_CNT];
 #endif
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.14.1


* [RFCv4 3/6] sched/core: reference count active tasks's clamp groups
  2017-08-24 18:08 [RFCv4 0/6] Add utilization clamping to the CPU controller Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 1/6] sched/core: add utilization clamping to " Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 2/6] sched/core: map cpu's task groups to clamp groups Patrick Bellasi
@ 2017-08-24 18:08 ` Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 4/6] sched/core: sync task_group's with CPU's " Patrick Bellasi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Patrick Bellasi @ 2017-08-24 18:08 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
	Andres Oportus, Joel Fernandes, Viresh Kumar

When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Indeed, the clamp value mapped by a clamp
group applies to a CPU only when there is at least one task active in
that clamp group.
Since each clamp group enforces a different utilization clamp value, once
the set of these groups changes it can be necessary to re-compute the new
"aggregated" clamp value to apply to that CPU.

Clamp values are always MAX aggregated for both util_min and util_max. This
ensures that no task can affect the performance of other co-scheduled
tasks which are either more boosted (i.e. with a higher util_min clamp)
or less capped (i.e. with a higher util_max clamp).

This patch introduces the required support to properly reference count
clamp groups at each task enqueue/dequeue time. The MAX aggregation of the
currently active clamp groups is implemented to minimize the number of
times we need to scan the complete (unordered) clamp group array to
figure out the new max value.
This operation happens only when we dequeue the last task of the clamp
group defining the current max clamp, and thus the CPU is either entering
IDLE or going to schedule a less boosted or more capped task.
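
As a purely illustrative example of the MAX aggregation: if a CPU has
RUNNABLE tasks from a group clamped to util_min=200/util_max=600 and from
a group clamped to util_min=400/util_max=1024, the CPU-level clamps are
util_min=400 and util_max=1024. When the last task of the second group is
dequeued, the (unordered) clamp group array is re-scanned and the CPU
clamps drop back to 200/600.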

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 include/linux/sched.h |   5 ++
 kernel/sched/core.c   | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  77 ++++++++++++++++++++++++
 3 files changed, 242 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 265ac0898f9e..5cf0ee6a1aee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -574,6 +574,11 @@ struct task_struct {
 #endif
 	struct sched_dl_entity		dl;
 
+#ifdef CONFIG_UTIL_CLAMP
+	/* Index of clamp group the task has been accounted into */
+	int				uclamp_group_id[UCLAMP_CNT];
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* List of struct preempt_notifier: */
 	struct hlist_head		preempt_notifiers;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d39766f2b03..ba31bb4e14c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -850,9 +850,19 @@ static inline void uclamp_group_init(int clamp_id, int group_id,
 				     unsigned int clamp_value)
 {
 	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+	struct uclamp_cpu *uc_cpu;
+	int cpu;
 
+	/* Set clamp group map */
 	uc_map[group_id].value = clamp_value;
 	uc_map[group_id].tg_count = 0;
+
+	/* Set clamp groups on all CPUs */
+	for_each_possible_cpu(cpu) {
+		uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+		uc_cpu->group[group_id].value = clamp_value;
+		uc_cpu->group[group_id].tasks = 0;
+	}
 }
 
 /**
@@ -908,6 +918,110 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
 	return group_id;
 }
 
+/**
+ * uclamp_cpu_update: update the utilization clamp of a CPU
+ * @cpu: the CPU which utilization clamp has to be updated
+ * @clamp_id: the clamp index to update
+ *
+ * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
+ * clamp groups is subject to change. Since each clamp group enforces a
+ * different utilization clamp value, once the set of these groups changes it
+ * can be required to re-compute what is the new clamp value to apply for that
+ * CPU.
+ *
+ * For the specified clamp index, this method computes the new CPU utilization
+ * clamp to use until the next change on the set of tasks active on that CPU.
+ */
+static inline void uclamp_cpu_update(int cpu, int clamp_id)
+{
+	struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+	int max_value = UCLAMP_NONE;
+	unsigned int group_id;
+
+	for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+
+		/* Ignore inactive clamp groups, i.e. no RUNNABLE tasks */
+		if (!uclamp_group_active(uc_cpu, group_id))
+			continue;
+
+		/* Both min and max clamp are MAX aggregated */
+		max_value = max(max_value, uc_cpu->group[group_id].value);
+
+		/* Stop if we reach the max possible clamp */
+		if (max_value >= SCHED_CAPACITY_SCALE)
+			break;
+	}
+	uc_cpu->value = max_value;
+}
+
+/**
+ * uclamp_cpu_get(): increase reference count for a clamp group on a CPU
+ * @p: the task being enqueued on a CPU
+ * @cpu: the CPU where the clamp group has to be reference counted
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
+ *
+ * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
+ * the task's TG::uclamp.group_id is reference counted on that CPU.
+ * We keep track of the reference counted clamp group by storing its index
+ * (group_id) into the task's task_struct::uclamp_group_id, which will then be
+ * used at task's dequeue time to release the reference count.
+ */
+static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
+{
+	struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+	int clamp_value = task_group(p)->uclamp[clamp_id].value;
+	int group_id;
+
+	/* Increment the current TG's group_id */
+	group_id = task_group(p)->uclamp[clamp_id].group_id;
+	uc_cpu->group[group_id].tasks += 1;
+
+	/* Mark task as enqueued for this clamp IDX */
+	p->uclamp_group_id[clamp_id] = group_id;
+
+	/*
+	 * If this is the new max utilization clamp value, then
+	 * we can update straight away the CPU clamp value.
+	 */
+	if (uc_cpu->value < clamp_value)
+		uc_cpu->value = clamp_value;
+}
+
+/**
+ * uclamp_cpu_put(): decrease reference count for a clamp groups on a CPU
+ * @p: the task being dequeued from a CPU
+ * @cpu: the CPU from where the clamp group has to be released
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
+ *
+ * When a task is dequeued from a CPU's RQ, the clamp group reference counted
+ * by the task's task_struct::uclamp_group_id is decreased for that CPU.
+ */
+static inline void uclamp_cpu_put(struct task_struct *p, int cpu, int clamp_id)
+{
+	struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+	unsigned int clamp_value;
+	int group_id;
+
+	/* Decrement the task's reference counted group index */
+	group_id = p->uclamp_group_id[clamp_id];
+	uc_cpu->group[group_id].tasks -= 1;
+
+	/* Mark task as dequeued for this clamp IDX */
+	p->uclamp_group_id[clamp_id] = UCLAMP_NONE;
+
+	/* If this is not the last task, no updates are required */
+	if (uc_cpu->group[group_id].tasks > 0)
+		return;
+
+	/*
+	 * Update the CPU only if this was the last task of the group
+	 * defining the current clamp value.
+	 */
+	clamp_value = uc_cpu->group[group_id].value;
+	if (clamp_value >= uc_cpu->value)
+		uclamp_cpu_update(cpu, clamp_id);
+}
+
 /**
  * uclamp_group_put: decrease the reference count for a clamp group
  * @clamp_id: the clamp index which was affected by a task group
@@ -983,6 +1097,38 @@ static inline int uclamp_group_get(struct cgroup_subsys_state *css,
 	return 0;
 }
 
+/**
+ * uclamp_task_update: update clamp group referenced by a task
+ * @rq: the RQ the task is going to be enqueued/dequeued to/from
+ * @p: the task being enqueued/dequeued
+ *
+ * Utilization clamp constraints for a CPU depend on tasks which are active
+ * (i.e. RUNNABLE or RUNNING) on that CPU. To keep track of tasks
+ * requirements, each active task reference counts a clamp group in the CPU
+ * they are currently queued for execution.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
+{
+	int cpu = cpu_of(rq);
+	int clamp_id;
+
+	/* The idle task is never clamped */
+	if (unlikely(p->sched_class == &idle_sched_class))
+		return;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		if (uclamp_task_affects(p, clamp_id))
+			uclamp_cpu_put(p, cpu, clamp_id);
+		else
+			uclamp_cpu_get(p, cpu, clamp_id);
+	}
+}
+
 /**
  * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
  * @tg: the newly created task group
@@ -1043,10 +1189,12 @@ static inline void free_uclamp_sched_group(struct task_group *tg)
  */
 static inline void init_uclamp(void)
 {
+	struct uclamp_cpu *uc_cpu;
 	struct uclamp_map *uc_map;
 	struct uclamp_tg *uc_tg;
 	int group_id;
 	int clamp_id;
+	int cpu;
 
 	mutex_init(&uclamp_mutex);
 
@@ -1058,6 +1206,11 @@ static inline void init_uclamp(void)
 			uc_map[group_id].value = UCLAMP_NONE;
 			raw_spin_lock_init(&uc_map[group_id].tg_lock);
 		}
+		/* Init CPU's clamp groups */
+		for_each_possible_cpu(cpu) {
+			uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+			memset(uc_cpu, UCLAMP_NONE, sizeof(struct uclamp_cpu));
+		}
 	}
 
 	/* Root TG's are initialized to the first clamp group */
@@ -1080,6 +1233,7 @@ static inline void init_uclamp(void)
 	}
 }
 #else
+static inline void uclamp_task_update(struct rq *rq, struct task_struct *p) { }
 static inline int alloc_uclamp_sched_group(struct task_group *tg,
 					   struct task_group *parent)
 {
@@ -1097,6 +1251,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
+	uclamp_task_update(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -1108,6 +1263,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & DEQUEUE_SAVE))
 		sched_info_dequeued(rq, p);
 
+	uclamp_task_update(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -2499,6 +2655,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.cfs_rq			= NULL;
 #endif
 
+#ifdef CONFIG_UTIL_CLAMP
+	memset(&p->uclamp_group_id, UCLAMP_NONE, sizeof(p->uclamp_group_id));
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	/* Even if schedstat is disabled, there should not be garbage */
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 869344de0396..b0f17c19c0f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -389,6 +389,42 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
 extern int tg_nop(struct task_group *tg, void *data);
 
 #ifdef CONFIG_UTIL_CLAMP
+/**
+ * Utilization clamp Group
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_group {
+	/* Utilization clamp value for tasks on this clamp group */
+	int value;
+	/* Number of RUNNABLE tasks on this clamp group */
+	int tasks;
+};
+
+/**
+ * CPU's utilization clamp
+ *
+ * Keep track of active tasks on a CPUs to aggregate their clamp values.  A
+ * clamp value is affecting a CPU where there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we wanna run the CPU at least at the max of the minimum
+ *   utilization required by its currently active tasks.
+ * - for util_max: we wanna allow the CPU to run up to the max of the
+ *   maximum utilization allowed by its currently active tasks.
+ *
+ * Since on each system we expect only a limited number of utilization clamp
+ * values, we can use a simple array to track the metrics required to compute
+ * all the per-CPU utilization clamp values.
+ */
+struct uclamp_cpu {
+	/* Utilization clamp value for a CPU */
+	int value;
+	/* Utilization clamp groups affecting this CPU */
+	struct uclamp_group group[CONFIG_UCLAMP_GROUPS_COUNT + 1];
+};
+
 /**
  * uclamp_none: default value for a clamp
  *
@@ -404,6 +440,44 @@ static inline unsigned int uclamp_none(int clamp_id)
 		return 0;
 	return SCHED_CAPACITY_SCALE;
 }
+
+/**
+ * uclamp_task_affects: check if a task affects a utilization clamp
+ * @p: the task to consider
+ * @clamp_id: the utilization clamp to check
+ *
+ * A task affects a clamp index if its task_struct::uclamp_group_id is a
+ * valid clamp group index for the specified clamp index.
+ * Once a task is dequeued from a CPU, its clamp group indexes are reset to
+ * UCLAMP_NONE. A valid clamp group index is assigned to a task only when it
+ * is RUNNABLE on a CPU and it represents the clamp group which is currently
+ * reference counted by that task.
+ *
+ * Return: true if p currently affects the specified clamp_id
+ */
+static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
+{
+	int task_group_id = p->uclamp_group_id[clamp_id];
+
+	return (task_group_id != UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_active: check if a clamp group is active on a CPU
+ * @uc_cpu: the array of clamp groups for a CPU
+ * @group_id: the clamp group to check
+ *
+ * A clamp group affects a CPU if it has at least one "active" task.
+ *
+ * Return: true if the specified CPU has at least one active task for
+ *         the specified clamp group.
+ */
+static inline bool uclamp_group_active(struct uclamp_cpu *uc_cpu, int group_id)
+{
+	return uc_cpu->group[group_id].tasks > 0;
+}
+#else
+struct uclamp_cpu { };
 #endif /* CONFIG_UTIL_CLAMP */
 
 extern void free_fair_sched_group(struct task_group *tg);
@@ -771,6 +845,9 @@ struct rq {
 	unsigned long cpu_capacity;
 	unsigned long cpu_capacity_orig;
 
+	/* util_{min,max} clamp values based on CPU's active tasks */
+	struct uclamp_cpu uclamp[UCLAMP_CNT];
+
 	struct callback_head *balance_callback;
 
 	unsigned char idle_balance;
-- 
2.14.1


* [RFCv4 4/6] sched/core: sync task_group's with CPU's clamp groups
  2017-08-24 18:08 [RFCv4 0/6] Add utilization clamping to the CPU controller Patrick Bellasi
                   ` (2 preceding siblings ...)
  2017-08-24 18:08 ` [RFCv4 3/6] sched/core: reference count active tasks's " Patrick Bellasi
@ 2017-08-24 18:08 ` Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 5/6] cpufreq: schedutil: add util clamp for FAIR tasks Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 6/6] cpufreq: schedutil: add util clamp for RT/DL tasks Patrick Bellasi
  5 siblings, 0 replies; 9+ messages in thread
From: Patrick Bellasi @ 2017-08-24 18:08 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
	Andres Oportus, Joel Fernandes, Viresh Kumar

The util_{min,max} clamp values for a task group are usually updated from
a user-space process (slow-path) but they require synchronization with
the clamp group reference counters maintained by the scheduler (fast-path).

Indeed, each time the clamp value of a task group is changed, the old
and new clamp groups have to be updated for each CPU containing a
RUNNABLE task belonging to this task group. Non-RUNNABLE tasks are not
updated since they will be enqueued with the proper clamp group index at
their next activation.

To properly update the clamp group reference counters of RUNNABLE tasks,
we use the same locking schema used by __set_cpus_allowed_ptr(). This
might lock the (previous) RQ of a !RUNNABLE task, but that's the price to
pay to safely serialize util_{min,max} updates with RQ enqueue, dequeue
and migration operations.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/core.c  | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h | 21 +++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba31bb4e14c7..e4ce25dbad6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -754,6 +754,12 @@ static void set_load_weight(struct task_struct *p)
 #ifdef CONFIG_UTIL_CLAMP
 /**
  * uclamp_mutex: serialize updates of TG's utilization clamp values
+ *
+ * A task group's utilization clamp value update is usually triggered from a
+ * user-space process (slow-path) but it requires synchronization with the
+ * scheduler's (fast-path) enqueue/dequeue operations.
+ * While the fast-path synchronization is protected by RQ spinlocks, this
+ * mutex ensures that user-space requests are served sequentially.
  */
 static DEFINE_MUTEX(uclamp_mutex);
 
@@ -1022,6 +1028,52 @@ static inline void uclamp_cpu_put(struct task_struct *p, int cpu, int clamp_id)
 		uclamp_cpu_update(cpu, clamp_id);
 }
 
+/**
+ * uclamp_task_update_active: update the clamp group of a RUNNABLE task
+ * @p: the task whose clamp groups must be updated
+ * @clamp_id: the clamp index to consider
+ * @group_id: the clamp group to update
+ *
+ * Each time the clamp value of a task group is changed, the old and new clamp
+ * groups have to be updated for each CPU containing a RUNNABLE task belonging
+ * to this task group. Sleeping tasks are not updated since they will be
+ * enqueued with the proper clamp group index at their next activation.
+ */
+static inline void
+uclamp_task_update_active(struct task_struct *p, int clamp_id, int group_id)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	/*
+	 * Lock the task and the CPU where the task is (or was) queued.
+	 *
+	 * We might lock the (previous) RQ of a !RUNNABLE task, but that's the
+	 * price to pay to safely serialize util_{min,max} updates with
+	 * enqueues, dequeues and migration operations.
+	 * This is the same locking schema used by __set_cpus_allowed_ptr().
+	 */
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * The setting of the clamp group is serialized by task_rq_lock().
+	 * Thus, if the task's task_struct is not referencing a valid group
+	 * index, then that task is not yet RUNNABLE or it's going to be
+	 * enqueued with the proper clamp group value.
+	 */
+	if (!uclamp_task_active(p))
+		goto done;
+
+	/* Release p's currently referenced clamp group */
+	uclamp_cpu_put(p, task_cpu(p), clamp_id);
+
+	/* Get p's new clamp group */
+	uclamp_cpu_get(p, task_cpu(p), clamp_id);
+
+done:
+	task_rq_unlock(rq, p, &rf);
+}
+
 /**
  * uclamp_group_put: decrease the reference count for a clamp group
  * @clamp_id: the clamp index which was affected by a task group
@@ -1070,6 +1122,8 @@ static inline int uclamp_group_get(struct cgroup_subsys_state *css,
 	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
 	int prev_group_id = uc_tg->group_id;
 	int next_group_id = UCLAMP_NONE;
+	struct css_task_iter it;
+	struct task_struct *p;
 	unsigned long flags;
 
 	/* Lookup for a usable utilization clamp group */
@@ -1091,6 +1145,18 @@ static inline int uclamp_group_get(struct cgroup_subsys_state *css,
 	uc_map[next_group_id].tg_count += 1;
 	raw_spin_unlock_irqrestore(&uc_map[next_group_id].tg_lock, flags);
 
+	/* A newly created TG doesn't have tasks assigned */
+	if (!css)
+		goto release;
+
+	/* Update clamp groups for RUNNABLE tasks in this TG */
+	css_task_iter_start(css, &it);
+	while ((p = css_task_iter_next(&it)))
+		uclamp_task_update_active(p, clamp_id, next_group_id);
+	css_task_iter_end(&it);
+
+release:
+
 	/* Release the previous clamp group */
 	uclamp_group_put(clamp_id, prev_group_id);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b0f17c19c0f6..164a8ac152b3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -462,6 +462,27 @@ static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
 	return (task_group_id != UCLAMP_NONE);
 }
 
+/**
+ * uclamp_task_active: check if a task is currently clamping a CPU
+ * @p: the task to check
+ *
+ * A task affects the utilization clamp of a CPU if it references a valid
+ * clamp group index for at least one clamp index.
+ *
+ * Return: true if p is currently clamping the utilization of its CPU.
+ */
+static inline bool uclamp_task_active(struct task_struct *p)
+{
+	int clamp_id;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+		if (uclamp_task_affects(p, clamp_id))
+			return true;
+	}
+
+	return false;
+}
+
 /**
  * uclamp_group_active: check if a clamp group is active on a CPU
  * @uc_cpu: the array of clamp groups for a CPU
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFCv4 5/6] cpufreq: schedutil: add util clamp for FAIR tasks
  2017-08-24 18:08 [RFCv4 0/6] Add utilization clamping to the CPU controller Patrick Bellasi
                   ` (3 preceding siblings ...)
  2017-08-24 18:08 ` [RFCv4 4/6] sched/core: sync task_group's with CPU's " Patrick Bellasi
@ 2017-08-24 18:08 ` Patrick Bellasi
  2017-08-24 18:08 ` [RFCv4 6/6] cpufreq: schedutil: add util clamp for RT/DL tasks Patrick Bellasi
  5 siblings, 0 replies; 9+ messages in thread
From: Patrick Bellasi @ 2017-08-24 18:08 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
	Andres Oportus, Joel Fernandes, Viresh Kumar

Each time schedutil is required to update the frequency of a CPU, we must
honor the util_{min,max} constraints enforced on that CPU by its set of
currently RUNNABLE tasks.

This patch adds the required support to clamp the utilization generated
by FAIR tasks within the boundaries defined by their aggregated
utilization clamp constraints.
The clamped utilization is then used to select the frequency thus
allowing, for example, to:
 - boost tasks which are directly affecting the user experience
   by running them at least at a minimum "required" frequency
 - cap low priority tasks not directly affecting the user experience
   by running them only up to a maximum "allowed" frequency

The default values for boosting and capping are defined to be:
 - util_min: 0
 - util_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced.
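
As a purely illustrative aside (not part of this patch: the numbers, the
helpers below and the ~1.25 margin used as a rough model of get_next_freq()
are all assumptions), a util_min boost changes the requested frequency
roughly as follows:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* Clamp a utilization value into the [lo, hi] range */
static unsigned long clamp_between(unsigned long util,
				   unsigned long lo, unsigned long hi)
{
	if (util < lo)
		return lo;
	return util > hi ? hi : util;
}

/* Rough model of schedutil's request: ~1.25 * max_freq * util / capacity */
static unsigned long next_freq(unsigned long max_freq_khz, unsigned long util)
{
	return (max_freq_khz + (max_freq_khz >> 2)) * util / SCHED_CAPACITY_SCALE;
}

int main(void)
{
	unsigned long util = 100;		/* small util_avg */
	unsigned long max_freq = 2000000;	/* 2 GHz, in kHz */

	/* unclamped: ~244 MHz requested */
	printf("unclamped: %lu kHz\n", next_freq(max_freq, util));

	/* util_min = 512: the request is boosted to ~1.25 GHz */
	printf("boosted:   %lu kHz\n",
	       next_freq(max_freq, clamp_between(util, 512, SCHED_CAPACITY_SCALE)));

	return 0;
}

With the default clamp values the clamp operation is a no-op and the
selected frequency is unchanged.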

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/cpufreq_schedutil.c | 33 ++++++++++++++++++++++
 kernel/sched/sched.h             | 60 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 29a397067ffa..f67c26bbade4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -231,6 +231,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	} else {
 		sugov_get_util(&util, &max);
 		sugov_iowait_boost(sg_cpu, &util, &max);
+		util = uclamp_util(smp_processor_id(), util);
 		next_f = get_next_freq(sg_policy, util, max);
 		/*
 		 * Do not reduce the frequency if the CPU has not been idle
@@ -246,9 +247,18 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 {
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	struct cpufreq_policy *policy = sg_policy->policy;
+	unsigned long max_util, min_util;
 	unsigned long util = 0, max = 1;
 	unsigned int j;
 
+	/* Initialize clamp values based on caller CPU constraints */
+	if (uclamp_enabled) {
+		int cpu = smp_processor_id();
+
+		max_util = uclamp_value(cpu, UCLAMP_MAX);
+		min_util = uclamp_value(cpu, UCLAMP_MIN);
+	}
+
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
 		unsigned long j_util, j_max;
@@ -277,8 +287,31 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 		}
 
 		sugov_iowait_boost(j_sg_cpu, &util, &max);
+
+		/*
+		 * Update clamping range based on CPU j's constraints, but only
+		 * if active. Idle CPUs do not enforce constraints in a shared
+		 * frequency domain.
+		 */
+		if (uclamp_enabled && !idle_cpu(j)) {
+			unsigned long j_max_util, j_min_util;
+
+			j_max_util = uclamp_value(j, UCLAMP_MAX);
+			j_min_util = uclamp_value(j, UCLAMP_MIN);
+
+			/*
+			 * Clamp values are MAX aggregated among all the
+			 * different CPUs in the shared frequency domain.
+			 */
+			max_util = max(max_util, j_max_util);
+			min_util = max(min_util, j_min_util);
+		}
 	}
 
+	/* Clamp utilization based on aggregated uclamp constraints */
+	if (uclamp_enabled)
+		util = clamp(util, min_util, max_util);
+
 	return get_next_freq(sg_policy, util, max);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 164a8ac152b3..4a235c4a0762 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2224,6 +2224,66 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */
 
+#ifdef CONFIG_UTIL_CLAMP
+/* Enable clamping code at compile time by constant propagation */
+#define uclamp_enabled true
+
+/**
+ * uclamp_value: get the current CPU's utilization clamp value
+ * @cpu: the CPU to consider
+ * @clamp_id: the utilization clamp index (i.e. min or max utilization)
+ *
+ * The utilization clamp value for a CPU depends on its set of currently
+ * active tasks and their specific util_{min,max} constraints.
+ * A max aggregated value is tracked for each CPU and returned by this
+ * function. An IDLE CPU never enforces a clamp value.
+ *
+ * Return: the current value for the specified CPU and clamp index
+ */
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+	struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+	int clamp_value = uclamp_none(clamp_id);
+
+	/* Use the clamp value currently enforced on this CPU, if any */
+	if (uc_cpu->value != UCLAMP_NONE)
+		clamp_value = uc_cpu->value;
+
+	return clamp_value;
+}
+
+/**
+ * uclamp_util: clamp a utilization value for a specified CPU
+ * @cpu: the CPU to get the clamp values from
+ * @util: the utilization signal to clamp
+ *
+ * Each CPU tracks util_{min,max} clamp values depending on the set of its
+ * currently active tasks. Given a utilization signal, i.e. a signal in the
+ * [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped
+ * utilization signal considering the current clamp values for the
+ * specified CPU.
+ *
+ * Return: a clamped utilization signal for a given CPU.
+ */
+static inline int uclamp_util(unsigned int cpu, unsigned int util)
+{
+	unsigned int min_util = uclamp_value(cpu, UCLAMP_MIN);
+	unsigned int max_util = uclamp_value(cpu, UCLAMP_MAX);
+
+	return clamp(util, min_util, max_util);
+}
+#else
+/* Disable clamping code at compile time by constant propagation */
+#define uclamp_enabled false
+#define uclamp_util(cpu, util) util
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+	if (clamp_id == UCLAMP_MIN)
+		return 0;
+	return SCHED_CAPACITY_SCALE;
+}
+#endif /* CONFIG_UTIL_CLAMP */
+
 #ifdef arch_scale_freq_capacity
 #ifndef arch_scale_freq_invariant
 #define arch_scale_freq_invariant()	(true)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFCv4 6/6] cpufreq: schedutil: add util clamp for RT/DL tasks
  2017-08-24 18:08 [RFCv4 0/6] Add utilization clamping to the CPU controller Patrick Bellasi
                   ` (4 preceding siblings ...)
  2017-08-24 18:08 ` [RFCv4 5/6] cpufreq: schedutil: add util clamp for FAIR tasks Patrick Bellasi
@ 2017-08-24 18:08 ` Patrick Bellasi
  5 siblings, 0 replies; 9+ messages in thread
From: Patrick Bellasi @ 2017-08-24 18:08 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
	Andres Oportus, Joel Fernandes, Viresh Kumar

Currently schedutil enforces the maximum frequency whenever RT/DL tasks are
RUNNABLE. Such a mandatory policy can be made more tunable from userspace,
thus allowing, for example, to define a maximum frequency which is still
reasonable for the execution of a specific RT/DL workload. This will
contribute to making the RT class more friendly for power/energy sensitive
use-cases.

This patch extends the usage of util_{min,max} to the RT/DL classes.
Whenever a task in these classes is RUNNABLE, the util required is
defined by the constraints of the CPU control group the task belongs to.
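
As a hypothetical example (the value 512 below and the resulting ratio are
made up for illustration; uclamp_util() is the helper introduced earlier in
this series):

	/* RT/DL branch, assuming the task's group sets util_max = 512 */
	util = uclamp_util(smp_processor_id(), SCHED_CAPACITY_SCALE); /* 1024 -> 512 */
	/*
	 * Conceptually, the request then scales with util/SCHED_CAPACITY_SCALE,
	 * i.e. roughly 1.25 * 512/1024 ~= 62% of the maximum frequency with
	 * schedutil's usual margin, instead of being pinned to
	 * policy->cpuinfo.max_freq.
	 */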

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/cpufreq_schedutil.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index f67c26bbade4..feca60c107bc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -227,7 +227,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	busy = sugov_cpu_is_busy(sg_cpu);
 
 	if (flags & SCHED_CPUFREQ_RT_DL) {
-		next_f = policy->cpuinfo.max_freq;
+		util = uclamp_util(smp_processor_id(), SCHED_CAPACITY_SCALE);
+		next_f = (uclamp_enabled && util < SCHED_CAPACITY_SCALE)
+			? get_next_freq(sg_policy, util, policy->cpuinfo.max_freq)
+			: policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max);
 		sugov_iowait_boost(sg_cpu, &util, &max);
@@ -276,10 +279,15 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 			j_sg_cpu->iowait_boost = 0;
 			continue;
 		}
-		if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
-			return policy->cpuinfo.max_freq;
 
-		j_util = j_sg_cpu->util;
+		if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) {
+			if (!uclamp_enabled)
+				return policy->cpuinfo.max_freq;
+			j_util = uclamp_util(j, SCHED_CAPACITY_SCALE);
+		} else {
+			j_util = j_sg_cpu->util;
+		}
+
 		j_max = j_sg_cpu->max;
 		if (j_util * max > j_max * util) {
 			util = j_util;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFCv4 1/6] sched/core: add utilization clamping to CPU controller
  2017-08-24 18:08 ` [RFCv4 1/6] sched/core: add utilization clamping to " Patrick Bellasi
@ 2017-08-28 18:23   ` Tejun Heo
  2017-09-04 17:25     ` Patrick Bellasi
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2017-08-28 18:23 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Tim Murray,
	Todd Kjos, Andres Oportus, Joel Fernandes, Viresh Kumar

Hello,

No idea whether this makes sense overall.  I'll just comment on the
cgroup interface part.

On Thu, Aug 24, 2017 at 07:08:52PM +0100, Patrick Bellasi wrote:
> This patch extends the CPU controller by adding a couple of new attributes,
> util_min and util_max, which can be used to enforce frequency boosting and
> capping. Specifically:
> 
> - util_min: defines the minimum CPU utilization which should be considered,
> 	    e.g. when  schedutil selects the frequency for a CPU while a
> 	    task in this group is RUNNABLE.
> 	    i.e. the task will run at least at a minimum frequency which
> 	         corresponds to the min_util utilization
> 
> - util_max: defines the maximum CPU utilization which should be considered,
> 	    e.g. when schedutil selects the frequency for a CPU while a
> 	    task in this group is RUNNABLE.
> 	    i.e. the task will run up to a maximum frequency which
> 	         corresponds to the max_util utilization

I'm not sure min/max are the right names here.  min/low/high/max are
used to designate guarantees and limits on resources and the above is
more restricting the range of an attribute.  I'll think more about
what'd be better names here.

> These attributes:
> a) are tunable at all hierarchy levels, i.e. at root group level too, thus
>    allowing to define minimum and maximum frequency constraints for all
>    otherwise non-classified tasks (e.g. autogroups)

The problem with doing the above is two-fold.

1. The feature becomes inaccessible without cgroup even though it
   doesn't have much to do with cgroup at system level.

2. For the above and other historical reasons, most other features
   have a separate way to configure at the system level.

I think it'd be better to keep the root level control outside cgroup.

> b) allow to create subgroups of tasks which are not violating the
>    utilization constraints defined by the parent group.

The problem with doing the above is that it ties the configs of a
cgroup with its ancestors and that gets weird across delegation
boundaries.  Other resource knobs don't behave this way - a descendant
cgroup can have any memory.low/high/max values and an ancestor
changing config doesn't destory its descendants' configs.  Please
follow the same convention.

> Tasks on a subgroup can only be more boosted and/or capped, which is
> matching with the "limits" schema proposed by the "Resource Distribution
> Model (RDM)" suggested by the CGroups v2 documentation:
>    Documentation/cgroup-v2.txt

So, the guarantee side (min / low) shouldn't allow the descendants to
have more.  ie. if memory.low is 512M at the parent, its children can
never have more than 512M of low protection.  Given that "boosting"
means more CPU consumption, I think it'd make more sense to follow
such semantics - ie. a descendant cannot have higher boosting than the
lowest of its ancestors.
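
(Schematically, and only as a sketch of that semantic rather than anything
in this series, the boost a group effectively gets would be capped by every
ancestor on its path:

	/* hypothetical helper: own request capped by the parent's effective value */
	static unsigned int effective_util_min(unsigned int own,
					       unsigned int parent_effective)
	{
		return own < parent_effective ? own : parent_effective;
	}

so a descendant could never end up more boosted than the lowest of its
ancestors.)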

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFCv4 1/6] sched/core: add utilization clamping to CPU controller
  2017-08-28 18:23   ` Tejun Heo
@ 2017-09-04 17:25     ` Patrick Bellasi
  0 siblings, 0 replies; 9+ messages in thread
From: Patrick Bellasi @ 2017-09-04 17:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Tim Murray,
	Todd Kjos, Andres Oportus, Joel Fernandes, Viresh Kumar

On 28-Aug 11:23, Tejun Heo wrote:
> Hello,

Hi Teo,

> No idea whether this makes sense overall.  I'll just comment on the
> cgroup interface part.

Thanks for the feedback, some comments follow inline...


> On Thu, Aug 24, 2017 at 07:08:52PM +0100, Patrick Bellasi wrote:
> > This patch extends the CPU controller by adding a couple of new attributes,
> > util_min and util_max, which can be used to enforce frequency boosting and
> > capping. Specifically:
> > 
> > - util_min: defines the minimum CPU utilization which should be considered,
> > 	    e.g. when  schedutil selects the frequency for a CPU while a
> > 	    task in this group is RUNNABLE.
> > 	    i.e. the task will run at least at a minimum frequency which
> > 	         corresponds to the min_util utilization
> > 
> > - util_max: defines the maximum CPU utilization which should be considered,
> > 	    e.g. when schedutil selects the frequency for a CPU while a
> > 	    task in this group is RUNNABLE.
> > 	    i.e. the task will run up to a maximum frequency which
> > 	         corresponds to the max_util utilization
> 
> I'm not sure min/max are the right names here.  min/low/high/max are
> used to designate guarantees and limits on resources and the above is
> more restricting the range of an attribute.  I'll think more about
> what'd be better names here.

You're right, these are used mainly for range restrictions, but still:
- utilization is the measure of a resource, i.e. the CPU bandwidth
- to a certain extent we are still using them to designate a guarantee,
  i.e. util_min guarantees that tasks will not run below a minimum
  frequency, while util_max guarantees that a task will never run (when
  alone on a CPU) above a certain frequency.

If this is still considered too weak a definition of guarantees,
what about something like util_{lower,upper}_bound?

> > These attributes:
> > a) are tunable at all hierarchy levels, i.e. at root group level too, thus
> >    allowing to define minimum and maximum frequency constraints for all
> >    otherwise non-classified tasks (e.g. autogroups)
> 
> The problem with doing the above is two-fold.
> 
> 1. The feature becomes inaccessible without cgroup even though it
>    doesn't have much to do with cgroup at system level.

As I commented in the cover letter, we currently use CGroups as the
only interface just because, so far, the sensible use-cases we have
identified all require cgroups.

Android needs to classify tasks depending on their role in the system,
to allocate them different resources depending on the run-time
scenario. Thus, cgroups are just the most natural interface to extend
to get frequency boosting/capping support.

Not to mention the (at best incomplete) CPU bandwidth controller
interface, which is currently defined just in terms of "elapsed time",
without accounting for the actual amount of computation performed on
systems where the frequency can change dynamically.

Nevertheless, the internal implementation allows for a different
(primary) interface whenever that should be required.

> 2. For the above and other historical reasons, most other features
>    have a separate way to configure at the system level.
>
> I think it'd be better to keep the root level control outside cgroup.

For this specific feature, the system-level configuration using the
root control group allows defining the "default" behavior for tasks
not otherwise classified.

Considering also my comment on point 1 above, having a different API
for the system tuning would make the implementation more complex
without real benefits.


> > b) allow to create subgroups of tasks which are not violating the
> >    utilization constraints defined by the parent group.
> 
> The problem with doing the above is that it ties the configs of a
> cgroup with its ancestors and that gets weird across delegation
> boundaries.  Other resource knobs don't behave this way - a descendant
> cgroup can have any memory.low/high/max values and an ancestor
> changing config doesn't destroy its descendants' configs.  Please
> follow the same convention.

In this implementation too, an ancestor's config change cannot destroy
its descendants' configs.

For example, if we have:

  group1/util_min = 10
  group1/child1/util_min = 20

we cannot set:

  group1/util_min = 30

Right now we just fail, since this would produce an inversion in the
parent/child constraint relationship (see below).

The right operations would be:

  group1/child1/util_min = 30, or more, and only after:
  group1/util_min = 30


> > Tasks on a subgroup can only be more boosted and/or capped, which is
> > matching with the "limits" schema proposed by the "Resource Distribution
> > Model (RDM)" suggested by the CGroups v2 documentation:
> >    Documentation/cgroup-v2.txt
> 
> So, the guarantee side (min / low) shouldn't allow the descendants to
> have more.  ie. if memory.low is 512M at the parent, its children can
> never have more than 512M of low protection.

Does that not mean, more generically, that children are not allowed to
have "more relaxed" constraints than their parents?
IOW, children can only be more constrained.

Here we are applying exactly the same rule; what changes is just the
definition of a "more relaxed" constraint.

For example, for frequency boosting: if a parent is 10% boosted,
then its children are not allowed to have a lower boost value, because
this would relax their parent's boost constraint (see below).

> Given that "boosting"
> means more CPU consumption, I think it'd make more sense to follow
> such semantics - ie. a descendant cannot have higher boosting than the
> lowest of its ancestors.

From a functional standpoint, what we want to avoid is that, by
lowering the boost value of children, we can (indirectly) affect the
"performance" of their ancestors.

That's why a more restrictive constraint implies that a child is
allowed only a boost value higher than the highest of its ancestors.

For frequency capping, instead, the logic is the opposite. In that case
the optimization goal is to constrain the maximum frequency, for
example to save energy. Thus, children are only allowed to set lower
util_max values.
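
Schematically (a sketch of the rule just described, with a hypothetical
helper name, not code from this series), a child's configuration is
accepted only if it is not more relaxed than its parent's:

	static bool child_clamps_valid(unsigned int parent_min, unsigned int parent_max,
				       unsigned int child_min, unsigned int child_max)
	{
		/* util_min: a child may only be more boosted than its parent */
		/* util_max: a child may only be more capped than its parent  */
		return child_min >= parent_min && child_max <= parent_max;
	}

which is also why, in the example above, group1/util_min can be raised to
30 only after group1/child1/util_min has been raised to at least 30.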


> Thanks.
> 
> --
> tejun

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 9+ messages in thread
