Linux-PM Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v11 0/5] Add utilization clamping support (CGroups API)
@ 2019-07-08  8:43 Patrick Bellasi
  2019-07-08  8:43 ` [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller Patrick Bellasi
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-08  8:43 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

Hi all, this is a follow up of:

  https://lore.kernel.org/lkml/20190621084217.8167-1-patrick.bellasi@arm.com/

to respin all the bits not yet queued by PeterZ and addressing all Tejun's
requests from previous review:

 - remove checks for cpu_uclamp_{min,max}_write()s from root group
 - remove checks on "protections" being smaller then "limits"
 - rephrase uclamp extension description to avoid explicit
   mentioning of the bandwidth concept

the series is based on top of:

   tj/cgroup.git	for-5.3
   tip/tip.git		sched/core

I hope this version covers all major details about the expected behavior
and delegation model. The code however can still benefit from a better
review, looking forward for any additional feedback.

Cheers,
Patrick


Series Organization
===================

This series contains just the remaining bits of the original posting:

 - Patches [0-5]: Per task group (secondary) API

and it is based on today's tj/cgroup/for-5.3 and tip/sched/core.

The full tree is available here:

   git://linux-arm.org/linux-pb.git   lkml/utilclamp_v11
   http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v11

where you can also get the patches already queued in tip/sched/core

 - Patches [01-07]: Per task (primary) API
 - Patches [08-09]: Schedutil integration for FAIR and RT tasks
 - Patches [10-11]: Integration with EAS's energy_compute()


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [1].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviors of each task as intended by user-space.
As such they are not ideally suited for task placement decisions.

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorized user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism which allows to "clamp" (i.e. filter) the
utilization generated by RT and FAIR tasks within a range defined by user-space.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.

The main use-cases for utilization clamping are:

 - boosting: better interactive response for small tasks which
   are affecting the user experience.

   Consider for example the case of a small control thread for an external
   accelerator (e.g. GPU, DSP, other devices). Here, from the task utilization
   the scheduler does not have a complete view of what the task's requirements
   are and, if it's a small utilization task, it keeps selecting a more energy
   efficient CPU, with smaller capacity and lower frequency, thus negatively
   impacting the overall time required to complete task activations.

 - capping: increase energy efficiency for background tasks not affecting the
   user experience.

   Since running on a lower capacity CPU at a lower frequency is more energy
   efficient, when the completion time is not a main goal, then capping the
   utilization considered for certain (maybe big) tasks can have positive
   effects, both on energy consumption and thermal headroom.
   This feature allows also to make RT tasks more energy friendly on mobile
   systems where running them on high capacity CPUs and at the maximum
   frequency is not required.

From these two use-cases, it's worth noticing that frequency selection
biasing, introduced by patches 9 and 10 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is in helping the scheduler in making tasks placement decisions.

Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

 - boosting: try to run small/foreground tasks on higher-capacity CPUs to
   complete them faster despite being less energy efficient.

 - capping: try to run big/background tasks on low-capacity CPUs to save power
   and thermal headroom for more important tasks

This series does not present this additional usage of utilization clamping but
it's an integral part of the EAS feature set, where [2] is one of its main
components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] Energy Aware Scheduling
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
    Linux Plumbers Conference 2018
    https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (5):
  sched/core: uclamp: Extend CPU's cgroup controller
  sched/core: uclamp: Propagate parent clamps
  sched/core: uclamp: Propagate system defaults to root group
  sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  sched/core: uclamp: Update CPU's refcount on TG's clamp changes

 Documentation/admin-guide/cgroup-v2.rst |  30 +++
 init/Kconfig                            |  22 ++
 kernel/sched/core.c                     | 335 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   8 +
 4 files changed, 392 insertions(+), 3 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller
  2019-07-08  8:43 [PATCH v11 0/5] Add utilization clamping support (CGroups API) Patrick Bellasi
@ 2019-07-08  8:43 ` Patrick Bellasi
  2019-07-08 11:08   ` Quentin Perret
  2019-07-18 14:52   ` Tejun Heo
  2019-07-08  8:43 ` [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps Patrick Bellasi
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-08  8:43 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better defined the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run at least at a
	      	 minimum frequency which corresponds to the uclamp.min
	      	 utilization

- uclamp.max: defines the maximum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run up to a
	      	 maximum frequency which corresponds to the uclamp.max
	      	 utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depends on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing uclamp.min < uclamp.max.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v11:
 Message-ID: <20190624175215.GR657710@devbig004.ftw2.facebook.com>
 - remove checks for cpu_uclamp_{min,max}_write() from root group
 - remove enforcement for "protection" being smaller than "limits"
 - rephrase uclamp extension description to avoid explicit
   mentioning of the bandwidth concept
---
 Documentation/admin-guide/cgroup-v2.rst |  30 +++++
 init/Kconfig                            |  22 ++++
 kernel/sched/core.c                     | 161 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   6 +
 4 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a5c845338d6d..1d49426b4c1e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware that system management software may already
@@ -1016,6 +1023,29 @@ All time durations are in microseconds.
 	Shows pressure stall information for CPU. See
 	Documentation/accounting/psi.txt for details.
 
+  cpu.uclamp.min
+        A read-write single value file which exists on non-root cgroups.
+        The default is "0", i.e. no utilization boosting.
+
+        The requested minimum utilization as a percentage rational number,
+        e.g. 12.34 for 12.34%.
+
+        This interface allows reading and setting minimum utilization clamp
+        values similar to the sched_setattr(2). This minimum utilization
+        value is used to clamp the task specific minimum utilization clamp.
+
+  cpu.uclamp.max
+        A read-write single value file which exists on non-root cgroups.
+        The default is "max". i.e. no utilization capping
+
+        The requested maximum utilization as a percentage rational number,
+        e.g. 98.76 for 98.76%.
+
+        This interface allows reading and setting maximum utilization clamp
+        values similar to the sched_setattr(2). This maximum utilization
+        value is used to clamp the task specific maximum utilization clamp.
+
+
 
 Memory
 ------
diff --git a/init/Kconfig b/init/Kconfig
index bf96faf3fe43..68a21188786c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -903,6 +903,28 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config UCLAMP_TASK_GROUP
+	bool "Utilization clamping per group of tasks"
+	depends on CGROUP_SCHED
+	depends on UCLAMP_TASK
+	default n
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+	  When this option is enabled, the user can specify a min and max
+	  CPU bandwidth which is allowed for each single task in a group.
+	  The max bandwidth allows to clamp the maximum frequency a task
+	  can use, while the min bandwidth allows to define a minimum
+	  frequency a task will always use.
+
+	  When task group based utilization clamping is enabled, an eventually
+	  specified task-specific clamp value is constrained by the cgroup
+	  specified clamp value. Both minimum and maximum task clamping cannot
+	  be bigger than the corresponding clamping defined at task group level.
+
+	  If in doubt, say N.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..17ebdaaf7cd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1149,8 +1149,12 @@ static void __init init_uclamp(void)
 
 	/* System defaults allow max clamp values for both indexes */
 	uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
-	for_each_clamp_id(clamp_id)
+	for_each_clamp_id(clamp_id) {
 		uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+		root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+	}
 }
 
 #else /* CONFIG_UCLAMP_TASK */
@@ -6725,6 +6729,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+					    struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	int clamp_id;
+
+	for_each_clamp_id(clamp_id) {
+		uclamp_se_set(&tg->uclamp_req[clamp_id],
+			      uclamp_none(clamp_id), false);
+	}
+#endif
+}
+
 static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
@@ -6748,6 +6765,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	alloc_uclamp_sched_group(tg, parent);
+
 	return tg;
 
 err:
@@ -6968,6 +6987,118 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static inline int uclamp_scale_from_percent(char *buf, u64 *value)
+{
+	*value = SCHED_CAPACITY_SCALE;
+
+	buf = strim(buf);
+	if (strncmp("max", buf, 4)) {
+		s64 percent;
+		int ret;
+
+		ret = cgroup_parse_float(buf, 2, &percent);
+		if (ret)
+			return ret;
+
+		percent <<= SCHED_CAPACITY_SHIFT;
+		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
+	}
+
+	return 0;
+}
+
+static inline u64 uclamp_percent_from_scale(u64 value)
+{
+	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
+}
+
+static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes,
+				    loff_t off)
+{
+	struct task_group *tg;
+	u64 min_value;
+	int ret;
+
+	ret = uclamp_scale_from_percent(buf, &min_value);
+	if (ret)
+		return ret;
+	if (min_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(of_css(of));
+	if (tg->uclamp_req[UCLAMP_MIN].value != min_value)
+		uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
+
+	rcu_read_unlock();
+
+	return nbytes;
+}
+
+static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes,
+				    loff_t off)
+{
+	struct task_group *tg;
+	u64 max_value;
+	int ret;
+
+	ret = uclamp_scale_from_percent(buf, &max_value);
+	if (ret)
+		return ret;
+	if (max_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(of_css(of));
+	if (tg->uclamp_req[UCLAMP_MAX].value != max_value)
+		uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
+
+	rcu_read_unlock();
+
+	return nbytes;
+}
+
+static inline void cpu_uclamp_print(struct seq_file *sf,
+				    enum uclamp_id clamp_id)
+{
+	struct task_group *tg;
+	u64 util_clamp;
+	u64 percent;
+	u32 rem;
+
+	rcu_read_lock();
+	tg = css_tg(seq_css(sf));
+	util_clamp = tg->uclamp_req[clamp_id].value;
+	rcu_read_unlock();
+
+	if (util_clamp == SCHED_CAPACITY_SCALE) {
+		seq_puts(sf, "max\n");
+		return;
+	}
+
+	percent = uclamp_percent_from_scale(util_clamp);
+	percent = div_u64_rem(percent, 100, &rem);
+	seq_printf(sf, "%llu.%u\n", percent, rem);
+}
+
+static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
+{
+	cpu_uclamp_print(sf, UCLAMP_MIN);
+	return 0;
+}
+
+static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
+{
+	cpu_uclamp_print(sf, UCLAMP_MAX);
+	return 0;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7312,6 +7443,20 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "uclamp.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_min_show,
+		.write = cpu_uclamp_min_write,
+	},
+	{
+		.name = "uclamp.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_max_show,
+		.write = cpu_uclamp_max_write,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7479,6 +7624,20 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "uclamp.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_min_show,
+		.write = cpu_uclamp_min_write,
+	},
+	{
+		.name = "uclamp.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_uclamp_max_show,
+		.write = cpu_uclamp_max_write,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..3723037ea80d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -393,6 +393,12 @@ struct task_group {
 #endif
 
 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	/* Clamp values requested for a task group */
+	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+#endif
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.21.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps
  2019-07-08  8:43 [PATCH v11 0/5] Add utilization clamping support (CGroups API) Patrick Bellasi
  2019-07-08  8:43 ` [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller Patrick Bellasi
@ 2019-07-08  8:43 ` Patrick Bellasi
  2019-07-15 16:42   ` Michal Koutný
  2019-07-08  8:43 ` [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group Patrick Bellasi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-08  8:43 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection"
(i.e. min utilization) with the corresponding "limit" (i.e. max
utilization). Do that at effective clamps propagation to ensure all
user-space write never fails while still always tracking the most
restrictive values.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v11:
 Message-ID: <20190624174607.GQ657710@devbig004.ftw2.facebook.com>
 - Removed user-space uclamp.{min.max}.effective API
 - Ensure group limits always clamps group protections
---
 kernel/sched/core.c  | 65 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 67 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17ebdaaf7cd9..ec91f4518752 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -1137,6 +1149,8 @@ static void __init init_uclamp(void)
 	unsigned int clamp_id;
 	int cpu;
 
+	mutex_init(&uclamp_mutex);
+
 	for_each_possible_cpu(cpu) {
 		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
 		cpu_rq(cpu)->uclamp_flags = 0;
@@ -1153,6 +1167,7 @@ static void __init init_uclamp(void)
 		uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 		root_task_group.uclamp_req[clamp_id] = uc_max;
+		root_task_group.uclamp[clamp_id] = uc_max;
 #endif
 	}
 }
@@ -6738,6 +6753,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&tg->uclamp_req[clamp_id],
 			      uclamp_none(clamp_id), false);
+		tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
 	}
 #endif
 }
@@ -6988,6 +7004,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+	struct cgroup_subsys_state *top_css = css;
+	struct uclamp_se *uc_se = NULL;
+	unsigned int eff[UCLAMP_CNT];
+	unsigned int clamp_id;
+	unsigned int clamps;
+
+	css_for_each_descendant_pre(css, top_css) {
+		uc_se = css_tg(css)->parent
+			? css_tg(css)->parent->uclamp : NULL;
+
+		for_each_clamp_id(clamp_id) {
+			/* Assume effective clamps matches requested clamps */
+			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+			/* Cap effective clamps with parent's effective clamps */
+			if (uc_se &&
+			    eff[clamp_id] > uc_se[clamp_id].value) {
+				eff[clamp_id] = uc_se[clamp_id].value;
+			}
+		}
+		/* Ensure protection is always capped by limit */
+		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+		/* Propagate most restrictive effective clamps */
+		clamps = 0x0;
+		uc_se = css_tg(css)->uclamp;
+		for_each_clamp_id(clamp_id) {
+			if (eff[clamp_id] == uc_se[clamp_id].value)
+				continue;
+			uc_se[clamp_id].value = eff[clamp_id];
+			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+			clamps |= (0x1 << clamp_id);
+		}
+		if (!clamps)
+			css = css_rightmost_descendant(css);
+	}
+}
+
 static inline int uclamp_scale_from_percent(char *buf, u64 *value)
 {
 	*value = SCHED_CAPACITY_SCALE;
@@ -7027,13 +7082,18 @@ static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
 	if (min_value > SCHED_CAPACITY_SCALE)
 		return -ERANGE;
 
+	mutex_lock(&uclamp_mutex);
 	rcu_read_lock();
 
 	tg = css_tg(of_css(of));
 	if (tg->uclamp_req[UCLAMP_MIN].value != min_value)
 		uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
 
+	/* Update effective clamps to track the most restrictive value */
+	cpu_util_update_eff(of_css(of));
+
 	rcu_read_unlock();
+	mutex_unlock(&uclamp_mutex);
 
 	return nbytes;
 }
@@ -7052,13 +7112,18 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
 	if (max_value > SCHED_CAPACITY_SCALE)
 		return -ERANGE;
 
+	mutex_lock(&uclamp_mutex);
 	rcu_read_lock();
 
 	tg = css_tg(of_css(of));
 	if (tg->uclamp_req[UCLAMP_MAX].value != max_value)
 		uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
 
+	/* Update effective clamps to track the most restrictive value */
+	cpu_util_update_eff(of_css(of));
+
 	rcu_read_unlock();
+	mutex_unlock(&uclamp_mutex);
 
 	return nbytes;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3723037ea80d..8c3aefdaf0ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,8 @@ struct task_group {
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	/* Clamp values requested for a task group */
 	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+	/* Effective clamp values used for a task group */
+	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
 
 };
-- 
2.21.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group
  2019-07-08  8:43 [PATCH v11 0/5] Add utilization clamping support (CGroups API) Patrick Bellasi
  2019-07-08  8:43 ` [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller Patrick Bellasi
  2019-07-08  8:43 ` [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps Patrick Bellasi
@ 2019-07-08  8:43 ` Patrick Bellasi
  2019-07-15 16:42   ` Michal Koutný
  2019-07-08  8:43 ` [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps Patrick Bellasi
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-08  8:43 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping supports already the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allows a parent cgroup to constraint (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec91f4518752..276f9c2f6103 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,12 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 		uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+	struct task_group *tg = &root_task_group;
+
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+		      sysctl_sched_uclamp_util_min, false);
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+		      sysctl_sched_uclamp_util_max, false);
+
+	cpu_util_update_eff(&root_task_group.css);
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
 int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				void __user *buffer, size_t *lenp,
 				loff_t *ppos)
 {
-	int old_min, old_max;
+	bool update_root_tg = false;
 	static DEFINE_MUTEX(mutex);
+	int old_min, old_max;
 	int result;
 
 	mutex_lock(&mutex);
@@ -1044,12 +1062,17 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	if (old_min != sysctl_sched_uclamp_util_min) {
 		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
 			      sysctl_sched_uclamp_util_min, false);
+		update_root_tg = true;
 	}
 	if (old_max != sysctl_sched_uclamp_util_max) {
 		uclamp_se_set(&uclamp_default[UCLAMP_MAX],
 			      sysctl_sched_uclamp_util_max, false);
+		update_root_tg = true;
 	}
 
+	if (update_root_tg)
+		uclamp_update_root_tg();
+
 	/*
 	 * Updating all the RUNNABLE task is expensive, keep it simple and do
 	 * just a lazy update at each next enqueue time.
-- 
2.21.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  2019-07-08  8:43 [PATCH v11 0/5] Add utilization clamping support (CGroups API) Patrick Bellasi
                   ` (2 preceding siblings ...)
  2019-07-08  8:43 ` [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group Patrick Bellasi
@ 2019-07-08  8:43 ` Patrick Bellasi
  2019-07-15 16:42   ` Michal Koutný
  2019-07-08  8:43 ` [PATCH v11 5/5] sched/core: uclamp: Update CPU's refcount on TG's clamp changes Patrick Bellasi
  2019-07-15 16:51 ` [PATCH v11 0/5] Add utilization clamping support (CGroups API) Michal Koutný
  5 siblings, 1 reply; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-08  8:43 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. boosted not more than its TG effective protection and capped at
   least as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what enforced by their TG effective limits and protections

This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, only system
defaults are enforced.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 276f9c2f6103..2591a70c85cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	struct uclamp_se uc_max;
+
+	/*
+	 * Tasks in autogroups or root task group will be
+	 * restricted by system defaults.
+	 */
+	if (task_group_is_autogroup(task_group(p)))
+		return uc_req;
+	if (task_group(p) == &root_task_group)
+		return uc_req;
+
+	uc_max = task_group(p)->uclamp[clamp_id];
+	if (uc_req.value > uc_max.value || !uc_req.user_defined)
+		return uc_max;
+#endif
+
+	return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 {
-	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
 	struct uclamp_se uc_max = uclamp_default[clamp_id];
 
 	/* System default restrictions always apply */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v11 5/5] sched/core: uclamp: Update CPU's refcount on TG's clamp changes
  2019-07-08  8:43 [PATCH v11 0/5] Add utilization clamping support (CGroups API) Patrick Bellasi
                   ` (3 preceding siblings ...)
  2019-07-08  8:43 ` [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps Patrick Bellasi
@ 2019-07-08  8:43 ` Patrick Bellasi
  2019-07-15 16:51 ` [PATCH v11 0/5] Add utilization clamping support (CGroups API) Michal Koutný
  5 siblings, 0 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-08  8:43 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v11:
 Message-ID: <20190624174607.GQ657710@devbig004.ftw2.facebook.com>
 - Ensure group limits always clamps group protection
---
 kernel/sched/core.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2591a70c85cf..ddc5fcd4b9cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 		uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	/*
+	 * Lock the task and the rq where the task is (or was) queued.
+	 *
+	 * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+	 * price to pay to safely serialize util_{min,max} updates with
+	 * enqueues, dequeues and migration operations.
+	 * This is the same locking schema used by __set_cpus_allowed_ptr().
+	 */
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * Setting the clamp bucket is serialized by task_rq_lock().
+	 * If the task is not yet RUNNABLE and its task_struct is not
+	 * affecting a valid clamp bucket, the next time it's enqueued,
+	 * it will already see the updated clamp bucket value.
+	 */
+	if (!p->uclamp[clamp_id].active)
+		goto done;
+
+	uclamp_rq_dec_id(rq, p, clamp_id);
+	uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+	task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+			   unsigned int clamps)
+{
+	struct css_task_iter it;
+	struct task_struct *p;
+	unsigned int clamp_id;
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it))) {
+		for_each_clamp_id(clamp_id) {
+			if ((0x1 << clamp_id) & clamps)
+				uclamp_update_active(p, clamp_id);
+		}
+	}
+	css_task_iter_end(&it);
+}
+
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
@@ -7087,8 +7138,13 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
 			clamps |= (0x1 << clamp_id);
 		}
-		if (!clamps)
+		if (!clamps) {
 			css = css_rightmost_descendant(css);
+			continue;
+		}
+
+		/* Immediately update descendants RUNNABLE tasks */
+		uclamp_update_active_tasks(css, clamps);
 	}
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller
  2019-07-08  8:43 ` [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller Patrick Bellasi
@ 2019-07-08 11:08   ` Quentin Perret
  2019-07-15 13:38     ` Patrick Bellasi
  2019-07-18 14:52   ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: Quentin Perret @ 2019-07-08 11:08 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

Hi Patrick,

On Monday 08 Jul 2019 at 09:43:53 (+0100), Patrick Bellasi wrote:
> +static inline int uclamp_scale_from_percent(char *buf, u64 *value)
> +{
> +	*value = SCHED_CAPACITY_SCALE;
> +
> +	buf = strim(buf);
> +	if (strncmp("max", buf, 4)) {
> +		s64 percent;
> +		int ret;
> +
> +		ret = cgroup_parse_float(buf, 2, &percent);
> +		if (ret)
> +			return ret;
> +
> +		percent <<= SCHED_CAPACITY_SHIFT;
> +		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline u64 uclamp_percent_from_scale(u64 value)
> +{
> +	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
> +}

FWIW, I tried the patches and realized these conversions result in a
'funny' behaviour from a user's perspective. Things like this happen:

   $ echo 20 > cpu.uclamp.min
   $ cat cpu.uclamp.min
   20.2
   $ echo 20.2 > cpu.uclamp.min
   $ cat cpu.uclamp.min
   20.21

Having looked at the code, I get why this is happening, but I'm not sure
if a random user will. It's not an issue per se, but it's just a bit
weird.

I guess one way to fix this would be to revert back to having a
1024-scale for the cgroup interface too ... Though I understand Tejun
wanted % for consistency with other things.

So, I'm not sure if this is still up for discussion, but in any case I
wanted to say I support your original idea of using a 1024-scale for the
cgroups interface, since that would solve the 'issue' above and keeps
things consistent with the per-task API too.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller
  2019-07-08 11:08   ` Quentin Perret
@ 2019-07-15 13:38     ` Patrick Bellasi
  0 siblings, 0 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-15 13:38 UTC (permalink / raw)
  To: Quentin Perret, Peter Zijlstra, Tejun Heo
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Vincent Guittot, Viresh Kumar, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
	Steve Muckle, Suren Baghdasaryan, Alessio Balsini

On 08-Jul 12:08, Quentin Perret wrote:
> Hi Patrick,

Hi Quentin!

> On Monday 08 Jul 2019 at 09:43:53 (+0100), Patrick Bellasi wrote:
> > +static inline int uclamp_scale_from_percent(char *buf, u64 *value)
> > +{
> > +	*value = SCHED_CAPACITY_SCALE;
> > +
> > +	buf = strim(buf);
> > +	if (strncmp("max", buf, 4)) {
> > +		s64 percent;
> > +		int ret;
> > +
> > +		ret = cgroup_parse_float(buf, 2, &percent);
> > +		if (ret)
> > +			return ret;
> > +
> > +		percent <<= SCHED_CAPACITY_SHIFT;
> > +		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static inline u64 uclamp_percent_from_scale(u64 value)
> > +{
> > +	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
> > +}
> 
> FWIW, I tried the patches and realized these conversions result in a
> 'funny' behaviour from a user's perspective. Things like this happen:
> 
>    $ echo 20 > cpu.uclamp.min
>    $ cat cpu.uclamp.min
>    20.2
>    $ echo 20.2 > cpu.uclamp.min
>    $ cat cpu.uclamp.min
>    20.21
> 
> Having looked at the code, I get why this is happening, but I'm not sure
> if a random user will. It's not an issue per se, but it's just a bit
> weird.

Yes, that's what we get if we need to use a "two decimal digit
precision percentage" to represent a 1024 range in kernel space.

I don't think the "percent <=> utilization" conversion code can be
made more robust. The only possible alternative I see to get back
exactly what we write in, is to store the actual request in kernel
space, alongside its conversion to the SCHED_CAPACITY_SCALE required by the
actual scheduler code.

Something along these lines (on top of what we have in this series):

---8<---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ddc5fcd4b9cf..82b28cfa5c3f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7148,40 +7148,35 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 	}
 }

-static inline int uclamp_scale_from_percent(char *buf, u64 *value)
+static inline int uclamp_scale_from_percent(char *buf, s64 *percent, u64 *scale)
 {
-	*value = SCHED_CAPACITY_SCALE;
+	*scale = SCHED_CAPACITY_SCALE;

 	buf = strim(buf);
 	if (strncmp("max", buf, 4)) {
-		s64 percent;
 		int ret;

-		ret = cgroup_parse_float(buf, 2, &percent);
+		ret = cgroup_parse_float(buf, 2, percent);
 		if (ret)
 			return ret;

-		percent <<= SCHED_CAPACITY_SHIFT;
-		*value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
+		*scale = *percent << SCHED_CAPACITY_SHIFT;
+		*scale = DIV_ROUND_CLOSEST_ULL(*scale, 10000);
 	}

 	return 0;
 }

-static inline u64 uclamp_percent_from_scale(u64 value)
-{
-	return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
-}
-
 static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes,
 				    loff_t off)
 {
 	struct task_group *tg;
 	u64 min_value;
+	s64 percent;
 	int ret;

-	ret = uclamp_scale_from_percent(buf, &min_value);
+	ret = uclamp_scale_from_percent(buf, &percent, &min_value);
 	if (ret)
 		return ret;
 	if (min_value > SCHED_CAPACITY_SCALE)
@@ -7197,6 +7192,9 @@ static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
 	/* Update effective clamps to track the most restrictive value */
 	cpu_util_update_eff(of_css(of));

+	/* Keep track of the actual requested value */
+	tg->uclamp_pct[UCLAMP_MIN] = percent;
+
 	rcu_read_unlock();
 	mutex_unlock(&uclamp_mutex);

@@ -7209,9 +7207,10 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg;
 	u64 max_value;
+	s64 percent;
 	int ret;

-	ret = uclamp_scale_from_percent(buf, &max_value);
+	ret = uclamp_scale_from_percent(buf, &percent, &max_value);
 	if (ret)
 		return ret;
 	if (max_value > SCHED_CAPACITY_SCALE)
@@ -7227,6 +7226,9 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
 	/* Update effective clamps to track the most restrictive value */
 	cpu_util_update_eff(of_css(of));

+	/* Keep track of the actual requested value */
+	tg->uclamp_pct[UCLAMP_MAX] = percent;
+
 	rcu_read_unlock();
 	mutex_unlock(&uclamp_mutex);

@@ -7251,7 +7253,7 @@ static inline void cpu_uclamp_print(struct seq_file *sf,
 		return;
 	}

-	percent = uclamp_percent_from_scale(util_clamp);
+	percent = tg->uclamp_pct[clamp_id];
 	percent = div_u64_rem(percent, 100, &rem);
 	seq_printf(sf, "%llu.%u\n", percent, rem);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e37f4a4e536..4f9b0c660310 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -395,6 +395,8 @@ struct task_group {
 	struct cfs_bandwidth	cfs_bandwidth;

 #ifdef CONFIG_UCLAMP_TASK_GROUP
+	/* The two decimal precision [%] value requested from user-space */
+	unsigned int		uclamp_pct[UCLAMP_CNT];
 	/* Clamp values requested for a task group */
 	struct uclamp_se	uclamp_req[UCLAMP_CNT];
 	/* Effective clamp values used for a task group */
---8<---

> I guess one way to fix this would be to revert back to having a
> 1024-scale for the cgroup interface too ... Though I understand Tejun
> wanted % for consistency with other things.

Yes that would be another option, which will also keep aligned the per-task
and system-wide APIs with the CGroups one. Although, AFAIU, having two
different APIs is not considered a major issue.

> So, I'm not sure if this is still up for discussion, but in any case I
> wanted to say I support your original idea of using a 1024-scale for the
> cgroups interface, since that would solve the 'issue' above and keeps
> things consistent with the per-task API too.


Right, I'm personally more leaning toward either going back to use
SCHED_CAPACITY_SCALE or the add the small change I suggested above.

Tejun, Peter: any preference? Alternative suggestions?

> Thanks,
> Quentin

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps
  2019-07-08  8:43 ` [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps Patrick Bellasi
@ 2019-07-15 16:42   ` Michal Koutný
  2019-07-16 14:07     ` Patrick Bellasi
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2019-07-15 16:42 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On Mon, Jul 08, 2019 at 09:43:54AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> Since it's possible for a cpu.uclamp.min value to be bigger than the
> cpu.uclamp.max value, ensure local consistency by restricting each
> "protection"
> (i.e. min utilization) with the corresponding "limit" (i.e. max
> utilization).
I think this constraint should be mentioned in the Documentation/....

> +static void cpu_util_update_eff(struct cgroup_subsys_state *css)
> +{
> +	struct cgroup_subsys_state *top_css = css;
> +	struct uclamp_se *uc_se = NULL;
> +	unsigned int eff[UCLAMP_CNT];
> +	unsigned int clamp_id;
> +	unsigned int clamps;
> +
> +	css_for_each_descendant_pre(css, top_css) {
> +		uc_se = css_tg(css)->parent
> +			? css_tg(css)->parent->uclamp : NULL;
> +
> +		for_each_clamp_id(clamp_id) {
> +			/* Assume effective clamps matches requested clamps */
> +			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
> +			/* Cap effective clamps with parent's effective clamps */
> +			if (uc_se &&
> +			    eff[clamp_id] > uc_se[clamp_id].value) {
> +				eff[clamp_id] = uc_se[clamp_id].value;
> +			}
> +		}
> +		/* Ensure protection is always capped by limit */
> +		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
> +
> +		/* Propagate most restrictive effective clamps */
> +		clamps = 0x0;
> +		uc_se = css_tg(css)->uclamp;
(Nitpick only, reassigning child where was parent before decreases
readibility. IMO)

> +		for_each_clamp_id(clamp_id) {
> +			if (eff[clamp_id] == uc_se[clamp_id].value)
> +				continue;
> +			uc_se[clamp_id].value = eff[clamp_id];
> +			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
Shouldn't these writes be synchronized with writes from
__setscheduler_uclamp()?

> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group
  2019-07-08  8:43 ` [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group Patrick Bellasi
@ 2019-07-15 16:42   ` Michal Koutný
  2019-07-16 14:34     ` Patrick Bellasi
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2019-07-15 16:42 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On Mon, Jul 08, 2019 at 09:43:55AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> +static void uclamp_update_root_tg(void)
> +{
> +	struct task_group *tg = &root_task_group;
> +
> +	uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
> +		      sysctl_sched_uclamp_util_min, false);
> +	uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
> +		      sysctl_sched_uclamp_util_max, false);
> +
> +	cpu_util_update_eff(&root_task_group.css);
> +}
cpu_util_update_eff internally calls css_for_each_descendant_pre() so
this should be protected with rcu_read_lock().


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  2019-07-08  8:43 ` [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps Patrick Bellasi
@ 2019-07-15 16:42   ` Michal Koutný
  2019-07-16 14:34     ` Patrick Bellasi
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2019-07-15 16:42 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On Mon, Jul 08, 2019 at 09:43:56AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> This mimics what already happens for a task's CPU affinity mask when the
> task is also in a cpuset, i.e. cgroup attributes are always used to
> restrict per-task attributes.
If I am not mistaken when set_schedaffinity(2) call is made that results
in an empty cpuset, the call fails with EINVAL [1].

If I track the code correctly, the values passed to sched_setattr(2) are
checked against the trivial validity (umin <= umax) and later on, they
are adjusted to match the effective clamping of the containing
task_group. Is that correct?

If the user attempted to sched_setattr [a, b], and the effective uclamp
was [c, d] such that [a, b] ∩ [c, d] = ∅, the set uclamp will be
silently moved out of their intended range. Wouldn't it be better to
return with EINVAL too when the intersection is empty (since the user
supplied range won't be attained)?

Michal

[1] http://www.linux-arm.org/git?p=linux-pb.git;a=blob;f=kernel/sched/core.c;h=ddc5fcd4b9cfaa95496b24d8599c03bc140e768e;hb=2c15043a2a2b5d86eb409556cbe1e36d4fd275b5#l1660


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 0/5] Add utilization clamping support (CGroups API)
  2019-07-08  8:43 [PATCH v11 0/5] Add utilization clamping support (CGroups API) Patrick Bellasi
                   ` (4 preceding siblings ...)
  2019-07-08  8:43 ` [PATCH v11 5/5] sched/core: uclamp: Update CPU's refcount on TG's clamp changes Patrick Bellasi
@ 2019-07-15 16:51 ` Michal Koutný
  2019-07-16 14:03   ` Patrick Bellasi
  5 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2019-07-15 16:51 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

Hello Patrick.

I took a look at your series and I've posted some notes to your patches.

One applies more to the series overall -- I see there is enum uclamp_id
defined but at many places (local variables, function args) int or
unsigned int is used. Besides the inconsistency, I think it'd be nice to
use the enum at these places.

(Also, I may suggest CCing ML cgroups@vger.kernel.org where more eyes
may be available to the cgroup part of your series.)

Michal


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 0/5] Add utilization clamping support (CGroups API)
  2019-07-15 16:51 ` [PATCH v11 0/5] Add utilization clamping support (CGroups API) Michal Koutný
@ 2019-07-16 14:03   ` Patrick Bellasi
  0 siblings, 0 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-16 14:03 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On 15-Jul 18:51, Michal Koutný wrote:
> Hello Patrick.

Hi Michal,

> I took a look at your series and I've posted some notes to your patches.

thanks for your review!

> One applies more to the series overall -- I see there is enum uclamp_id
> defined but at many places (local variables, function args) int or
> unsigned int is used. Besides the inconsistency, I think it'd be nice to
> use the enum at these places.

Right, I think in some of the original versions I had few code paths
where it was not possible to use enum values. That seems no more the case.

Since this change is likely affecting also core bits already merged in
5.3, in v12 I'm going to add a bulk rename patch at the end of the
series, so that we can keep a better tracking of this change.

> (Also, I may suggest CCing ML cgroups@vger.kernel.org where more eyes
> may be available to the cgroup part of your series.)

Good point, I'll add that for the upcoming v12 posting.

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps
  2019-07-15 16:42   ` Michal Koutný
@ 2019-07-16 14:07     ` Patrick Bellasi
  2019-07-16 15:29       ` Michal Koutný
  0 siblings, 1 reply; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-16 14:07 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

Hi Michal,

On 15-Jul 18:42, Michal Koutný wrote:
> On Mon, Jul 08, 2019 at 09:43:54AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > Since it's possible for a cpu.uclamp.min value to be bigger than the
> > cpu.uclamp.max value, ensure local consistency by restricting each
> > "protection"
> > (i.e. min utilization) with the corresponding "limit" (i.e. max
> > utilization).
> I think this constraint should be mentioned in the Documentation/....

That note comes from the previous review cycle and it's based on a
request from Tejun to align uclamp behaviors with the way the
delegation model is supposed to work.

I guess this part of the documentation:
   https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html?highlight=protections#resource-distribution-models
should already cover the expected uclamp min/max behaviors.

However, I guess "repetita iuvant" in this case. I'll call this out
explicitly in the description of cpu.uclamp.min.

> > +static void cpu_util_update_eff(struct cgroup_subsys_state *css)
> > +{
> > +	struct cgroup_subsys_state *top_css = css;
> > +	struct uclamp_se *uc_se = NULL;
> > +	unsigned int eff[UCLAMP_CNT];
> > +	unsigned int clamp_id;
> > +	unsigned int clamps;
> > +
> > +	css_for_each_descendant_pre(css, top_css) {
> > +		uc_se = css_tg(css)->parent
> > +			? css_tg(css)->parent->uclamp : NULL;
> > +
> > +		for_each_clamp_id(clamp_id) {
> > +			/* Assume effective clamps matches requested clamps */
> > +			eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
> > +			/* Cap effective clamps with parent's effective clamps */
> > +			if (uc_se &&
> > +			    eff[clamp_id] > uc_se[clamp_id].value) {
> > +				eff[clamp_id] = uc_se[clamp_id].value;
> > +			}
> > +		}
> > +		/* Ensure protection is always capped by limit */
> > +		eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
> > +
> > +		/* Propagate most restrictive effective clamps */
> > +		clamps = 0x0;
> > +		uc_se = css_tg(css)->uclamp;
> (Nitpick only, reassigning child where was parent before decreases
> readibility. IMO)

Did not checked but I think the compiler will figure out it can still
use a single pointer for both assignments.
I'll let's the compiler to its job and add in a dedicated stack var
for the parent pointer.


> > +		for_each_clamp_id(clamp_id) {
> > +			if (eff[clamp_id] == uc_se[clamp_id].value)
> > +				continue;
> > +			uc_se[clamp_id].value = eff[clamp_id];
> > +			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
> Shouldn't these writes be synchronized with writes from
> __setscheduler_uclamp()?

You right, the synchronization is introduced by a later patch:

   sched/core: uclamp: Update CPU's refcount on TG's clamp changes

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group
  2019-07-15 16:42   ` Michal Koutný
@ 2019-07-16 14:34     ` Patrick Bellasi
  2019-07-16 15:36       ` Michal Koutný
  2019-07-16 15:46       ` Joel Fernandes
  0 siblings, 2 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-16 14:34 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On 15-Jul 18:42, Michal Koutný wrote:
> On Mon, Jul 08, 2019 at 09:43:55AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > +static void uclamp_update_root_tg(void)
> > +{
> > +	struct task_group *tg = &root_task_group;
> > +
> > +	uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
> > +		      sysctl_sched_uclamp_util_min, false);
> > +	uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
> > +		      sysctl_sched_uclamp_util_max, false);
> > +
> > +	cpu_util_update_eff(&root_task_group.css);
> > +}
> cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> this should be protected with rcu_read_lock().

Right, good catch! Will add in v12.

Cheers,
Patrick


-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  2019-07-15 16:42   ` Michal Koutný
@ 2019-07-16 14:34     ` Patrick Bellasi
  2019-07-16 15:58       ` Michal Koutný
  0 siblings, 1 reply; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-16 14:34 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On 15-Jul 18:42, Michal Koutný wrote:
> On Mon, Jul 08, 2019 at 09:43:56AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > This mimics what already happens for a task's CPU affinity mask when the
> > task is also in a cpuset, i.e. cgroup attributes are always used to
> > restrict per-task attributes.
> If I am not mistaken when set_schedaffinity(2) call is made that results
> in an empty cpuset, the call fails with EINVAL [1].
> 
> If I track the code correctly, the values passed to sched_setattr(2) are
> checked against the trivial validity (umin <= umax) and later on, they
> are adjusted to match the effective clamping of the containing
> task_group. Is that correct?
> 
> If the user attempted to sched_setattr [a, b], and the effective uclamp
> was [c, d] such that [a, b] ∩ [c, d] = ∅, the set uclamp will be
> silently moved out of their intended range. Wouldn't it be better to
> return with EINVAL too when the intersection is empty (since the user
> supplied range won't be attained)?

You right for the cpuset case, but I don't think we never end up with
a "empty" set in the case of utilization clamping.

We limit clamps hierarchically in such a way that:

  clamp[clamp_id] = min(task::clamp[clamp_id],
                        tg::clamp[clamp_id],
                        system::clamp[clamp_id])

and we ensure, on top of the above that:

  clamp[UCLAMP_MIN] = min(clamp[UCLAMP_MIN], clamp[UCLAMP_MAX])

Since it's all and only about "capping" values, at the very extreme
case you can end up with:

  clamp[UCLAMP_MIN] = clamp[UCLAMP_MAX] = 0

but that's till a valid configuration.

Am I missing something?

Otherwise, I think the changelog sentence you quoted is just
misleading.  I'll remove it from v12 since it does not really clarify
anything more then the rest of the message.

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps
  2019-07-16 14:07     ` Patrick Bellasi
@ 2019-07-16 15:29       ` Michal Koutný
  2019-07-16 17:55         ` Patrick Bellasi
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2019-07-16 15:29 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Alessio Balsini, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Joel Fernandes, Paul Turner, Steve Muckle,
	Suren Baghdasaryan, Todd Kjos, Peter Zijlstra,
	Rafael J . Wysocki, Tejun Heo, Vincent Guittot, Viresh Kumar,
	Juri Lelli, Ingo Molnar, linux-kernel, linux-pm

On Tue, Jul 16, 2019 at 03:07:06PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> That note comes from the previous review cycle and it's based on a
> request from Tejun to align uclamp behaviors with the way the
> delegation model is supposed to work.
I saw and hopefully understood that reasoning -- uclamp.min has the
protection semantics and uclamp.max the limit semantics.

However, what took me some time to comprehend when the effected
uclamp.min and uclamp.max cross over, i.e. that uclamp.min is then bound
by uclamp.max (besides parent's uclamp.min). Your commit message
explains that and I think it's relevant for the kernel docs file
itself.

> You right, the synchronization is introduced by a later patch:
> 
>    sched/core: uclamp: Update CPU's refcount on TG's clamp changes
I saw that lock but didn't realize __setscheduler_uclamp() touches only
task's struct uclamp_se, none of task_group's/css's (which is under
uclamp_mutex). That seems correct.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group
  2019-07-16 14:34     ` Patrick Bellasi
@ 2019-07-16 15:36       ` Michal Koutný
  2019-07-16 18:00         ` Patrick Bellasi
  2019-07-16 15:46       ` Joel Fernandes
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2019-07-16 15:36 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Alessio Balsini, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Joel Fernandes, Paul Turner, Steve Muckle,
	Suren Baghdasaryan, Todd Kjos, Peter Zijlstra,
	Rafael J . Wysocki, Tejun Heo, Vincent Guittot, Viresh Kumar,
	Juri Lelli, Ingo Molnar, linux-kernel, linux-pm

On Tue, Jul 16, 2019 at 03:34:17PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> > this should be protected with rcu_read_lock().
> 
> Right, good catch! Will add in v12.
When I responded to your other patch, it occurred to me that since
cpu_util_update_eff goes writing down to child csses, it should take
also uclamp_mutex here to avoid race with direct cgroup attribute
writers.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group
  2019-07-16 14:34     ` Patrick Bellasi
  2019-07-16 15:36       ` Michal Koutný
@ 2019-07-16 15:46       ` Joel Fernandes
  1 sibling, 0 replies; 24+ messages in thread
From: Joel Fernandes @ 2019-07-16 15:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Michal Koutný,
	LKML, Linux PM, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Steve Muckle, Suren Baghdasaryan, Alessio Balsini

On Tue, Jul 16, 2019 at 10:34 AM Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
>
> On 15-Jul 18:42, Michal Koutný wrote:
> > On Mon, Jul 08, 2019 at 09:43:55AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > > +static void uclamp_update_root_tg(void)
> > > +{
> > > +   struct task_group *tg = &root_task_group;
> > > +
> > > +   uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
> > > +                 sysctl_sched_uclamp_util_min, false);
> > > +   uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
> > > +                 sysctl_sched_uclamp_util_max, false);
> > > +
> > > +   cpu_util_update_eff(&root_task_group.css);
> > > +}
> > cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> > this should be protected with rcu_read_lock().
>
> Right, good catch! Will add in v12.
>

Hopefully these can catch it in the near future!
https://lore.kernel.org/rcu/20190715143705.117908-1-joel@joelfernandes.org/T/#t

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  2019-07-16 14:34     ` Patrick Bellasi
@ 2019-07-16 15:58       ` Michal Koutný
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Koutný @ 2019-07-16 15:58 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Alessio Balsini, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Joel Fernandes, Paul Turner, Steve Muckle,
	Suren Baghdasaryan, Todd Kjos, Peter Zijlstra,
	Rafael J . Wysocki, Tejun Heo, Vincent Guittot, Viresh Kumar,
	Juri Lelli, Ingo Molnar, linux-kernel, linux-pm

On Tue, Jul 16, 2019 at 03:34:35PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> Am I missing something?
No, it's rather my misinterpretation of the syscall semantics.

> Otherwise, I think the changelog sentence you quoted is just
> misleading.
It certainly mislead me to thinking about the sched_setattr calls as
requests of utilization being in the given interval (substituting 0 or 1 when
only one boundary is given, and further constrained by tg's interval).

I see your point, those are actually two (mostly) independent controls.
Makes sense now.

Thanks,
Michal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps
  2019-07-16 15:29       ` Michal Koutný
@ 2019-07-16 17:55         ` Patrick Bellasi
  0 siblings, 0 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-16 17:55 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alessio Balsini, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Joel Fernandes, Paul Turner, Steve Muckle,
	Suren Baghdasaryan, Todd Kjos, Peter Zijlstra,
	Rafael J . Wysocki, Tejun Heo, Vincent Guittot, Viresh Kumar,
	Juri Lelli, Ingo Molnar, linux-kernel, linux-pm

On 16-Jul 17:29, Michal Koutný wrote:
> On Tue, Jul 16, 2019 at 03:07:06PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > That note comes from the previous review cycle and it's based on a
> > request from Tejun to align uclamp behaviors with the way the
> > delegation model is supposed to work.
> I saw and hopefully understood that reasoning -- uclamp.min has the
> protection semantics and uclamp.max the limit semantics.
> 
> However, what took me some time to comprehend when the effected
> uclamp.min and uclamp.max cross over, i.e. that uclamp.min is then bound
> by uclamp.max (besides parent's uclamp.min). Your commit message
> explains that and I think it's relevant for the kernel docs file
> itself.

Right, I've just added a paragraph to the cpu.uclamp.min documentation.

> > You right, the synchronization is introduced by a later patch:
> > 
> >    sched/core: uclamp: Update CPU's refcount on TG's clamp changes
> I saw that lock but didn't realize __setscheduler_uclamp() touches only
> task's struct uclamp_se, none of task_group's/css's (which is under
> uclamp_mutex). That seems correct.

Right, the mutex is used only on the cgroup side. That's because the
CGroup API can affect multiple tasks running on different CPUs, thus
we wanna make sure we don't come up with race conditions. In that
path we can also afford to go a bit slower.

In the fast path instead we rely on the rq-locks to ensure
serialization on RUNNABLE tasks clamp updates.

Coming from the __setscheduler_uclamp() side however we don't sync
RUNNABLE tasks immediately. We delay the update to next enqueue
opportunity.

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group
  2019-07-16 15:36       ` Michal Koutný
@ 2019-07-16 18:00         ` Patrick Bellasi
  0 siblings, 0 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-16 18:00 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alessio Balsini, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Joel Fernandes, Paul Turner, Steve Muckle,
	Suren Baghdasaryan, Todd Kjos, Peter Zijlstra,
	Rafael J . Wysocki, Tejun Heo, Vincent Guittot, Viresh Kumar,
	Juri Lelli, Ingo Molnar, linux-kernel, linux-pm

On 16-Jul 17:36, Michal Koutný wrote:
> On Tue, Jul 16, 2019 at 03:34:17PM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> > > cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> > > this should be protected with rcu_read_lock().
> > 
> > Right, good catch! Will add in v12.
> When I responded to your other patch, it occurred to me that since
> cpu_util_update_eff goes writing down to child csses, it should take
> also uclamp_mutex here to avoid race with direct cgroup attribute
> writers.

Yep, I should drop the "dedicated" mutex we have now in
sysctl_sched_uclamp_handler() and use the uclamp_mutex we already
have.

Thanks, Patrick


-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller
  2019-07-08  8:43 ` [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller Patrick Bellasi
  2019-07-08 11:08   ` Quentin Perret
@ 2019-07-18 14:52   ` Tejun Heo
  2019-07-18 15:26     ` Patrick Bellasi
  1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2019-07-18 14:52 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

Hello, Patrick.

On Mon, Jul 08, 2019 at 09:43:53AM +0100, Patrick Bellasi wrote:
> +static inline void cpu_uclamp_print(struct seq_file *sf,
> +				    enum uclamp_id clamp_id)
> +{
> +	struct task_group *tg;
> +	u64 util_clamp;
> +	u64 percent;
> +	u32 rem;
> +
> +	rcu_read_lock();
> +	tg = css_tg(seq_css(sf));
> +	util_clamp = tg->uclamp_req[clamp_id].value;
> +	rcu_read_unlock();
> +
> +	if (util_clamp == SCHED_CAPACITY_SCALE) {
> +		seq_puts(sf, "max\n");
> +		return;
> +	}
> +
> +	percent = uclamp_percent_from_scale(util_clamp);
> +	percent = div_u64_rem(percent, 100, &rem);
> +	seq_printf(sf, "%llu.%u\n", percent, rem);

"%llu.%02u" otherwise 20.01 gets printed as 20.1

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller
  2019-07-18 14:52   ` Tejun Heo
@ 2019-07-18 15:26     ` Patrick Bellasi
  0 siblings, 0 replies; 24+ messages in thread
From: Patrick Bellasi @ 2019-07-18 15:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan,
	Alessio Balsini

On 18-Jul 07:52, Tejun Heo wrote:
> Hello, Patrick.
> 
> On Mon, Jul 08, 2019 at 09:43:53AM +0100, Patrick Bellasi wrote:
> > +static inline void cpu_uclamp_print(struct seq_file *sf,
> > +				    enum uclamp_id clamp_id)
> > +{
> > +	struct task_group *tg;
> > +	u64 util_clamp;
> > +	u64 percent;
> > +	u32 rem;
> > +
> > +	rcu_read_lock();
> > +	tg = css_tg(seq_css(sf));
> > +	util_clamp = tg->uclamp_req[clamp_id].value;
> > +	rcu_read_unlock();
> > +
> > +	if (util_clamp == SCHED_CAPACITY_SCALE) {
> > +		seq_puts(sf, "max\n");
> > +		return;
> > +	}
> > +
> > +	percent = uclamp_percent_from_scale(util_clamp);
> > +	percent = div_u64_rem(percent, 100, &rem);
> > +	seq_printf(sf, "%llu.%u\n", percent, rem);
> 
> "%llu.%02u" otherwise 20.01 gets printed as 20.1

Yup!... good point! :)

Since we already collected many feedbacks, I've got a v12 ready for posting.
Maybe you better wait for that before going on with the review.

Thanks,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, back to index

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-08  8:43 [PATCH v11 0/5] Add utilization clamping support (CGroups API) Patrick Bellasi
2019-07-08  8:43 ` [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller Patrick Bellasi
2019-07-08 11:08   ` Quentin Perret
2019-07-15 13:38     ` Patrick Bellasi
2019-07-18 14:52   ` Tejun Heo
2019-07-18 15:26     ` Patrick Bellasi
2019-07-08  8:43 ` [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps Patrick Bellasi
2019-07-15 16:42   ` Michal Koutný
2019-07-16 14:07     ` Patrick Bellasi
2019-07-16 15:29       ` Michal Koutný
2019-07-16 17:55         ` Patrick Bellasi
2019-07-08  8:43 ` [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group Patrick Bellasi
2019-07-15 16:42   ` Michal Koutný
2019-07-16 14:34     ` Patrick Bellasi
2019-07-16 15:36       ` Michal Koutný
2019-07-16 18:00         ` Patrick Bellasi
2019-07-16 15:46       ` Joel Fernandes
2019-07-08  8:43 ` [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps Patrick Bellasi
2019-07-15 16:42   ` Michal Koutný
2019-07-16 14:34     ` Patrick Bellasi
2019-07-16 15:58       ` Michal Koutný
2019-07-08  8:43 ` [PATCH v11 5/5] sched/core: uclamp: Update CPU's refcount on TG's clamp changes Patrick Bellasi
2019-07-15 16:51 ` [PATCH v11 0/5] Add utilization clamping support (CGroups API) Michal Koutný
2019-07-16 14:03   ` Patrick Bellasi

Linux-PM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-pm/0 linux-pm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-pm linux-pm/ https://lore.kernel.org/linux-pm \
		linux-pm@vger.kernel.org linux-pm@archiver.kernel.org
	public-inbox-index linux-pm


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-pm


AGPL code for this site: git clone https://public-inbox.org/ public-inbox