linux-kernel.vger.kernel.org archive mirror
* [RFC v3 0/5] Add capacity capping support to the CPU controller
@ 2017-02-28 14:38 Patrick Bellasi
  2017-02-28 14:38 ` [RFC v3 1/5] sched/core: add capacity constraints to " Patrick Bellasi
                   ` (6 more replies)
  0 siblings, 7 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-02-28 14:38 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Todd Kjos, Tim Murray,
	Andres Oportus, Joel Fernandes, Juri Lelli, Morten Rasmussen,
	Dietmar Eggemann

Was: SchedTune: central, scheduler-driven, power-performance control

This series presents a possible alternative design for what has been presented
in the past as SchedTune. The redesign addresses the main concerns and
comments collected in the LKML discussion [1] as well as at the last LPC [2].
The aim of this posting is to present a working prototype which implements
what has been discussed [2] with people like PeterZ, PaulT and TejunH.

The main differences with respect to the previous proposal [1] are:
 1. Task boosting/capping is now implemented as an extension on top of
    the existing CGroup CPU controller.
 2. The previous boosting strategy, based on the inflation of the CPU's
    utilization, has been replaced by a simpler yet effective set
    of capacity constraints.

The proposed approach makes it possible to constrain the minimum and maximum
capacity of a CPU depending on the set of tasks currently RUNNABLE on that CPU.
The set of active constraints is tracked by the core scheduler, thus the
constraints apply across all the scheduling classes. Their values are
used to clamp the CPU utilization when the schedutil CPUFreq governor
selects a frequency for that CPU.

This means that the proposed approach extends the concept of task
classification to frequency selection, thus allowing informed
run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different
optimization policies such as:
 a) Boosting of important tasks, by enforcing a minimum capacity in the
    CPUs where they are enqueued for execution.
 b) Capping of background tasks, by enforcing a maximum capacity.
 c) Containment of OPPs for RT tasks which cannot easily be switched to
    the DL class, but still do not need to run at the maximum
    frequency.

The new approach has also been designed to be compliant with CGroups v2
principles, such as support for a single hierarchy and the "limit"
resource distribution model (described in Documentation/cgroup-v2.txt).

A further extension of this idea is under development; it will exploit the
same capacity clamping attributes, in conjunction with the recently merged
capacity awareness bits [3], to achieve a more complete task
boosting/capping strategy which is entirely scheduler driven and based on
user-space defined task classification.

The first three patches of this series introduce capacity_{min,max} tracking
in the core scheduler, as an extension of the CPU controller.
The fourth patch integrates capacity capping with schedutil for FAIR tasks,
while the last patch does the same for RT/DL tasks.

This series is based on top of today's tip/sched/core and is publicly
available from this repository:

  git://www.linux-arm.com/linux-pb eas/stune/rfcv3

Cheers,
Patrick

.:: References
[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342
[3] https://lkml.org/lkml/2016/10/14/312

Patrick Bellasi (5):
  sched/core: add capacity constraints to CPU controller
  sched/core: track CPU's capacity_{min,max}
  sched/core: sync capacity_{min,max} between slow and fast paths
  sched/{core,cpufreq_schedutil}: add capacity clamping for FAIR tasks
  sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks

 include/linux/sched.h            |   3 +
 init/Kconfig                     |  17 ++
 kernel/sched/core.c              | 352 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c |  83 ++++++++-
 kernel/sched/sched.h             |  31 ++++
 5 files changed, 478 insertions(+), 8 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
@ 2017-02-28 14:38 ` Patrick Bellasi
  2017-03-13 10:46   ` Joel Fernandes (Google)
  2017-03-20 17:15   ` Tejun Heo
  2017-02-28 14:38 ` [RFC v3 2/5] sched/core: track CPU's capacity_{min,max} Patrick Bellasi
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-02-28 14:38 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo

The CPU CGroup controller makes it possible to assign a specified (maximum)
bandwidth to the tasks within a group, however it does not enforce any
constraint on how such bandwidth is consumed.
With the integration of schedutil, the scheduler now has the proper
information about a task to select the most suitable frequency to
satisfy the task's needs.

This patch extends the CPU controller by adding a couple of new
attributes, capacity_min and capacity_max, which can be used to enforce
bandwidth boosting and capping. More specifically:

- capacity_min: defines the minimum capacity which should be granted
                (by schedutil) when a task in this group is running,
                i.e. the task will run at least at that capacity

- capacity_max: defines the maximum capacity which can be granted
                (by schedutil) when a task in this group is running,
                i.e. the task can run up to that capacity

These attributes:
a) are tunable at all hierarchy levels, i.e. in the root group too
b) allow the creation of subgroups of tasks which do not violate the
   capacity constraints defined by the parent group.
   Thus, tasks in a subgroup can only be more boosted and/or more
   capped, which matches the "limits" schema proposed by the
   "Resource Distribution Model (RDM)" suggested by the
   CGroups v2 documentation (Documentation/cgroup-v2.txt)

This patch provides the basic support to expose the two new attributes
and to validate their run-time update based on the "limits" schema of
the RDM.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig         |  17 ++++++
 kernel/sched/core.c  | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   8 +++
 3 files changed, 170 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index e1a93734..71e46ce 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1044,6 +1044,23 @@ menuconfig CGROUP_SCHED
 	  bandwidth allocation to such task groups. It uses cgroups to group
 	  tasks.
 
+config CAPACITY_CLAMPING
+	bool "Capacity clamping per group of tasks"
+	depends on CPU_FREQ_GOV_SCHEDUTIL
+	depends on CGROUP_SCHED
+	default n
+	help
+	  This feature allows the scheduler to enforce maximum and minimum
+	  capacity on each CPU based on RUNNABLE tasks currently scheduled
+	  on that CPU.
+	  Minimum capacity can be used, for example, to "boost" the performance
+	  of important tasks by running them on an OPP which can be higher than
+	  the one otherwise selected by the schedutil governor.
+	  Maximum capacity can be used, for example, to "restrict" the maximum
+	  OPP which can be requested by background tasks.
+
+	  If in doubt, say N.
+
 if CGROUP_SCHED
 config FAIR_GROUP_SCHED
 	bool "Group scheduling for SCHED_OTHER"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 34e2291..a171d49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6015,6 +6015,11 @@ void __init sched_init(void)
 	autogroup_init(&init_task);
 #endif /* CONFIG_CGROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	root_task_group.cap_clamp[CAP_CLAMP_MIN] = 0;
+	root_task_group.cap_clamp[CAP_CLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
@@ -6310,6 +6315,11 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	tg->cap_clamp[CAP_CLAMP_MIN] = parent->cap_clamp[CAP_CLAMP_MIN];
+	tg->cap_clamp[CAP_CLAMP_MAX] = parent->cap_clamp[CAP_CLAMP_MAX];
+#endif
+
 	return tg;
 
 err:
@@ -6899,6 +6909,129 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static DEFINE_MUTEX(cap_clamp_mutex);
+
+static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cftype, u64 value)
+{
+	struct cgroup_subsys_state *pos;
+	unsigned int min_value;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	min_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+	mutex_lock(&cap_clamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->cap_clamp[CAP_CLAMP_MIN] == min_value)
+		goto done;
+
+	/* Ensure to not exceed the maximum capacity */
+	if (tg->cap_clamp[CAP_CLAMP_MAX] < min_value)
+		goto out;
+
+	/* Ensure min cap fits within parent constraint */
+	if (tg->parent &&
+	    tg->parent->cap_clamp[CAP_CLAMP_MIN] > min_value)
+		goto out;
+
+	/* Each child must be a subset of us */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->cap_clamp[CAP_CLAMP_MIN] < min_value)
+			goto out;
+	}
+
+	tg->cap_clamp[CAP_CLAMP_MIN] = min_value;
+
+done:
+	ret = 0;
+out:
+	rcu_read_unlock();
+	mutex_unlock(&cap_clamp_mutex);
+
+	return ret;
+}
+
+static int cpu_capacity_max_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cftype, u64 value)
+{
+	struct cgroup_subsys_state *pos;
+	unsigned int max_value;
+	struct task_group *tg;
+	int ret = -EINVAL;
+
+	max_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+	mutex_lock(&cap_clamp_mutex);
+	rcu_read_lock();
+
+	tg = css_tg(css);
+
+	/* Already at the required value */
+	if (tg->cap_clamp[CAP_CLAMP_MAX] == max_value)
+		goto done;
+
+	/* Ensure to not go below the minimum capacity */
+	if (tg->cap_clamp[CAP_CLAMP_MIN] > max_value)
+		goto out;
+
+	/* Ensure max cap fits within parent constraint */
+	if (tg->parent &&
+	    tg->parent->cap_clamp[CAP_CLAMP_MAX] < max_value)
+		goto out;
+
+	/* Each child must be a subset of us */
+	css_for_each_child(pos, css) {
+		if (css_tg(pos)->cap_clamp[CAP_CLAMP_MAX] > max_value)
+			goto out;
+	}
+
+	tg->cap_clamp[CAP_CLAMP_MAX] = max_value;
+
+done:
+	ret = 0;
+out:
+	rcu_read_unlock();
+	mutex_unlock(&cap_clamp_mutex);
+
+	return ret;
+}
+
+static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg;
+	u64 min_capacity;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
+	rcu_read_unlock();
+
+	return min_capacity;
+}
+
+static u64 cpu_capacity_max_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg;
+	u64 max_capacity;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	max_capacity = tg->cap_clamp[CAP_CLAMP_MAX];
+	rcu_read_unlock();
+
+	return max_capacity;
+}
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7193,6 +7326,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CAPACITY_CLAMPING
+	{
+		.name = "capacity_min",
+		.read_u64 = cpu_capacity_min_read_u64,
+		.write_u64 = cpu_capacity_min_write_u64,
+	},
+	{
+		.name = "capacity_max",
+		.read_u64 = cpu_capacity_max_read_u64,
+		.write_u64 = cpu_capacity_max_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "cfs_quota_us",
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 71b10a9..05dae4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -273,6 +273,14 @@ struct task_group {
 #endif
 #endif
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+#define CAP_CLAMP_MIN 0
+#define CAP_CLAMP_MAX 1
+
+	/* Min and Max capacity constraints for tasks in this group */
+	unsigned int cap_clamp[2];
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct sched_rt_entity **rt_se;
 	struct rt_rq **rt_rq;
-- 
2.7.4


* [RFC v3 2/5] sched/core: track CPU's capacity_{min,max}
  2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
  2017-02-28 14:38 ` [RFC v3 1/5] sched/core: add capacity constraints to " Patrick Bellasi
@ 2017-02-28 14:38 ` Patrick Bellasi
  2017-02-28 14:38 ` [RFC v3 3/5] sched/core: sync capacity_{min,max} between slow and fast paths Patrick Bellasi
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-02-28 14:38 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo

When CAPACITY_CLAMPING is enabled, each task is subject to a capacity
constraint which is defined by the capacity_{min,max} attributes of the
task group it belongs to.
At run-time, the capacity constraints of RUNNABLE tasks must be
aggregated to figure out the actual capacity constraints to enforce on
each CPU.

This aggregation must meet two main goals:
  1) ensure the minimum capacity required by the most boosted
     RUNNABLE task on that CPU
  2) do not penalize the least capped RUNNABLE tasks on that CPU

Thus, the aggregation for both capacity constraints turns out to be
a MAX function on the min/max capacities of the RUNNABLE tasks:

  cpu_capacity_min := MAX(capacity_min_i), for each RUNNABLE task_i
  cpu_capacity_max := MAX(capacity_max_i), for each RUNNABLE task_i

The aggregation at CPU level is done by exploiting the task_struct.
Tasks are already enqueued, via fields embedded in their task_struct, in
many different lists and trees. This patch uses the same approach to
keep track of the capacity constraints enforced by every task on a CPU.
To this purpose:
  - each CPU's RQ has two RBTrees, which are used to track the minimum and
    maximum capacity constraints of all the tasks enqueued on that CPU
  - task_struct has two rb_nodes, which allow positioning the task in the
    minimum/maximum capacity tracking RBTree of the CPU on which the
    task is enqueued

This patch provides the RBTree support code while, for the sake of
clarity, the synchronization between the
   fast path: {enqueue,dequeue}_task
and the
   slow path: cpu_capacity_{min,max}_write_u64
is provided in a dedicated patch.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 include/linux/sched.h |   3 ++
 kernel/sched/core.c   | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  23 +++++++++
 3 files changed, 155 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e2ed46d..5838570 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1531,6 +1531,9 @@ struct task_struct {
 	struct sched_rt_entity rt;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *sched_task_group;
+#ifdef CONFIG_CAPACITY_CLAMPING
+	struct rb_node cap_clamp_node[2];
+#endif
 #endif
 	struct sched_dl_entity dl;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a171d49..8f509be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -752,11 +752,128 @@ static void set_load_weight(struct task_struct *p)
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static inline void
+cap_clamp_insert_capacity(struct rq *rq, struct task_struct *p,
+			  unsigned int cap_idx)
+{
+	struct cap_clamp_cpu *cgc = &rq->cap_clamp_cpu[cap_idx];
+	struct task_group *tg = task_group(p);
+	struct rb_node *parent = NULL;
+	struct task_struct *entry;
+	struct rb_node **link;
+	struct rb_root *root;
+	struct rb_node *node;
+	int update_cache = 1;
+	u64 capacity_new;
+	u64 capacity_cur;
+
+	node = &p->cap_clamp_node[cap_idx];
+	if (!RB_EMPTY_NODE(node)) {
+		WARN(1, "cap_clamp_insert_capacity() on non empty node\n");
+		return;
+	}
+
+	/*
+	 * The capacity_{min,max} the task is subject to is defined by the
+	 * current TG the task belongs to. The TG's capacity constraints are
+	 * thus used to place the task within the rbtree used to track
+	 * the capacity_{min,max} for the CPU.
+	 */
+	capacity_new = tg->cap_clamp[cap_idx];
+	root = &cgc->tree;
+	link = &root->rb_node;
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct task_struct,
+				 cap_clamp_node[cap_idx]);
+		capacity_cur = task_group(entry)->cap_clamp[cap_idx];
+		if (capacity_new <= capacity_cur) {
+			link = &parent->rb_left;
+			update_cache = 0;
+		} else {
+			link = &parent->rb_right;
+		}
+	}
+
+	/* Add task's capacity_{min,max} and rebalance the rbtree */
+	rb_link_node(node, parent, link);
+	rb_insert_color(node, root);
+
+	if (!update_cache)
+		return;
+
+	/* Update CPU's capacity cache pointer */
+	cgc->value = capacity_new;
+	cgc->node = node;
+}
+
+static inline void
+cap_clamp_remove_capacity(struct rq *rq, struct task_struct *p,
+			  unsigned int cap_idx)
+{
+	struct cap_clamp_cpu *cgc = &rq->cap_clamp_cpu[cap_idx];
+	struct rb_node *node = &p->cap_clamp_node[cap_idx];
+	struct rb_root *root = &cgc->tree;
+
+	if (RB_EMPTY_NODE(node)) {
+		WARN(1, "cap_clamp_remove_capacity on empty node\n");
+		return;
+	}
+
+	/* Update CPU's capacity_{min,max} cache pointer */
+	if (node == cgc->node) {
+		struct rb_node *prev_node = rb_prev(node);
+
+		/* Reset value in case this was the last task */
+		cgc->value = (cap_idx == CAP_CLAMP_MIN)
+			? 0 : SCHED_CAPACITY_SCALE;
+
+		/* Update node and value, if there is another task */
+		cgc->node = prev_node;
+		if (cgc->node) {
+			struct task_struct *entry;
+
+			entry = rb_entry(cgc->node, struct task_struct,
+					 cap_clamp_node[cap_idx]);
+			cgc->value = task_group(entry)->cap_clamp[cap_idx];
+		}
+	}
+
+	/* Remove task's capacity_{min,max} */
+	rb_erase(node, root);
+	RB_CLEAR_NODE(node);
+}
+
+static inline void
+cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags)
+{
+	/* Track task's min/max capacities */
+	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MIN);
+	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MAX);
+}
+
+static inline void
+cap_clamp_dequeue_task(struct rq *rq, struct task_struct *p, int flags)
+{
+	/* Track task's min/max capacities */
+	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MIN);
+	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MAX);
+}
+#else
+static inline void
+cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags) { }
+static inline void
+cap_clamp_dequeue_task(struct rq *rq, struct task_struct *p, int flags) { }
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_rq_clock(rq);
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
+	cap_clamp_enqueue_task(rq, p, flags);
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -765,6 +882,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	if (!(flags & DEQUEUE_SAVE))
 		sched_info_dequeued(rq, p);
+	cap_clamp_dequeue_task(rq, p, flags);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -2412,6 +2530,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
+#ifdef CONFIG_CAPACITY_CLAMPING
+	RB_CLEAR_NODE(&p->cap_clamp_node[CAP_CLAMP_MIN]);
+	RB_CLEAR_NODE(&p->cap_clamp_node[CAP_CLAMP_MAX]);
+#endif
 
 	put_cpu();
 	return 0;
@@ -6058,6 +6180,13 @@ void __init sched_init(void)
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+		rq->cap_clamp_cpu[CAP_CLAMP_MIN].tree = RB_ROOT;
+		rq->cap_clamp_cpu[CAP_CLAMP_MIN].node = NULL;
+		rq->cap_clamp_cpu[CAP_CLAMP_MAX].tree = RB_ROOT;
+		rq->cap_clamp_cpu[CAP_CLAMP_MAX].node = NULL;
+#endif
+
 		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 05dae4a..4a7d224 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -461,6 +461,24 @@ struct cfs_rq {
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
 
+/* Capacity capping related fields in a runqueue */
+struct cap_clamp_cpu {
+	/*
+	 * RBTree to keep sorted capacity constraints
+	 * of currently RUNNABLE tasks on a CPU.
+	 */
+	struct rb_root tree;
+
+	/*
+	 * Pointers to the RUNNABLE task defining the current
+	 * capacity constraint for a CPU.
+	 */
+	struct rb_node *node;
+
+	/* Current CPU's capacity constraint */
+	unsigned int value;
+};
+
 static inline int rt_bandwidth_enabled(void)
 {
 	return sysctl_sched_rt_runtime >= 0;
@@ -648,6 +666,11 @@ struct rq {
 	struct list_head *tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+	/* Min and Max capacity constraints */
+	struct cap_clamp_cpu cap_clamp_cpu[2];
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 	/*
 	 * This is part of a global counter where only the total sum
 	 * over all CPUs matters. A task can increase this counter on
-- 
2.7.4


* [RFC v3 3/5] sched/core: sync capacity_{min,max} between slow and fast paths
  2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
  2017-02-28 14:38 ` [RFC v3 1/5] sched/core: add capacity constraints to " Patrick Bellasi
  2017-02-28 14:38 ` [RFC v3 2/5] sched/core: track CPU's capacity_{min,max} Patrick Bellasi
@ 2017-02-28 14:38 ` Patrick Bellasi
  2017-02-28 14:38 ` [RFC v3 4/5] sched/{core,cpufreq_schedutil}: add capacity clamping for FAIR tasks Patrick Bellasi
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-02-28 14:38 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo

At enqueue/dequeue time a task needs to be placed in the CPU's rb_tree,
depending on the current capacity_{min,max} values of the cgroup it
belongs to. Thus, we need to guarantee that these values cannot be changed
while the task is within these critical sections.

To this purpose, this patch uses the same locking schema already used by
__set_cpus_allowed_ptr(). We might uselessly lock the (previous) RQ of
a !RUNNABLE task, but that is the price to pay to safely serialize
capacity_{min,max} updates with enqueues, dequeues and migrations.

This patch adds the synchronization calls required to guarantee that each
RUNNABLE task is always in the correct relative position within the
RBTree. Specifically, when a group's capacity_{min,max} value is
updated, each task in that group is re-positioned within the rb_tree, if
it is currently RUNNABLE and its relative position has changed.
This operation is mutually exclusive with the task being {en,de}queued
or migrated via task_rq_lock().

It is worth noting that moving a task from one CGroup to another,
perhaps with different capacity_{min,max} values, is already covered by
the current locking schema. Indeed, this operation requires a dequeue
from the original cgroup's RQ followed by an enqueue in the new one.
The same argument holds for task migrations; thus, migrations between
CPUs and CGroups are ultimately managed like task wakeups/sleeps.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f509be..d620bc4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -846,9 +846,68 @@ cap_clamp_remove_capacity(struct rq *rq, struct task_struct *p,
 	RB_CLEAR_NODE(node);
 }
 
+static void
+cap_clamp_update_capacity(struct task_struct *p, unsigned int cap_idx)
+{
+	struct task_group *tg = task_group(p);
+	unsigned int next_cap = SCHED_CAPACITY_SCALE;
+	unsigned int prev_cap = 0;
+	struct task_struct *entry;
+	struct rb_node *node;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	/*
+	 * Lock the CPU's RBTree where the task is (eventually) queued.
+	 *
+	 * We might uselessly lock the (previous) RQ of a !RUNNABLE task, but
+	 * that's the price to pay to safely serialize capacity_{min,max}
+	 * updates with enqueues, dequeues and migration operations, which is
+	 * the same locking schema already in use by __set_cpus_allowed_ptr().
+	 */
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * If the task has no node in the rbtree, it's not yet RUNNABLE or
+	 * it's going to be enqueued with the proper value.
+	 * The setting of the cap_clamp_node is serialized by task_rq_lock().
+	 */
+	if (RB_EMPTY_NODE(&p->cap_clamp_node[cap_idx]))
+		goto done;
+
+	/* Check current position in the capacity rbtree */
+	node = rb_next(&p->cap_clamp_node[cap_idx]);
+	if (node) {
+		entry = rb_entry(node, struct task_struct,
+				 cap_clamp_node[cap_idx]);
+		next_cap = task_group(entry)->cap_clamp[cap_idx];
+	}
+	node = rb_prev(&p->cap_clamp_node[cap_idx]);
+	if (node) {
+		entry = rb_entry(node, struct task_struct,
+				 cap_clamp_node[cap_idx]);
+		prev_cap = task_group(entry)->cap_clamp[cap_idx];
+	}
+
+	/* If relative position has not changed: nothing to do */
+	if (prev_cap <= tg->cap_clamp[cap_idx] &&
+	    next_cap >= tg->cap_clamp[cap_idx])
+		goto done;
+
+	/* Reposition this node within the rbtree */
+	cap_clamp_remove_capacity(rq, p, cap_idx);
+	cap_clamp_insert_capacity(rq, p, cap_idx);
+
+done:
+	task_rq_unlock(rq, p, &rf);
+}
+
 static inline void
 cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	lockdep_assert_held(&p->pi_lock);
+	lockdep_assert_held(&rq->lock);
+
 	/* Track task's min/max capacities */
 	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MIN);
 	cap_clamp_insert_capacity(rq, p, CAP_CLAMP_MAX);
@@ -857,6 +916,9 @@ cap_clamp_enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 static inline void
 cap_clamp_dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	lockdep_assert_held(&p->pi_lock);
+	lockdep_assert_held(&rq->lock);
+
 	/* Track task's min/max capacities */
 	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MIN);
 	cap_clamp_remove_capacity(rq, p, CAP_CLAMP_MAX);
@@ -7046,8 +7108,10 @@ static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
 				      struct cftype *cftype, u64 value)
 {
 	struct cgroup_subsys_state *pos;
+	struct css_task_iter it;
 	unsigned int min_value;
 	struct task_group *tg;
+	struct task_struct *p;
 	int ret = -EINVAL;
 
 	min_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
@@ -7078,6 +7142,12 @@ static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
 
 	tg->cap_clamp[CAP_CLAMP_MIN] = min_value;
 
+	/* Update the capacity_min of RUNNABLE tasks */
+	css_task_iter_start(css, &it);
+	while ((p = css_task_iter_next(&it)))
+		cap_clamp_update_capacity(p, CAP_CLAMP_MIN);
+	css_task_iter_end(&it);
+
 done:
 	ret = 0;
 out:
@@ -7091,8 +7161,10 @@ static int cpu_capacity_max_write_u64(struct cgroup_subsys_state *css,
 				      struct cftype *cftype, u64 value)
 {
 	struct cgroup_subsys_state *pos;
+	struct css_task_iter it;
 	unsigned int max_value;
 	struct task_group *tg;
+	struct task_struct *p;
 	int ret = -EINVAL;
 
 	max_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
@@ -7123,6 +7195,12 @@ static int cpu_capacity_max_write_u64(struct cgroup_subsys_state *css,
 
 	tg->cap_clamp[CAP_CLAMP_MAX] = max_value;
 
+	/* Update the capacity_max of RUNNABLE tasks */
+	css_task_iter_start(css, &it);
+	while ((p = css_task_iter_next(&it)))
+		cap_clamp_update_capacity(p, CAP_CLAMP_MAX);
+	css_task_iter_end(&it);
+
 done:
 	ret = 0;
 out:
-- 
2.7.4


* [RFC v3 4/5] sched/{core,cpufreq_schedutil}: add capacity clamping for FAIR tasks
  2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
                   ` (2 preceding siblings ...)
  2017-02-28 14:38 ` [RFC v3 3/5] sched/core: sync capacity_{min,max} between slow and fast paths Patrick Bellasi
@ 2017-02-28 14:38 ` Patrick Bellasi
  2017-02-28 14:38 ` [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks Patrick Bellasi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-02-28 14:38 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki

Each time a frequency update is required via schedutil, we must enforce
the capacity_{min,max} constraints defined on the current CPU by the
set of currently RUNNABLE tasks.

This patch adds the required support to clamp the utilization generated
by FAIR tasks within the boundaries defined by the current constraints.
The clamped utilization is ultimately used to select the frequency,
thus making it possible both to:
 - boost small tasks
   by running them at least at a minimum granted capacity (i.e. frequency)
 - cap background tasks
   by running them only up to a maximum granted capacity (i.e. frequency)

The default values for boosting and capping are defined to be:
 - capacity_min: 0
 - capacity_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/cpufreq_schedutil.c | 68 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd46593..51484f7 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -192,6 +192,54 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, unsigned long *util,
 	sg_cpu->iowait_boost >>= 1;
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static inline
+void cap_clamp_cpu_range(unsigned int cpu, unsigned int *cap_min,
+			 unsigned int *cap_max)
+{
+	struct cap_clamp_cpu *cgc;
+
+	*cap_min = 0;
+	cgc = &cpu_rq(cpu)->cap_clamp_cpu[CAP_CLAMP_MIN];
+	if (cgc->node)
+		*cap_min = cgc->value;
+
+	*cap_max = SCHED_CAPACITY_SCALE;
+	cgc = &cpu_rq(cpu)->cap_clamp_cpu[CAP_CLAMP_MAX];
+	if (cgc->node)
+		*cap_max = cgc->value;
+}
+
+static inline
+unsigned int cap_clamp_cpu_util(unsigned int cpu, unsigned int util)
+{
+	unsigned int cap_max, cap_min;
+
+	cap_clamp_cpu_range(cpu, &cap_min, &cap_max);
+	return clamp(util, cap_min, cap_max);
+}
+
+static inline
+void cap_clamp_compose(unsigned int *cap_min, unsigned int *cap_max,
+		       unsigned int j_cap_min, unsigned int j_cap_max)
+{
+	*cap_min = max(*cap_min, j_cap_min);
+	*cap_max = max(*cap_max, j_cap_max);
+}
+
+#define cap_clamp_util_range(util, cap_min, cap_max) \
+	clamp_t(typeof(util), util, cap_min, cap_max)
+
+#else
+
+#define cap_clamp_cpu_range(cpu, cap_min, cap_max) { }
+#define cap_clamp_cpu_util(cpu, util) util
+#define cap_clamp_compose(cap_min, cap_max, j_cap_min, j_cap_max) { }
+#define cap_clamp_util_range(util, cap_min, cap_max) util
+
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
 static void sugov_update_single(struct update_util_data *hook, u64 time,
 				unsigned int flags)
 {
@@ -212,6 +260,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	} else {
 		sugov_get_util(&util, &max);
 		sugov_iowait_boost(sg_cpu, &util, &max);
+		util = cap_clamp_cpu_util(smp_processor_id(), util);
 		next_f = get_next_freq(sg_cpu, util, max);
 	}
 	sugov_update_commit(sg_policy, time, next_f);
@@ -225,6 +274,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned int max_f = policy->cpuinfo.max_freq;
 	u64 last_freq_update_time = sg_policy->last_freq_update_time;
+	unsigned int cap_max = SCHED_CAPACITY_SCALE;
+	unsigned int cap_min = 0;
 	unsigned int j;
 
 	if (flags & SCHED_CPUFREQ_RT_DL)
@@ -232,9 +283,13 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
 
 	sugov_iowait_boost(sg_cpu, &util, &max);
 
+	/* Initialize clamping range based on caller CPU constraints */
+	cap_clamp_cpu_range(smp_processor_id(), &cap_min, &cap_max);
+
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu;
 		unsigned long j_util, j_max;
+		unsigned int j_cap_max, j_cap_min;
 		s64 delta_ns;
 
 		if (j == smp_processor_id())
@@ -264,8 +319,21 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
 		}
 
 		sugov_iowait_boost(j_sg_cpu, &util, &max);
+
+		/*
+		 * Update clamping range based on this CPU constraints, but
+		 * only if this CPU is not currently idle. Idle CPUs do not
+		 * enforce constraints in a shared frequency domain.
+		 */
+		if (!idle_cpu(j)) {
+			cap_clamp_cpu_range(j, &j_cap_min, &j_cap_max);
+			cap_clamp_compose(&cap_min, &cap_max,
+					  j_cap_min, j_cap_max);
+		}
 	}
 
+	/* Clamp utilization on aggregated CPUs ranges */
+	util = cap_clamp_util_range(util, cap_min, cap_max);
 	return get_next_freq(sg_cpu, util, max);
 }
 
-- 
2.7.4
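
The composition rule in the patch above is easy to model outside the kernel.
The sketch below is illustrative userspace C, not kernel code (the names
simply mirror the patch); it shows why both ends of the range aggregate with
max(): a CPU running a boosted task keeps its high capacity_min, and a
sibling group's lower capacity_max cannot cap the whole frequency domain.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u

/* Merge CPU j's clamp range into the domain-wide range: both ends
 * aggregate with max(), mirroring cap_clamp_compose() in the patch. */
static void cap_clamp_compose(unsigned int *cap_min, unsigned int *cap_max,
			      unsigned int j_cap_min, unsigned int j_cap_max)
{
	if (j_cap_min > *cap_min)
		*cap_min = j_cap_min;
	if (j_cap_max > *cap_max)
		*cap_max = j_cap_max;
}

/* Clamp the aggregated utilization into the composed range, as
 * cap_clamp_util_range() does before calling get_next_freq(). */
static unsigned int cap_clamp_util_range(unsigned int util,
					 unsigned int cap_min,
					 unsigned int cap_max)
{
	if (util < cap_min)
		return cap_min;
	if (util > cap_max)
		return cap_max;
	return util;
}
```

For instance, a CPU whose tasks are capped to [0, 256] composed with a
sibling CPU running a boosted task in [512, 1024] yields [512, 1024] for the
shared domain, so a low aggregated utilization is still raised to 512.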

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
                   ` (3 preceding siblings ...)
  2017-02-28 14:38 ` [RFC v3 4/5] sched/{core,cpufreq_schedutil}: add capacity clamping for FAIR tasks Patrick Bellasi
@ 2017-02-28 14:38 ` Patrick Bellasi
  2017-03-13 10:08   ` Joel Fernandes (Google)
  2017-03-15 11:41 ` [RFC v3 0/5] Add capacity capping support to the CPU controller Rafael J. Wysocki
  2017-03-20 14:51 ` Tejun Heo
  6 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-02-28 14:38 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki

Currently schedutil enforces the maximum OPP whenever RT/DL tasks are
RUNNABLE. Such a mandatory policy can be made more tunable from
userspace, for example by allowing the definition of a reasonable max
capacity (i.e. frequency) required for the execution of a specific
RT/DL workload. This will contribute to making the RT class more
"friendly" for power/energy-sensitive applications.

This patch extends the usage of capacity_{min,max} to the RT/DL classes.
Whenever a task in these classes is RUNNABLE, the capacity required is
defined by the constraints of the control group that task belongs to.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/cpufreq_schedutil.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 51484f7..18abd62 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -256,7 +256,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 		return;
 
 	if (flags & SCHED_CPUFREQ_RT_DL) {
-		next_f = policy->cpuinfo.max_freq;
+		util = cap_clamp_cpu_util(smp_processor_id(),
+					  SCHED_CAPACITY_SCALE);
+		next_f = get_next_freq(sg_cpu, util, policy->cpuinfo.max_freq);
 	} else {
 		sugov_get_util(&util, &max);
 		sugov_iowait_boost(sg_cpu, &util, &max);
@@ -272,15 +274,11 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
 {
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	struct cpufreq_policy *policy = sg_policy->policy;
-	unsigned int max_f = policy->cpuinfo.max_freq;
 	u64 last_freq_update_time = sg_policy->last_freq_update_time;
 	unsigned int cap_max = SCHED_CAPACITY_SCALE;
 	unsigned int cap_min = 0;
 	unsigned int j;
 
-	if (flags & SCHED_CPUFREQ_RT_DL)
-		return max_f;
-
 	sugov_iowait_boost(sg_cpu, &util, &max);
 
 	/* Initialize clamping range based on caller CPU constraints */
@@ -308,10 +306,11 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
 			j_sg_cpu->iowait_boost = 0;
 			continue;
 		}
-		if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
-			return max_f;
 
-		j_util = j_sg_cpu->util;
+		if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+			j_util = cap_clamp_cpu_util(j, SCHED_CAPACITY_SCALE);
+		else
+			j_util = j_sg_cpu->util;
 		j_max = j_sg_cpu->max;
 		if (j_util * max > j_max * util) {
 			util = j_util;
-- 
2.7.4
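
The effect of this patch on frequency selection can be sketched numerically.
The model below is illustrative userspace C, under the assumption that
get_next_freq() keeps its usual shape in this series, roughly
freq = (max_freq + max_freq/4) * util / max: instead of an unconditional
jump to cpuinfo.max_freq, a RUNNABLE RT/DL task now requests
SCHED_CAPACITY_SCALE clamped into its group's range, so a capped group lands
on a proportionally lower OPP.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u

/* Clamp a capacity request into a group's [cap_min, cap_max] range. */
static unsigned int cap_clamp(unsigned int util,
			      unsigned int cap_min, unsigned int cap_max)
{
	if (util < cap_min)
		return cap_min;
	if (util > cap_max)
		return cap_max;
	return util;
}

/* Same shape as schedutil's get_next_freq(): add a 25% margin, then
 * scale by util/max (frequencies in kHz, util/max in capacity units). */
static unsigned int get_next_freq(unsigned int max_freq,
				  unsigned int util, unsigned int max)
{
	unsigned long long freq = max_freq + (max_freq >> 2);

	return (unsigned int)(freq * util / max);
}
```

With max_freq = 1800000 kHz and a group capped at cap_max = 512, an RT task
requests cap_clamp(1024, 0, 512) = 512 and ends up at half of the boosted
maximum instead of pinning the CPU at max_freq.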


* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-02-28 14:38 ` [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks Patrick Bellasi
@ 2017-03-13 10:08   ` Joel Fernandes (Google)
  2017-03-15 11:40     ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Fernandes (Google) @ 2017-03-13 10:08 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Joel Fernandes, Andres Oportus

Hi Patrick,

On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> Currently schedutil enforces the maximum OPP whenever RT/DL tasks are
> RUNNABLE. Such a mandatory policy can be made more tunable from
> userspace, for example by allowing the definition of a reasonable max
> capacity (i.e. frequency) required for the execution of a specific
> RT/DL workload. This will contribute to making the RT class more
> "friendly" for power/energy-sensitive applications.
>
> This patch extends the usage of capacity_{min,max} to the RT/DL classes.
> Whenever a task in these classes is RUNNABLE, the capacity required is
> defined by the constraints of the control group that task belongs to.
>

We briefly discussed at Linaro Connect that this works well for
sporadic RT tasks that run briefly and then sleep for long periods of
time - so this patch is certainly good, but it's only a partial
solution to the problem of frequent short-sleepers: something is also
required to keep the boost active across short non-RUNNABLE periods.
The behavior with many periodic RT tasks is that they will sleep for
short intervals and run for short intervals periodically. In this case,
removing the clamp (or the boost, as in schedtune v2) on dequeue will
essentially mean that during a narrow window cpufreq can drop the
frequency, only to make it go back up again.

Currently for schedtune v2, I am working on prototyping something like
the following for Android:
- If an RT task is enqueued, introduce the boost.
- When the task is dequeued, start a timer for a "minimum deboost delay
time" before taking out the boost.
- If the task is enqueued again before the timer fires, cancel the timer.

I don't think any "fix" for this particular issue should go into the
schedutil governor; it should be sorted out before getting to cpufreq
itself (that is, before making the request). What do you think about this?

Thanks,
Joel
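
The enqueue/dequeue/timer protocol sketched above can be modeled with a
simulated clock instead of a real hrtimer. The sketch below is illustrative
userspace C; all names and the deboost-delay tunable are hypothetical, not
part of any posted patch. Re-enqueueing within the delay window keeps the
boost continuously active, which is the behavior wanted for periodic
short-sleepers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical deboost-delay state machine; time is passed in explicitly
 * rather than armed on a real timer. */
struct rt_boost {
	unsigned long long deboost_delay_ns; /* "minimum deboost delay" */
	unsigned long long dequeued_at_ns;   /* when the task last left the rq */
	bool runnable;                       /* task currently enqueued */
};

static void boost_enqueue(struct rt_boost *b)
{
	/* (Re)enqueue cancels any pending deboost "timer". */
	b->runnable = true;
}

static void boost_dequeue(struct rt_boost *b, unsigned long long now_ns)
{
	/* Arm the deboost "timer" instead of dropping the boost at once. */
	b->runnable = false;
	b->dequeued_at_ns = now_ns;
}

static bool boost_active(const struct rt_boost *b, unsigned long long now_ns)
{
	if (b->runnable)
		return true;
	return now_ns - b->dequeued_at_ns < b->deboost_delay_ns;
}
```

A periodic task that sleeps for less than deboost_delay_ns between
activations therefore never sees the boost removed at all.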


* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-02-28 14:38 ` [RFC v3 1/5] sched/core: add capacity constraints to " Patrick Bellasi
@ 2017-03-13 10:46   ` Joel Fernandes (Google)
  2017-03-15 11:20     ` Patrick Bellasi
  2017-03-20 17:15   ` Tejun Heo
  1 sibling, 1 reply; 66+ messages in thread
From: Joel Fernandes (Google) @ 2017-03-13 10:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Joel Fernandes

On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> The CPU CGroup controller allows assigning a specified (maximum)
> bandwidth to tasks within a group; however, it does not enforce any
> constraint on how such bandwidth can be consumed.
> With the integration of schedutil, the scheduler now has the proper
> information about a task to select the most suitable frequency to
> satisfy the task's needs.
[..]

> +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> +                                    struct cftype *cft)
> +{
> +       struct task_group *tg;
> +       u64 min_capacity;
> +
> +       rcu_read_lock();
> +       tg = css_tg(css);
> +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];

Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
the write path) to avoid load-tearing?

Thanks,
Joel


* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-13 10:46   ` Joel Fernandes (Google)
@ 2017-03-15 11:20     ` Patrick Bellasi
  2017-03-15 13:20       ` Joel Fernandes
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-15 11:20 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Joel Fernandes

On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > The CPU CGroup controller allows assigning a specified (maximum)
> > bandwidth to tasks within a group; however, it does not enforce any
> > constraint on how such bandwidth can be consumed.
> > With the integration of schedutil, the scheduler now has the proper
> > information about a task to select the most suitable frequency to
> > satisfy the task's needs.
> [..]
> 
> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > +                                    struct cftype *cft)
> > +{
> > +       struct task_group *tg;
> > +       u64 min_capacity;
> > +
> > +       rcu_read_lock();
> > +       tg = css_tg(css);
> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> 
> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> the write path) to avoid load-tearing?

tg->cap_clamp is an "unsigned int", so I would expect it to be
read/written with a single memory access, wouldn't it? I mean: I do not
expect the compiler "to mess" with these accesses.

However, if your concern is more about overlapping reads/writes of the
same capacity from different threads, then perhaps we should rather use
a mutex to serialize these two functions... not entirely convinced...

> Thanks,
> Joel

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-13 10:08   ` Joel Fernandes (Google)
@ 2017-03-15 11:40     ` Patrick Bellasi
  2017-03-15 12:59       ` Joel Fernandes
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-15 11:40 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Joel Fernandes, Andres Oportus

On 13-Mar 03:08, Joel Fernandes (Google) wrote:
> Hi Patrick,
> 
> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > Currently schedutil enforces the maximum OPP whenever RT/DL tasks are
> > RUNNABLE. Such a mandatory policy can be made more tunable from
> > userspace, for example by allowing the definition of a reasonable max
> > capacity (i.e. frequency) required for the execution of a specific
> > RT/DL workload. This will contribute to making the RT class more
> > "friendly" for power/energy-sensitive applications.
> >
> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.
> > Whenever a task in these classes is RUNNABLE, the capacity required is
> > defined by the constraints of the control group that task belongs to.
> >
> 
> We briefly discussed at Linaro Connect that this works well for
> sporadic RT tasks that run briefly and then sleep for long periods of
> time - so this patch is certainly good, but it's only a partial
> solution to the problem of frequent short-sleepers: something is also
> required to keep the boost active across short non-RUNNABLE periods.
> The behavior with many periodic RT tasks is that they will sleep for
> short intervals and run for short intervals periodically. In this case,
> removing the clamp (or the boost, as in schedtune v2) on dequeue will
> essentially mean that during a narrow window cpufreq can drop the
> frequency, only to make it go back up again.
> 
> Currently for schedtune v2, I am working on prototyping something like
> the following for Android:
> - If an RT task is enqueued, introduce the boost.
> - When the task is dequeued, start a timer for a "minimum deboost delay
> time" before taking out the boost.
> - If the task is enqueued again before the timer fires, cancel the timer.
> 
> I don't think any "fix" for this particular issue should go into the
> schedutil governor; it should be sorted out before getting to cpufreq
> itself (that is, before making the request). What do you think about this?

My short observations are:

1) for certain RT tasks, which have quite a "predictable" activation
   pattern, we should definitely try to use DEADLINE... which will
   factor out all "boosting potential races", since the bandwidth
   requirements are well defined at task description time.

2) CPU boosting is, at least for the time being, a best-effort feature
   which is introduced mainly for FAIR tasks.

3) Tracking the boost at enqueue/dequeue time matches the design of
   tracking features/properties of the currently RUNNABLE tasks, while
   avoiding adding yet another signal to track CPU utilization.

4) The previous point is about "separation of concerns", thus IMHO any
   policy defining how to consume the CPU utilization signal
   (whether it is boosted or not) should be the responsibility of
   schedutil, which does not, however, exclude useful input from the
   scheduler.

5) I understand the usefulness of a scale-down threshold for schedutil
   to reduce the current OPP, while I don't get the point of a scale-up
   threshold. If the system is demanding more capacity and there are
   no HW constraints (e.g. pending changes), then we should go up as
   soon as possible.

Finally, I think we can considerably improve the boosting issues you
are having with RT tasks by better refining the schedutil thresholds
implementation.

We already have some patches pending for review:
   https://lkml.org/lkml/2017/3/2/385
which fix some schedutil issues, and we will follow up with others
trying to improve the rate limiting so as not to compromise responsiveness.


> Thanks,
> Joel

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
                   ` (4 preceding siblings ...)
  2017-02-28 14:38 ` [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks Patrick Bellasi
@ 2017-03-15 11:41 ` Rafael J. Wysocki
  2017-03-15 12:59   ` Patrick Bellasi
  2017-03-20 14:51 ` Tejun Heo
  6 siblings, 1 reply; 66+ messages in thread
From: Rafael J. Wysocki @ 2017-03-15 11:41 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Morten Rasmussen, Dietmar Eggemann

On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
> Was: SchedTune: central, scheduler-driven, power-performance control
> 
> This series presents a possible alternative design for what has been presented
> in the past as SchedTune. This redesign has been defined to address the main
> concerns and comments collected in the LKML discussion [1] as well as at the last
> LPC [2].
> The aim of this posting is to present a working prototype which implements
> what has been discussed [2] with people like PeterZ, PaulT and TejunH.
> 
> The main differences with respect to the previous proposal [1] are:
>  1. Task boosting/capping is now implemented as an extension on top of
>     the existing CGroup CPU controller.
>  2. The previous boosting strategy, based on the inflation of the CPU's
>     utilization, has now been replaced by a simpler yet effective set
>     of capacity constraints.
> 
> The proposed approach allows constraining the minimum and maximum capacity
> of a CPU depending on the set of tasks currently RUNNABLE on that CPU.
> The set of active constraints is tracked by the core scheduler, thus it
> applies across all the scheduling classes. The values of the constraints are
> used to clamp the CPU utilization when the schedutil CPUFreq governor
> selects a frequency for that CPU.
> 
> This means that the newly proposed approach allows extending the concept of
> task classification to frequency selection, thus allowing informed
> run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different
> optimization policies such as:
>  a) Boosting of important tasks, by enforcing a minimum capacity on the
>     CPUs where they are enqueued for execution.
>  b) Capping of background tasks, by enforcing a maximum capacity.
>  c) Containment of OPPs for RT tasks which cannot easily be switched
>     over to the DL class, but still don't need to run at the maximum
>     frequency.

Do you have any practical examples of that, like for example what exactly
Android is going to use this for?

I gather that there is some experience with the current EAS implementation
there, so I wonder how this work is related to that.

Thanks,
Rafael


* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-15 11:40     ` Patrick Bellasi
@ 2017-03-15 12:59       ` Joel Fernandes
  2017-03-15 14:44         ` Juri Lelli
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Fernandes @ 2017-03-15 12:59 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 13-Mar 03:08, Joel Fernandes (Google) wrote:
>> Hi Patrick,
>>
>> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
> >> > Currently schedutil enforces the maximum OPP whenever RT/DL tasks are
> >> > RUNNABLE. Such a mandatory policy can be made more tunable from
> >> > userspace, for example by allowing the definition of a reasonable max
> >> > capacity (i.e. frequency) required for the execution of a specific
> >> > RT/DL workload. This will contribute to making the RT class more
> >> > "friendly" for power/energy-sensitive applications.
>> >
>> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.
>> > Whenever a task in these classes is RUNNABLE, the capacity required is
>> > defined by the constraints of the control group that task belongs to.
>> >
>>
> >> We briefly discussed at Linaro Connect that this works well for
> >> sporadic RT tasks that run briefly and then sleep for long periods of
> >> time - so this patch is certainly good, but it's only a partial
> >> solution to the problem of frequent short-sleepers: something is also
> >> required to keep the boost active across short non-RUNNABLE periods.
> >> The behavior with many periodic RT tasks is that they will sleep for
> >> short intervals and run for short intervals periodically. In this case,
> >> removing the clamp (or the boost, as in schedtune v2) on dequeue will
> >> essentially mean that during a narrow window cpufreq can drop the
> >> frequency, only to make it go back up again.
>>
> >> Currently for schedtune v2, I am working on prototyping something like
> >> the following for Android:
> >> - If an RT task is enqueued, introduce the boost.
> >> - When the task is dequeued, start a timer for a "minimum deboost delay
> >> time" before taking out the boost.
> >> - If the task is enqueued again before the timer fires, cancel the timer.
>>
> >> I don't think any "fix" for this particular issue should go into the
> >> schedutil governor; it should be sorted out before getting to cpufreq
> >> itself (that is, before making the request). What do you think about this?
>
> My short observations are:
>
> 1) for certain RT tasks, which have quite a "predictable" activation
>    pattern, we should definitely try to use DEADLINE... which will
>    factor out all "boosting potential races", since the bandwidth
>    requirements are well defined at task description time.

I don't immediately see how DEADLINE can fix this: when a task is
dequeued at the end of its current runtime, its bandwidth will be
subtracted from the active running bandwidth, and that is what drives
the DL part of the capacity request. In this case we run into the same
issue as with the boost removal on dequeue, don't we?

> 4) The previous point is about "separation of concerns", thus IMHO any
>    policy defining how to consume the CPU utilization signal
>    (whether it is boosted or not) should be the responsibility of
>    schedutil, which does not, however, exclude useful input from the
>    scheduler.
>
> 5) I understand the usefulness of a scale-down threshold for schedutil
>    to reduce the current OPP, while I don't get the point of a scale-up
>    threshold. If the system is demanding more capacity and there are
>    no HW constraints (e.g. pending changes), then we should go up as
>    soon as possible.
>
> Finally, I think we can considerably improve the boosting issues you
> are having with RT tasks by better refining the schedutil thresholds
> implementation.
>
> We already have some patches pending for review:
>    https://lkml.org/lkml/2017/3/2/385
> which fix some schedutil issues, and we will follow up with others
> trying to improve the rate limiting so as not to compromise responsiveness.

I agree we can try to explore fixing schedutil to do the right thing.

J.


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-15 11:41 ` [RFC v3 0/5] Add capacity capping support to the CPU controller Rafael J. Wysocki
@ 2017-03-15 12:59   ` Patrick Bellasi
  2017-03-16  1:04     ` Rafael J. Wysocki
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-15 12:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Morten Rasmussen, Dietmar Eggemann

On 15-Mar 12:41, Rafael J. Wysocki wrote:
> On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
> > Was: SchedTune: central, scheduler-driven, power-performance control
> > 
> > This series presents a possible alternative design for what has been presented
> > in the past as SchedTune. This redesign has been defined to address the main
> > concerns and comments collected in the LKML discussion [1] as well as at the last
> > LPC [2].
> > The aim of this posting is to present a working prototype which implements
> > what has been discussed [2] with people like PeterZ, PaulT and TejunH.
> > 
> > The main differences with respect to the previous proposal [1] are:
> >  1. Task boosting/capping is now implemented as an extension on top of
> >     the existing CGroup CPU controller.
> >  2. The previous boosting strategy, based on the inflation of the CPU's
> >     utilization, has now been replaced by a simpler yet effective set
> >     of capacity constraints.
> > 
> > The proposed approach allows constraining the minimum and maximum capacity
> > of a CPU depending on the set of tasks currently RUNNABLE on that CPU.
> > The set of active constraints is tracked by the core scheduler, thus it
> > applies across all the scheduling classes. The values of the constraints are
> > used to clamp the CPU utilization when the schedutil CPUFreq governor
> > selects a frequency for that CPU.
> > 
> > This means that the newly proposed approach allows extending the concept of
> > task classification to frequency selection, thus allowing informed
> > run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different
> > optimization policies such as:
> >  a) Boosting of important tasks, by enforcing a minimum capacity on the
> >     CPUs where they are enqueued for execution.
> >  b) Capping of background tasks, by enforcing a maximum capacity.
> >  c) Containment of OPPs for RT tasks which cannot easily be switched
> >     over to the DL class, but still don't need to run at the maximum
> >     frequency.
> 
> Do you have any practical examples of that, like for example what exactly
> Android is going to use this for?

In general, every "informed run-time" usually knows quite a lot about
task requirements and how they impact the user experience.

In Android, for example, tasks are classified depending on their
_current_ role. We can distinguish, for example, between:

- TOP_APP:    which are tasks currently affecting the UI, i.e. part of
              the app currently in foreground
- BACKGROUND: which are tasks not directly impacting the user
              experience

Given this information, it could make sense to adopt different
service/optimization policies for different tasks.
For example, we may be interested in giving maximum responsiveness to
TOP_APP tasks while still being able to save as much energy as
possible for the BACKGROUND tasks.

That's where the proposal in this series (partially) comes in handy.

What we propose is a "standard" interface to collect relevant
information from "informed run-times", which can be used to:

a) classify tasks according to the main optimization goals:
   performance boosting vs energy saving

b) support a more dynamic tuning of kernel-side behaviors, mainly
   OPP selection and task placement

Regarding this last point, this series specifically represents a
proposal for the integration with schedutil. The main usages we are
looking for in Android are:

a) Boosting the OPP selected for certain critical tasks, with the goal
   of speeding up their completion regardless of (potential) energy
   impacts. A kind of "race-to-idle" policy for certain tasks.

b) Capping the OPP selection for certain non-critical tasks, which is
   a major concern especially for RT tasks in a mobile context, but it
   also applies to FAIR tasks representing background activities.

> I gather that there is some experience with the current EAS implementation
> there, so I wonder how this work is related to that.

You're right. We started developing a task boosting strategy a couple
of years ago. The first implementation we did is what is currently
used by the EAS version running on Pixel smartphones.

Since the beginning our attitude has always been "mainline first".
However, we found it extremely valuable to prove both the interface's
design and the feature's benefits on real devices. That's why we keep
backporting these bits to different Android kernels.

Google, whose primary representatives are in CC, is also quite focused
on using mainline solutions for its current and future products.
That's why, after the release of the Pixel devices at the end of last
year, we refreshed and posted the proposal on LKML [1] and collected a
first round of valuable feedback at LPC [2].

This posting is an expression of the feedback collected so far, and
the main goals for us are to:
1) validate once more the soundness of a scheduler-driven run-time
   power-performance control which is based on information collected
   from informed run-times
2) get an agreement on whether the current interface can be considered
   sufficiently "mainline friendly" to have a chance of getting merged
3) rework/refactor what is required if point 2 is not (yet) satisfied

It's worth noticing that these bits are completely independent of EAS.
OPP biasing (i.e. capping/boosting) is a feature which stands by
itself and can be quite useful in many different scenarios where EAS
is not used at all. A simple example is making schedutil behave
concurrently like the powersave governor for certain tasks and the
performance governor for other tasks.

As a final remark, this series is going to be a discussion topic at
the upcoming OSPM summit [3]. It would be nice if we could get there
with sufficient knowledge of the main goals and the current status.
However, let's please keep discussing here all the possible concerns
which can be raised about this proposal.

> Thanks,
> Rafael

Cheers Patrick

[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342
[3] http://retis.sssup.it/ospm-summit/

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-15 11:20     ` Patrick Bellasi
@ 2017-03-15 13:20       ` Joel Fernandes
  2017-03-15 16:10         ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Fernandes @ 2017-03-15 13:20 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Paul McKenney

On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 13-Mar 03:46, Joel Fernandes (Google) wrote:
>> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>> > The CPU CGroup controller allows assigning a specified (maximum)
>> > bandwidth to tasks within a group; however, it does not enforce any
>> > constraint on how such bandwidth can be consumed.
>> > With the integration of schedutil, the scheduler now has the proper
>> > information about a task to select the most suitable frequency to
>> > satisfy the task's needs.
>> [..]
>>
>> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
>> > +                                    struct cftype *cft)
>> > +{
>> > +       struct task_group *tg;
>> > +       u64 min_capacity;
>> > +
>> > +       rcu_read_lock();
>> > +       tg = css_tg(css);
>> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
>>
>> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
>> the write path) to avoid load-tearing?
>
> tg->cap_clamp is an "unsigned int", so I would expect it to be
> read/written with a single memory access, wouldn't it? I mean: I do not
> expect the compiler "to mess" with these accesses.

This depends on the compiler and arch. I'm not sure whether it's an
issue in practice these days, but see the section on 'load tearing' in
Documentation/memory-barriers.txt. If the compiler decided to break
the access down into multiple accesses for some reason, it might be a
problem.

Adding Paul for his expert opinion on the matter ;)

Thanks,
Joel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-15 12:59       ` Joel Fernandes
@ 2017-03-15 14:44         ` Juri Lelli
  2017-03-15 16:13           ` Joel Fernandes
  0 siblings, 1 reply; 66+ messages in thread
From: Juri Lelli @ 2017-03-15 14:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

Hi Joel,

On 15/03/17 05:59, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > On 13-Mar 03:08, Joel Fernandes (Google) wrote:
> >> Hi Patrick,
> >>
> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> >> <patrick.bellasi@arm.com> wrote:
> >> > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.
> >> > Such a mandatory policy can be made more tunable from userspace thus
> >> > allowing for example to define a reasonable max capacity (i.e.
> >> > frequency) which is required for the execution of a specific RT/DL
> >> > workload. This will contribute to make the RT class more "friendly" for
> >> > power/energy sensible applications.
> >> >
> >> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.
> >> > Whenever a task in these classes is RUNNABLE, the capacity required is
> >> > defined by the constraints of the control group that task belongs to.
> >> >
> >>
> >> We briefly discussed this at Linaro Connect that this works well for
> >> sporadic RT tasks that run briefly and then sleep for long periods of
> >> time - so certainly this patch is good, but its only a partial
> >> solution to the problem of frequent and short-sleepers and something
> >> is required to keep the boost active for short non-RUNNABLE as well.
> >> The behavior with many periodic RT tasks is that they will sleep for
> >> short intervals and run for short intervals periodically. In this case
> >> removing the clamp (or the boost as in schedtune v2) on a dequeue will
> >> essentially mean during a narrow window cpufreq can drop the frequency
> >> and only to make it go back up again.
> >>
> >> Currently for schedtune v2, I am working on prototyping something like
> >> the following for Android:
> >> - if RT task is enqueue, introduce the boost.
> >> - When task is dequeued, start a timer for a  "minimum deboost delay
> >> time" before taking out the boost.
> >> - If task is enqueued again before the timer fires, then cancel the timer.
> >>
> >> I don't think any "fix" to this particular issue should be to the
> >> schedutil governor and should be sorted before going to cpufreq itself
> >> (that is before making the request). What do you think about this?
> >
> > My short observations are:
> >
> > 1) for certain RT tasks, which have a quite "predictable" activation
> >    pattern, we should definitively try to use DEADLINE... which will
> >    factor out all "boosting potential races" since the bandwidth
> >    requirements are well defined at task description time.
> 
> I don't immediately see how deadline can fix this, when a task is
> dequeued after end of its current runtime, its bandwidth will be
> subtracted from the active running bandwidth. This is what drives the
> DL part of the capacity request. In this case, we run into the same
> issue as with the boost-removal on dequeue. Isn't it?
> 

Unfortunately, I still have to post the set of patches (based on Luca's
reclaiming set) that introduces driving of clock frequency from
DEADLINE, so I guess anything we discuss about how DEADLINE might help
here may be difficult to follow. :(

I should definitely fix that.

However, trying to quickly summarize how that would work (for those
already somewhat familiar with the reclaiming bits):

 - a task utilization contribution is accounted for (at rq level) as
   soon as it wakes up for the first time in a new period
 - its contribution is then removed after the 0lag time (or when the
   task gets throttled)
 - frequency transitions are triggered accordingly

So, I don't see why triggering a go-down request after the 0lag time
has expired, while quickly reacting to tasks waking up, would create
problems in your case?
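The accounting Juri outlines can be sketched as a toy model: a wakeup adds the task's bandwidth to the rq-level active utilization immediately, and the contribution is only dropped once the 0-lag instant has passed. All names here are illustrative, not symbols from the (unposted) patches:

```c
/* Toy model of rq-level "active utilization" driving frequency
 * selection. Not kernel code: field and function names are invented
 * for illustration of the scheme described above.
 */
struct rq_bw {
	unsigned long long active_bw;	/* sum of contributing bandwidths */
};

struct dl_task {
	unsigned long long bw;		/* runtime/period, fixed point */
	unsigned long long zerolag_t;	/* absolute 0-lag instant */
	int contributing;
};

/* Wakeup: account the bandwidth at once, anticipating the need. */
static void task_wakeup(struct rq_bw *rq, struct dl_task *t)
{
	if (!t->contributing) {
		rq->active_bw += t->bw;
		t->contributing = 1;
	}
}

/* Dequeue/timer path: drop the contribution only after 0-lag
 * (or, in the real scheme, when the task gets throttled).
 */
static void task_maybe_decay(struct rq_bw *rq, struct dl_task *t,
			     unsigned long long now)
{
	if (t->contributing && now >= t->zerolag_t) {
		rq->active_bw -= t->bw;
		t->contributing = 0;
	}
}
```

A frequency request would then be derived from rq->active_bw after each of these transitions.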

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-15 13:20       ` Joel Fernandes
@ 2017-03-15 16:10         ` Paul E. McKenney
  2017-03-15 16:44           ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2017-03-15 16:10 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Tejun Heo

On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> >> <patrick.bellasi@arm.com> wrote:
> >> > The CPU CGroup controller allows to assign a specified (maximum)
> >> > bandwidth to tasks within a group, however it does not enforce any
> >> > constraint on how such bandwidth can be consumed.
> >> > With the integration of schedutil, the scheduler has now the proper
> >> > information about a task to select  the most suitable frequency to
> >> > satisfy tasks needs.
> >> [..]
> >>
> >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> >> > +                                    struct cftype *cft)
> >> > +{
> >> > +       struct task_group *tg;
> >> > +       u64 min_capacity;
> >> > +
> >> > +       rcu_read_lock();
> >> > +       tg = css_tg(css);
> >> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> >>
> >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> >> the write path) to avoid load-tearing?
> >
> > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > memory access to write/read it, isn't it? I mean: I do not expect the
> > compiler "to mess" with these accesses.
> 
> This depends on compiler and arch. I'm not sure if its in practice
> these days an issue, but see section on 'load tearing' in
> Documentation/memory-barriers.txt . If compiler decided to break down
> the access to multiple accesses due to some reason, then might be a
> problem.

The compiler might also be able to inline cpu_capacity_min_read_u64()
and fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
If min_capacity is used several times in the ensuing code, the compiler
could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
best might be a bit confusing.

> Adding Paul for his expert opinion on the matter ;)

My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
I can absolutely prove that the compiler cannot do any destructive
optimizations.  And I not-infrequently find unsuspected opportunities
for destructive optimization in my own code.  Your mileage may vary.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-15 14:44         ` Juri Lelli
@ 2017-03-15 16:13           ` Joel Fernandes
  2017-03-15 16:24             ` Juri Lelli
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Fernandes @ 2017-03-15 16:13 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On Wed, Mar 15, 2017 at 7:44 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Joel,
>
> On 15/03/17 05:59, Joel Fernandes wrote:
>> On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>> > On 13-Mar 03:08, Joel Fernandes (Google) wrote:
>> >> Hi Patrick,
>> >>
>> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>> >> <patrick.bellasi@arm.com> wrote:
>> >> > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.
>> >> > Such a mandatory policy can be made more tunable from userspace thus
>> >> > allowing for example to define a reasonable max capacity (i.e.
>> >> > frequency) which is required for the execution of a specific RT/DL
>> >> > workload. This will contribute to make the RT class more "friendly" for
>> >> > power/energy sensible applications.
>> >> >
>> >> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.
>> >> > Whenever a task in these classes is RUNNABLE, the capacity required is
>> >> > defined by the constraints of the control group that task belongs to.
>> >> >
>> >>
>> >> We briefly discussed this at Linaro Connect that this works well for
>> >> sporadic RT tasks that run briefly and then sleep for long periods of
>> >> time - so certainly this patch is good, but its only a partial
>> >> solution to the problem of frequent and short-sleepers and something
>> >> is required to keep the boost active for short non-RUNNABLE as well.
>> >> The behavior with many periodic RT tasks is that they will sleep for
>> >> short intervals and run for short intervals periodically. In this case
>> >> removing the clamp (or the boost as in schedtune v2) on a dequeue will
>> >> essentially mean during a narrow window cpufreq can drop the frequency
>> >> and only to make it go back up again.
>> >>
>> >> Currently for schedtune v2, I am working on prototyping something like
>> >> the following for Android:
>> >> - if RT task is enqueue, introduce the boost.
>> >> - When task is dequeued, start a timer for a  "minimum deboost delay
>> >> time" before taking out the boost.
>> >> - If task is enqueued again before the timer fires, then cancel the timer.
>> >>
>> >> I don't think any "fix" to this particular issue should be to the
>> >> schedutil governor and should be sorted before going to cpufreq itself
>> >> (that is before making the request). What do you think about this?
>> >
>> > My short observations are:
>> >
>> > 1) for certain RT tasks, which have a quite "predictable" activation
>> >    pattern, we should definitively try to use DEADLINE... which will
>> >    factor out all "boosting potential races" since the bandwidth
>> >    requirements are well defined at task description time.
>>
>> I don't immediately see how deadline can fix this, when a task is
>> dequeued after end of its current runtime, its bandwidth will be
>> subtracted from the active running bandwidth. This is what drives the
>> DL part of the capacity request. In this case, we run into the same
>> issue as with the boost-removal on dequeue. Isn't it?
>>
>
> Unfortunately, I still have to post the set of patches (based on Luca's
> reclaiming set) that introduces driving of clock frequency from
> DEADLINE, so I guess everything we can discuss about how DEADLINE might
> help here might be difficult to understand. :(
>
> I should definitely fix that.

I fully understand, sorry to be discussing this too soon here...

> However, trying to quickly summarize how that would work (for who is
> already somewhat familiar with reclaiming bits):
>
>  - a task utilization contribution is accounted for (at rq level) as
>    soon as it wakes up for the first time in a new period
>  - its contribution is then removed after the 0lag time (or when the
>    task gets throttled)
>  - frequency transitions are triggered accordingly
>
> So, I don't see why triggering a go down request after the 0lag time
> expired and quickly reacting to tasks waking up would have create
> problems in your case?

In my experience, the 'reacting to tasks' bit doesn't work very well.
For short-running periodic tasks, we need to set the frequency to
something and not ramp it down too quickly (for ex, runtime 1.5ms and
period 3ms). In this case the 0-lag time would be < 3ms. I guess if
we're going to use the 0-lag time, then we'd need to set its runtime
and period to be higher than exactly matching the task's? So would we
be assigning the same bandwidth but for R/T instead of r/t (where r, R
are the runtimes and t, T are the periods, and R > r and T > t)?
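For reference, the 0-lag instant being discussed follows the usual CBS definition: t0 = deadline - q/U, where q is the remaining runtime and U = dl_runtime/dl_period. A sketch (plain arithmetic, not any specific kernel function):

```c
/* 0-lag instant for a CBS task, per the standard definition:
 *   t0 = deadline - q / U,  with U = dl_runtime / dl_period
 * computed in integer ns as q * dl_period / dl_runtime to avoid
 * fractions. Illustrative helper, not a kernel symbol.
 */
static unsigned long long zerolag_time(unsigned long long deadline,
				       unsigned long long runtime_left,
				       unsigned long long dl_runtime,
				       unsigned long long dl_period)
{
	return deadline - runtime_left * dl_period / dl_runtime;
}
```

With Joel's example (runtime 1.5ms, period 3ms), a task that has consumed half its budget mid-period reaches 0-lag well before the 3ms deadline, which is the window where the frequency could drop and immediately need to come back up.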

Thanks,
Joel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-15 16:13           ` Joel Fernandes
@ 2017-03-15 16:24             ` Juri Lelli
  2017-03-15 23:40               ` Joel Fernandes
  0 siblings, 1 reply; 66+ messages in thread
From: Juri Lelli @ 2017-03-15 16:24 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On 15/03/17 09:13, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 7:44 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi Joel,
> >
> > On 15/03/17 05:59, Joel Fernandes wrote:
> >> On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi
> >> <patrick.bellasi@arm.com> wrote:
> >> > On 13-Mar 03:08, Joel Fernandes (Google) wrote:
> >> >> Hi Patrick,
> >> >>
> >> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> >> >> <patrick.bellasi@arm.com> wrote:
> >> >> > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.
> >> >> > Such a mandatory policy can be made more tunable from userspace thus
> >> >> > allowing for example to define a reasonable max capacity (i.e.
> >> >> > frequency) which is required for the execution of a specific RT/DL
> >> >> > workload. This will contribute to make the RT class more "friendly" for
> >> >> > power/energy sensible applications.
> >> >> >
> >> >> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.
> >> >> > Whenever a task in these classes is RUNNABLE, the capacity required is
> >> >> > defined by the constraints of the control group that task belongs to.
> >> >> >
> >> >>
> >> >> We briefly discussed this at Linaro Connect that this works well for
> >> >> sporadic RT tasks that run briefly and then sleep for long periods of
> >> >> time - so certainly this patch is good, but its only a partial
> >> >> solution to the problem of frequent and short-sleepers and something
> >> >> is required to keep the boost active for short non-RUNNABLE as well.
> >> >> The behavior with many periodic RT tasks is that they will sleep for
> >> >> short intervals and run for short intervals periodically. In this case
> >> >> removing the clamp (or the boost as in schedtune v2) on a dequeue will
> >> >> essentially mean during a narrow window cpufreq can drop the frequency
> >> >> and only to make it go back up again.
> >> >>
> >> >> Currently for schedtune v2, I am working on prototyping something like
> >> >> the following for Android:
> >> >> - if RT task is enqueue, introduce the boost.
> >> >> - When task is dequeued, start a timer for a  "minimum deboost delay
> >> >> time" before taking out the boost.
> >> >> - If task is enqueued again before the timer fires, then cancel the timer.
> >> >>
> >> >> I don't think any "fix" to this particular issue should be to the
> >> >> schedutil governor and should be sorted before going to cpufreq itself
> >> >> (that is before making the request). What do you think about this?
> >> >
> >> > My short observations are:
> >> >
> >> > 1) for certain RT tasks, which have a quite "predictable" activation
> >> >    pattern, we should definitively try to use DEADLINE... which will
> >> >    factor out all "boosting potential races" since the bandwidth
> >> >    requirements are well defined at task description time.
> >>
> >> I don't immediately see how deadline can fix this, when a task is
> >> dequeued after end of its current runtime, its bandwidth will be
> >> subtracted from the active running bandwidth. This is what drives the
> >> DL part of the capacity request. In this case, we run into the same
> >> issue as with the boost-removal on dequeue. Isn't it?
> >>
> >
> > Unfortunately, I still have to post the set of patches (based on Luca's
> > reclaiming set) that introduces driving of clock frequency from
> > DEADLINE, so I guess everything we can discuss about how DEADLINE might
> > help here might be difficult to understand. :(
> >
> > I should definitely fix that.
> 
> I fully understand, Sorry to be discussing this too soon here...
> 

No problem. I just thought I should clarify before people go WTH are
these guys talking about?! :)

> > However, trying to quickly summarize how that would work (for who is
> > already somewhat familiar with reclaiming bits):
> >
> >  - a task utilization contribution is accounted for (at rq level) as
> >    soon as it wakes up for the first time in a new period
> >  - its contribution is then removed after the 0lag time (or when the
> >    task gets throttled)
> >  - frequency transitions are triggered accordingly
> >
> > So, I don't see why triggering a go down request after the 0lag time
> > expired and quickly reacting to tasks waking up would have create
> > problems in your case?
> 
> In my experience, the 'reacting to tasks' bit doesn't work very well.

Humm.. but in this case we won't be 'reacting', we will be
'anticipating' tasks' needs, right?

> For short running period tasks, we need to set the frequency to
> something and not ramp it down too quickly (for ex, runtime 1.5ms and
> period 3ms). In this case the 0-lag time would be < 3ms. I guess if
> we're going to use 0-lag time, then we'd need to set it runtime and
> period to be higher than exactly matching the task's? So would we be
> assigning the same bandwidth but for R/T instead of r/t (Where r, R
> are the runtimes and t,T are periods, and R > r and T > t)?
> 

In general, I guess, you could let the Period be the task's period and
set the Runtime to be somewhat greater than the task's runtime (so as
to account for system overhead and such, e.g. hw limits regarding
frequency switching).
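Concretely, this would mean filling struct sched_attr (as described in sched_setattr(2); glibc does not export the struct, so it is declared by hand) with an inflated Runtime and the task's real Period. The 0.5ms headroom below is an arbitrary example, not a recommendation:

```c
#include <stdint.h>
#include <string.h>

/* Per sched_setattr(2); declared by hand since glibc lacks it. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

#define SCHED_DEADLINE	6

/* Fill attrs for an implicit-deadline task: Runtime inflated by a
 * chosen headroom (R > r), Period equal to the task's period (T == t).
 */
static void fill_dl_attr(struct sched_attr *a, uint64_t task_runtime_ns,
			 uint64_t task_period_ns, uint64_t headroom_ns)
{
	memset(a, 0, sizeof(*a));
	a->size = sizeof(*a);
	a->sched_policy = SCHED_DEADLINE;
	a->sched_runtime = task_runtime_ns + headroom_ns;
	a->sched_deadline = task_period_ns;
	a->sched_period = task_period_ns;
	/* then: syscall(SYS_sched_setattr, 0, a, 0); -- needs privilege */
}
```

The actual sched_setattr() call requires CAP_SYS_NICE (or admission control to pass), so it is left as a comment here.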

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-15 16:10         ` Paul E. McKenney
@ 2017-03-15 16:44           ` Patrick Bellasi
  2017-03-15 17:24             ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-15 16:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Tejun Heo

On 15-Mar 09:10, Paul E. McKenney wrote:
> On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > <patrick.bellasi@arm.com> wrote:
> > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > >> <patrick.bellasi@arm.com> wrote:
> > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > >> > bandwidth to tasks within a group, however it does not enforce any
> > >> > constraint on how such bandwidth can be consumed.
> > >> > With the integration of schedutil, the scheduler has now the proper
> > >> > information about a task to select  the most suitable frequency to
> > >> > satisfy tasks needs.
> > >> [..]
> > >>
> > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > >> > +                                    struct cftype *cft)
> > >> > +{
> > >> > +       struct task_group *tg;
> > >> > +       u64 min_capacity;
> > >> > +
> > >> > +       rcu_read_lock();
> > >> > +       tg = css_tg(css);
> > >> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > >>
> > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > >> the write path) to avoid load-tearing?
> > >
> > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > compiler "to mess" with these accesses.
> > 
> > This depends on compiler and arch. I'm not sure if its in practice
> > these days an issue, but see section on 'load tearing' in
> > Documentation/memory-barriers.txt . If compiler decided to break down
> > the access to multiple accesses due to some reason, then might be a
> > problem.
> 
> The compiler might also be able to inline cpu_capacity_min_read_u64()
> fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> If min_capacity is used several times in the ensuing code, the compiler
> could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> best might be a bit confusing.

That's actually an interesting case; however, I don't think it applies
here, since cpu_capacity_min_read_u64() is called only via a function
pointer and thus will never be inlined. Isn't it?

> > Adding Paul for his expert opinion on the matter ;)
> 
> My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> I can absolutely prove that the compiler cannot do any destructive
> optimizations.  And I not-infrequently find unsuspected opportunities
> for destructive optimization in my own code.  Your mileage may vary.  ;-)

I guess here the main concern from Joel is that such a pattern:

   u64 var = unsigned_int_value_from_memory;

can result in a couple of "load from memory" operations.

In that case, a similar:

  unsigned_int_left_value = new_unsigned_int_value;

executed on a different thread can overlap with the previous memory
read operations, ending up with "var" containing an inconsistent
value.

Question is: can this really happen, given the data types in use?
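The widening pattern in question can be written with the access forced through a single volatile load, which is essentially what READ_ONCE does; the macros below are simplified userspace stand-ins for the kernel's, and the variable name is invented:

```c
/* Simplified stand-ins for the kernel macros under discussion. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static unsigned int shared_cap;	/* written by one thread... */

/* Updater side: a single, non-torn store. */
static void write_cap(unsigned int v)
{
	WRITE_ONCE(shared_cap, v);
}

/* ...read and widened to u64 by another. The single volatile access
 * keeps the compiler from splitting or repeating the load, so on
 * architectures where an aligned int access is atomic, "var" can
 * never observe a half-updated value.
 */
static unsigned long long read_widened(void)
{
	return READ_ONCE(shared_cap);
}
```

Whether a given compiler would actually tear a plain aligned-int access is exactly the open question here; the macros make the intended single access explicit either way.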


> 							Thanx, Paul

Thanks! ;-)

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-15 16:44           ` Patrick Bellasi
@ 2017-03-15 17:24             ` Paul E. McKenney
  2017-03-15 17:57               ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2017-03-15 17:24 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Joel Fernandes, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Tejun Heo

On Wed, Mar 15, 2017 at 04:44:39PM +0000, Patrick Bellasi wrote:
> On 15-Mar 09:10, Paul E. McKenney wrote:
> > On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > > <patrick.bellasi@arm.com> wrote:
> > > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > > >> <patrick.bellasi@arm.com> wrote:
> > > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > > >> > bandwidth to tasks within a group, however it does not enforce any
> > > >> > constraint on how such bandwidth can be consumed.
> > > >> > With the integration of schedutil, the scheduler has now the proper
> > > >> > information about a task to select  the most suitable frequency to
> > > >> > satisfy tasks needs.
> > > >> [..]
> > > >>
> > > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > > >> > +                                    struct cftype *cft)
> > > >> > +{
> > > >> > +       struct task_group *tg;
> > > >> > +       u64 min_capacity;
> > > >> > +
> > > >> > +       rcu_read_lock();
> > > >> > +       tg = css_tg(css);
> > > >> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > > >>
> > > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > > >> the write path) to avoid load-tearing?
> > > >
> > > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > > compiler "to mess" with these accesses.
> > > 
> > > This depends on compiler and arch. I'm not sure if its in practice
> > > these days an issue, but see section on 'load tearing' in
> > > Documentation/memory-barriers.txt . If compiler decided to break down
> > > the access to multiple accesses due to some reason, then might be a
> > > problem.
> > 
> > The compiler might also be able to inline cpu_capacity_min_read_u64()
> > fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> > If min_capacity is used several times in the ensuing code, the compiler
> > could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> > best might be a bit confusing.
> 
> That's actually an interesting case, however I don't think it applies
> in this case since cpu_capacity_min_read_u64() is called only via
> a function poninter and thus it will never be inlined. isn't it?
> 
> > > Adding Paul for his expert opinion on the matter ;)
> > 
> > My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> > I can absolutely prove that the compiler cannot do any destructive
> > optimizations.  And I not-infrequently find unsuspected opportunities
> > for destructive optimization in my own code.  Your mileage may vary.  ;-)
> 
> I guess here the main concern from Joel is that such a pattern:
> 
>    u64 var = unsigned_int_value_from_memory;
> 
> can result is a couple of "load from memory" operations.

Indeed it can.  I first learned this the hard way in the early 1990s,
so 20-year-old compiler optimizations are quite capable of making this
sort of thing happen.

> In that case a similar:
> 
>   unsigned_int_left_value = new_unsigned_int_value;
> 
> executed on a different thread can overlap with the previous memory
> read operations and ending up in "var" containing a not consistent
> value.
> 
> Question is: can this really happen, given the data types in use?

So we have an updater changing the value of unsigned_int_left_value,
while readers in other threads are accessing it, correct?  And you
are asking whether the compiler can optimize the updater so as to
mess up the readers, right?

One such optimization would be a byte-wise write, though I have no
idea why a compiler would do such a thing assuming that the variable
is reasonably sized and aligned.  Another is that the compiler could
use the variable as temporary storage just before the assignment.
(You haven't told the compiler that anyone else is reading it, though
I don't know of this being done by production compilers.)  A third is
that the compiler could fuse successive stores, which might or might
not be a problem, depending.

Probably more, but that should be a start.  ;-)

							Thanx, Paul

> Thanks! ;-)
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-15 17:24             ` Paul E. McKenney
@ 2017-03-15 17:57               ` Patrick Bellasi
  0 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-15 17:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Tejun Heo

On 15-Mar 10:24, Paul E. McKenney wrote:
> On Wed, Mar 15, 2017 at 04:44:39PM +0000, Patrick Bellasi wrote:
> > On 15-Mar 09:10, Paul E. McKenney wrote:
> > > On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > > > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > > > <patrick.bellasi@arm.com> wrote:
> > > > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > > > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > > > >> <patrick.bellasi@arm.com> wrote:
> > > > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > > > >> > bandwidth to tasks within a group, however it does not enforce any
> > > > >> > constraint on how such bandwidth can be consumed.
> > > > >> > With the integration of schedutil, the scheduler has now the proper
> > > > >> > information about a task to select  the most suitable frequency to
> > > > >> > satisfy tasks needs.
> > > > >> [..]
> > > > >>
> > > > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > > > >> > +                                    struct cftype *cft)
> > > > >> > +{
> > > > >> > +       struct task_group *tg;
> > > > >> > +       u64 min_capacity;
> > > > >> > +
> > > > >> > +       rcu_read_lock();
> > > > >> > +       tg = css_tg(css);
> > > > >> > +       min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > > > >>
> > > > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > > > >> the write path) to avoid load-tearing?
> > > > >
> > > > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > > > compiler "to mess" with these accesses.
> > > > 
> > > > This depends on compiler and arch. I'm not sure if its in practice
> > > > these days an issue, but see section on 'load tearing' in
> > > > Documentation/memory-barriers.txt . If compiler decided to break down
> > > > the access to multiple accesses due to some reason, then might be a
> > > > problem.
> > > 
> > > The compiler might also be able to inline cpu_capacity_min_read_u64()
> > > fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> > > If min_capacity is used several times in the ensuing code, the compiler
> > > could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> > > best might be a bit confusing.
> > 
> > That's actually an interesting case, however I don't think it applies
> > in this case since cpu_capacity_min_read_u64() is called only via
> > a function poninter and thus it will never be inlined. isn't it?
> > 
> > > > Adding Paul for his expert opinion on the matter ;)
> > > 
> > > My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> > > I can absolutely prove that the compiler cannot do any destructive
> > > optimizations.  And I not-infrequently find unsuspected opportunities
> > > for destructive optimization in my own code.  Your mileage may vary.  ;-)
> > 
> > I guess here the main concern from Joel is that such a pattern:
> > 
> >    u64 var = unsigned_int_value_from_memory;
> > 
> > can result is a couple of "load from memory" operations.
> 
> Indeed it can.  I first learned this the hard way in the early 1990s,
> so 20-year-old compiler optimizations are quite capable of making this
> sort of thing happen.
> 
> > In that case a similar:
> > 
> >   unsigned_int_left_value = new_unsigned_int_value;
> > 
> > executed on a different thread can overlap with the previous memory
> > read operations and ending up in "var" containing a not consistent
> > value.
> > 
> > Question is: can this really happen, given the data types in use?
> 
> So we have an updater changing the value of unsigned_int_left_value,
> while readers in other threads are accessing it, correct?  And you
> are asking whether the compiler can optimize the updater so as to
> mess up the readers, right?
> 
> One such optimization would be a byte-wise write, though I have no
> idea why a compiler would do such a thing assuming that the variable
> is reasonably sized and aligned.  Another is that the compiler could
> use the variable as temporary storage just before the assignment.
> (You haven't told the compiler that anyone else is reading it, though
> I don't know of this being done by production compilers.)  A third is
> that the compiler could fuse successive stores, which might or might
> not be a problem, depending.
> 
> Probably more, but that should be a start.  ;-)

Definitively yes! :)

Thanks for the detailed explanation, I'll add the READ_ONCE as
Joel suggested! +1

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-15 16:24             ` Juri Lelli
@ 2017-03-15 23:40               ` Joel Fernandes
  2017-03-16 11:16                 ` Juri Lelli
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Fernandes @ 2017-03-15 23:40 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
[..]
>
>> > However, trying to quickly summarize how that would work (for who is
>> > already somewhat familiar with reclaiming bits):
>> >
>> >  - a task utilization contribution is accounted for (at rq level) as
>> >    soon as it wakes up for the first time in a new period
>> >  - its contribution is then removed after the 0lag time (or when the
>> >    task gets throttled)
>> >  - frequency transitions are triggered accordingly
>> >
>> > So, I don't see why triggering a go down request after the 0lag time
>> > expired and quickly reacting to tasks waking up would have create
>> > problems in your case?
>>
>> In my experience, the 'reacting to tasks' bit doesn't work very well.
>
> Humm.. but in this case we won't be 'reacting', we will be
> 'anticipating' tasks' needs, right?

Are you saying we will start ramping frequency before the next
activation so that we're ready for it?

If not, it sounds like it will only make the frequency request on the
next activation, when the Active bandwidth increases due to the task
waking up. By then the task has already started to run, right?

Thanks,
Joel


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-15 12:59   ` Patrick Bellasi
@ 2017-03-16  1:04     ` Rafael J. Wysocki
  2017-03-16  3:15       ` Joel Fernandes
  2017-03-16 12:23       ` Patrick Bellasi
  0 siblings, 2 replies; 66+ messages in thread
From: Rafael J. Wysocki @ 2017-03-16  1:04 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Linux PM,
	Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Todd Kjos, Tim Murray,
	Andres Oportus, Joel Fernandes, Juri Lelli, Morten Rasmussen,
	Dietmar Eggemann

On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 15-Mar 12:41, Rafael J. Wysocki wrote:
>> On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
>> > Was: SchedTune: central, scheduler-driven, power-perfomance control
>> >
>> > This series presents a possible alternative design for what has been presented
>> > in the past as SchedTune. This redesign has been defined to address the main
>> > concerns and comments collected in the LKML discussion [1] as well at the last
>> > LPC [2].
>> > The aim of this posting is to present a working prototype which implements
>> > what has been discussed [2] with people like PeterZ, PaulT and TejunH.
>> >
>> > The main differences with respect to the previous proposal [1] are:
>> >  1. Task boosting/capping is now implemented as an extension on top of
>> >     the existing CGroup CPU controller.
>> >  2. The previous boosting strategy, based on the inflation of the CPU's
>> >     utilization, has been now replaced by a more simple yet effective set
>> >     of capacity constraints.
>> >
>> > The proposed approach allows to constrain the minimum and maximum capacity
>> > of a CPU depending on the set of tasks currently RUNNABLE on that CPU.
>> > The set of active constraints are tracked by the core scheduler, thus they
>> > apply across all the scheduling classes. The value of the constraints are
>> > used to clamp the CPU utilization when the schedutil CPUFreq's governor
>> > selects a frequency for that CPU.
>> >
>> > This means that the new proposed approach allows to extend the concept of
>> > tasks classification to frequencies selection, thus allowing informed
>> > run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different
>> > optimization policies such as:
>> >  a) Boosting of important tasks, by enforcing a minimum capacity in the
>> >     CPUs where they are enqueued for execution.
>> >  b) Capping of background tasks, by enforcing a maximum capacity.
>> >  c) Containment of OPPs for RT tasks which cannot easily be switched to
>> >     the usage of the DL class, but still don't need to run at the maximum
>> >     frequency.
>>
>> Do you have any practical examples of that, like for example what exactly
>> Android is going to use this for?
>
> In general, every "informed run-time" usually know quite a lot about
> tasks requirements and how they impact the user experience.
>
> In Android for example tasks are classified depending on their _current_
> role. We can distinguish for example between:
>
> - TOP_APP:    which are tasks currently affecting the UI, i.e. part of
>               the app currently in foreground
> - BACKGROUND: which are tasks not directly impacting the user
>               experience
>
> Given these information it could make sense to adopt different
> service/optimization policy for different tasks.
> For example, we can be interested in
> giving maximum responsiveness to TOP_APP tasks while we still want to
> be able to save as much energy as possible for the BACKGROUND tasks.
>
> That's where the proposal in this series (partially) comes on hand.

A question: Does "responsiveness" translate directly to "capacity" somehow?

Moreover, how exactly is "responsiveness" defined?

> What we propose is a "standard" interface to collect sensible
> information from "informed run-times" which can be used to:
>
> a) classify tasks according to the main optimization goals:
>    performance boosting vs energy saving
>
> b) support a more dynamic tuning of kernel side behaviors, mainly
>    OPPs selection and tasks placement
>
> Regarding this last point, this series specifically represents a
> proposal for the integration with schedutil. The main usages we are
> looking for in Android are:
>
> a) Boosting the OPP selected for certain critical tasks, with the goal
>    to speed-up their completion regardless of (potential) energy impacts.
>    A kind-of "race-to-idle" policy for certain tasks.

It looks like this could be addressed by adding a "this task should
race to idle" flag too.

> b) Capping the OPP selection for certain non critical tasks, which is
>    a major concerns especially for RT tasks in mobile context, but
>    it also apply to FAIR tasks representing background activities.

Well, is the information on how much CPU capacity to assign to those
tasks really there in user space?  What's the source of it if so?

>> I gather that there is some experience with the current EAS implementation
>> there, so I wonder how this work is related to that.
>
> You right. We started developing a task boosting strategy a couple of
> years ago. The first implementation we did is what is currently in use
> by the EAS version in used on Pixel smartphones.
>
> Since the beginning our attitude has always been "mainline first".
> However, we found it extremely valuable to proof both interface's
> design and feature's benefits on real devices. That's why we keep
> backporting these bits on different Android kernels.
>
> Google, which primary representatives are in CC, is also quite focused
> on using mainline solutions for their current and future solutions.
> That's why, after the release of the Pixel devices end of last year,
> we refreshed and posted the proposal on LKML [1] and collected a first
> run of valuable feedbacks at LCP [2].

Thanks for the info, but my question was more about how it was related
from the technical angle.  IOW, there surely is some experience
related to how user space can deal with energy problems and I would
expect that experience to be an important factor in designing a kernel
interface for that user space, so I wonder if any particular needs of
the Android user space are addressed here.

I'm not intimately familiar with Android, so I guess I would like to
be educated somewhat on that. :-)

> This posting is an expression of the feedbacks collected so far and
> the main goal for us are:
> 1) validate once more the soundness of a scheduler-driven run-time
>    power-performance control which is based on information collected
>    from informed run-time
> 2) get an agreement on whether the current interface can be considered
>    sufficiently "mainline friendly" to have a chance to get merged
> 3) rework/refactor what is required if point 2 is not (yet) satisfied

My definition of "mainline friendly" may be different from someone
else's, but I usually want to know two things:
 1. What problem exactly is at hand.
 2. What alternative ways of addressing it have been considered and
why the particular one proposed has been chosen over the other ones.

At the moment I don't feel like I have enough information in both aspects.

For example, if you said "Android wants to do XYZ because of ABC and
that's how we want to make that possible, and it also could be done in
the other GHJ ways, but they are not attractive and here's why etc"
that would help quite a bit from my POV.

> It's worth to notice that these bits are completely independent from
> EAS. OPP biasing (i.e. capping/boosting) is a feature which stand by
> itself and it can be quite useful in many different scenarios where
> EAS is not used at all. A simple example is making schedutil to behave
> concurrently like the powersave governor for certain tasks and the
> performance governor for other tasks.

That's fine in theory, but honestly an interface like this will be a
maintenance burden, and adding it just because it may be useful to
somebody doesn't sound serious enough.

IOW, I'd like to be able to say "this is going to be used by user
space X to do A and that's how etc." if somebody asks me about that,
which honestly I can't at this point.

>
> As a final remark, this series is going to be a discussion topic in
> the upcoming OSPM summit [3]. It would be nice if we can get there
> with a sufficient knowledge of the main goals and the current status.

I'm not sure what you mean here, sorry.

> However, please let's keep discussing here about all the possible
> concerns which can be raised about this proposal.

OK

Thanks,
Rafael


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-16  1:04     ` Rafael J. Wysocki
@ 2017-03-16  3:15       ` Joel Fernandes
  2017-03-20 22:51         ` Rafael J. Wysocki
  2017-03-16 12:23       ` Patrick Bellasi
  1 sibling, 1 reply; 66+ messages in thread
From: Joel Fernandes @ 2017-03-16  3:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Patrick Bellasi, Rafael J. Wysocki, Linux Kernel Mailing List,
	Linux PM, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann

Hi Rafael,

On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
>>> Do you have any practical examples of that, like for example what exactly
>>> Android is going to use this for?
>>
>> In general, every "informed run-time" usually know quite a lot about
>> tasks requirements and how they impact the user experience.
>>
>> In Android for example tasks are classified depending on their _current_
>> role. We can distinguish for example between:
>>
>> - TOP_APP:    which are tasks currently affecting the UI, i.e. part of
>>               the app currently in foreground
>> - BACKGROUND: which are tasks not directly impacting the user
>>               experience
>>
>> Given these information it could make sense to adopt different
>> service/optimization policy for different tasks.
>> For example, we can be interested in
>> giving maximum responsiveness to TOP_APP tasks while we still want to
>> be able to save as much energy as possible for the BACKGROUND tasks.
>>
>> That's where the proposal in this series (partially) comes on hand.
>
> A question: Does "responsiveness" translate directly to "capacity" somehow?
>
> Moreover, how exactly is "responsiveness" defined?

Responsiveness is basically how quickly the UI responds to user
interaction after doing its computation, application logic and
rendering. Android apps have two important threads: the main thread
(or UI thread), which does all the work and computation for the app,
and a Render thread, which does the rendering and submission of frames
to the display pipeline for further composition and display.

We wish to bias towards performance rather than energy for this work,
since it is front facing to the user and we don't care much about
energy for these tasks at this point. What's most critical is that the
work completes as quickly as possible, so the user experience doesn't
suffer from a noticeable performance issue.

One metric to define this is "jank", where we drop frames and aren't
able to render on time. One of the reasons this can happen is that the
main thread (UI thread) took longer than expected for some
computation. Whatever the interface, we'd just like to bias the
scheduling and frequency guidance to be more concerned with
performance and less with energy, and use this information for both
frequency selection and task placement. "What we need" is also app
dependent, since every app has its own main thread and is free to
compute whatever it needs. So Android can't estimate this, but we do
know that an app is user facing, so in broad terms the interface is
used to say "please don't sacrifice performance for these top-apps",
without accurately defining what these performance needs really are,
because we don't know them.
For the YouTube app, for example, the complexity of the video decoding
and the frame rate are very variable depending on the encoding scheme
and the video being played. The flushing of the frames through the
display pipeline is also variable (the frame rate depends on the video
being decoded), so this work is variable and we can't say for sure in
definitive terms how much capacity we need.

What we can do with Patrick's work is take the worst case based on
measurements and specify, say, that we need at least this much
capacity regardless of what load-tracking thinks we need, and then we
can scale frequency accordingly. This is the use case for the minimum
capacity in his clamping patch. This is still not perfect in terms of
defining something accurately, because we don't even know how much we
need, but at least in broad terms we have some way of telling the
governor to maintain at least X capacity.

For the clamping of maximum capacity, there are use cases like the
background tasks Patrick mentioned, but also use cases where we don't
want to run at max frequency even though load-tracking thinks that we
need to. For example, there are cases, such as foreground camera
tasks, where we want to provide sustainable performance without
entering thermal throttling, so the capping will help there.

>> What we propose is a "standard" interface to collect sensible
>> information from "informed run-times" which can be used to:
>>
>> a) classify tasks according to the main optimization goals:
>>    performance boosting vs energy saving
>>
>> b) support a more dynamic tuning of kernel side behaviors, mainly
>>    OPPs selection and tasks placement
>>
>> Regarding this last point, this series specifically represents a
>> proposal for the integration with schedutil. The main usages we are
>> looking for in Android are:
>>
>> a) Boosting the OPP selected for certain critical tasks, with the goal
>>    to speed-up their completion regardless of (potential) energy impacts.
>>    A kind-of "race-to-idle" policy for certain tasks.
>
> It looks like this could be addressed by adding a "this task should
> race to idle" flag too.

But he said "kind-of" race-to-idle. Racing to idle all the time, for
example at max frequency, would be wasteful of energy, so although we
don't care about energy much for top-apps, we do care a bit.

>
>> b) Capping the OPP selection for certain non critical tasks, which is
>>    a major concerns especially for RT tasks in mobile context, but
>>    it also apply to FAIR tasks representing background activities.
>
> Well, is the information on how much CPU capacity assign to those
> tasks really there in user space?  What's the source of it if so?

I believe this is just a matter of tuning and modeling for what is
needed. For example, to prevent thermal throttling as I mentioned, and
also to ensure background activities aren't running at the highest
frequency and consuming excessive energy. Racing to idle at a higher
frequency costs more energy than running slower to idle, since we run
at higher voltages at higher frequencies and the slope of the perf/W
curve is steeper: P = C * V^2 * f. So the V component being higher
just drains quadratically more power, which is of no use to background
tasks. In fact, in some tests we're just as happy setting them at much
lower frequencies than what load-tracking thinks is needed.

>>> I gather that there is some experience with the current EAS implementation
>>> there, so I wonder how this work is related to that.
>>
>> You right. We started developing a task boosting strategy a couple of
>> years ago. The first implementation we did is what is currently in use
>> by the EAS version in used on Pixel smartphones.
>>
>> Since the beginning our attitude has always been "mainline first".
>> However, we found it extremely valuable to proof both interface's
>> design and feature's benefits on real devices. That's why we keep
>> backporting these bits on different Android kernels.
>>
>> Google, which primary representatives are in CC, is also quite focused
>> on using mainline solutions for their current and future solutions.
>> That's why, after the release of the Pixel devices end of last year,
>> we refreshed and posted the proposal on LKML [1] and collected a first
>> run of valuable feedbacks at LCP [2].
>
> Thanks for the info, but my question was more about how it was related
> from the technical angle.  IOW, there surely is some experience
> related to how user space can deal with energy problems and I would
> expect that experience to be an important factor in designing a kernel
> interface for that user space, so I wonder if any particular needs of
> the Android user space are addressed here.
>
> I'm not intimately familiar with Android, so I guess I would like to
> be educated somewhat on that. :-)

Hope this sheds some light on the Android side of things.

Regards,
Joel


* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-15 23:40               ` Joel Fernandes
@ 2017-03-16 11:16                 ` Juri Lelli
  2017-03-16 12:27                   ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Juri Lelli @ 2017-03-16 11:16 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On 15/03/17 16:40, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> [..]
> >
> >> > However, trying to quickly summarize how that would work (for who is
> >> > already somewhat familiar with reclaiming bits):
> >> >
> >> >  - a task utilization contribution is accounted for (at rq level) as
> >> >    soon as it wakes up for the first time in a new period
> >> >  - its contribution is then removed after the 0lag time (or when the
> >> >    task gets throttled)
> >> >  - frequency transitions are triggered accordingly
> >> >
> >> > So, I don't see why triggering a go down request after the 0lag time
> >> > expired and quickly reacting to tasks waking up would have create
> >> > problems in your case?
> >>
> >> In my experience, the 'reacting to tasks' bit doesn't work very well.
> >
> > Humm.. but in this case we won't be 'reacting', we will be
> > 'anticipating' tasks' needs, right?
> 
> Are you saying we will start ramping frequency before the next
> activation so that we're ready for it?
> 

I'm saying that there is no need to ramp, simply select the frequency
that is needed for a task (or a set of them).

> If not, it sounds like it will only make the frequency request on the
> next activation when the Active bandwidth increases due to the task
> waking up. By then task has already started to run, right?
> 

When the task is enqueued back we select the frequency considering its
bandwidth request (and the bandwidth/utilization of the others). So,
when it actually starts running it will already have enough capacity to
finish in time.


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-16  1:04     ` Rafael J. Wysocki
  2017-03-16  3:15       ` Joel Fernandes
@ 2017-03-16 12:23       ` Patrick Bellasi
  1 sibling, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-16 12:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Linux PM,
	Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
	Paul Turner, Vincent Guittot, John Stultz, Todd Kjos, Tim Murray,
	Andres Oportus, Joel Fernandes, Juri Lelli, Morten Rasmussen,
	Dietmar Eggemann

On 16-Mar 02:04, Rafael J. Wysocki wrote:
> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > On 15-Mar 12:41, Rafael J. Wysocki wrote:
> >> On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
> >> > Was: SchedTune: central, scheduler-driven, power-perfomance control
> >> >
> >> > This series presents a possible alternative design for what has been presented
> >> > in the past as SchedTune. This redesign has been defined to address the main
> >> > concerns and comments collected in the LKML discussion [1] as well at the last
> >> > LPC [2].
> >> > The aim of this posting is to present a working prototype which implements
> >> > what has been discussed [2] with people like PeterZ, PaulT and TejunH.
> >> >
> >> > The main differences with respect to the previous proposal [1] are:
> >> >  1. Task boosting/capping is now implemented as an extension on top of
> >> >     the existing CGroup CPU controller.
> >> >  2. The previous boosting strategy, based on the inflation of the CPU's
> >> >     utilization, has been now replaced by a more simple yet effective set
> >> >     of capacity constraints.
> >> >
> >> > The proposed approach allows to constrain the minimum and maximum capacity
> >> > of a CPU depending on the set of tasks currently RUNNABLE on that CPU.
> >> > The set of active constraints are tracked by the core scheduler, thus they
> >> > apply across all the scheduling classes. The value of the constraints are
> >> > used to clamp the CPU utilization when the schedutil CPUFreq's governor
> >> > selects a frequency for that CPU.
> >> >
> >> > This means that the new proposed approach allows to extend the concept of
> >> > tasks classification to frequencies selection, thus allowing informed
> >> > run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different
> >> > optimization policies such as:
> >> >  a) Boosting of important tasks, by enforcing a minimum capacity in the
> >> >     CPUs where they are enqueued for execution.
> >> >  b) Capping of background tasks, by enforcing a maximum capacity.
> >> >  c) Containment of OPPs for RT tasks which cannot easily be switched to
> >> >     the usage of the DL class, but still don't need to run at the maximum
> >> >     frequency.
> >>
> >> Do you have any practical examples of that, like for example what exactly
> >> Android is going to use this for?
> >
> > In general, every "informed run-time" usually know quite a lot about
> > tasks requirements and how they impact the user experience.
> >
> > In Android for example tasks are classified depending on their _current_
> > role. We can distinguish for example between:
> >
> > - TOP_APP:    which are tasks currently affecting the UI, i.e. part of
> >               the app currently in foreground
> > - BACKGROUND: which are tasks not directly impacting the user
> >               experience
> >
> > Given these information it could make sense to adopt different
> > service/optimization policy for different tasks.
> > For example, we can be interested in
> > giving maximum responsiveness to TOP_APP tasks while we still want to
> > be able to save as much energy as possible for the BACKGROUND tasks.
> >
> > That's where the proposal in this series (partially) comes on hand.
> 
> A question: Does "responsiveness" translate directly to "capacity" somehow?
> 
> Moreover, how exactly is "responsiveness" defined?

A) "responsiveness" correlates somehow with "capacity". It's subject
   to profiling which, for some critical system components, can be
   done in an app-independent way.

   Optimization of the rendering pipeline is an example. Other system
   services, which are provided by Android to all applications, are
   also examples of where the integrator can tune and optimize to
   give benefits across all apps.

B) the definition of "responsiveness", from a certain perspective, is
   more "qualitative" than "quantitative".

   Android is aware about different "application contexts", TOP_APP vs
   FOREGROUND is just an example (there are others).
   Thus, the run-time has the knowledge about the "qualitative
   responsiveness" required by each context.

   Moreover, Android integrators know about the specific HW they are
   targeting. This knowledge, in addition to the "application
   contexts", in our experience allows Android to feed valuable
   input to both the scheduler and schedutil.

Of course, as Joel pointed out in his previous response,
responsiveness also has a "quantitative" definition, where "jank
frames" is the main metric in the Android world. The proposed
interface gives integrators a useful knob to tune their platform for
the power-vs-performance trade-off they prefer.

> > What we propose is a "standard" interface to collect sensible
> > information from "informed run-times" which can be used to:
> >
> > a) classify tasks according to the main optimization goals:
> >    performance boosting vs energy saving
> >
> > b) support a more dynamic tuning of kernel side behaviors, mainly
> >    OPPs selection and tasks placement
> >
> > Regarding this last point, this series specifically represents a
> > proposal for the integration with schedutil. The main usages we are
> > looking for in Android are:
> >
> > a) Boosting the OPP selected for certain critical tasks, with the goal
> >    to speed-up their completion regardless of (potential) energy impacts.
> >    A kind-of "race-to-idle" policy for certain tasks.
> 
> It looks like this could be addressed by adding a "this task should
> race to idle" flag too.

With the proposed interface we don't need an additional flag. If you
set capacity_min=capacity_max=1024 then you are informing schedutil,
and the scheduler as well, that this task would like to race-to-idle.

I say "would like" because we are not proposing a mandatory interface
here; we are still in the domain of "best effort" guarantees.

> > b) Capping the OPP selection for certain non critical tasks, which is
> >    a major concerns especially for RT tasks in mobile context, but
> >    it also apply to FAIR tasks representing background activities.
> 
> Well, is the information on how much CPU capacity assign to those
> tasks really there in user space?  What's the source of it if so?

I think my previous comment, two paragraphs above, should have
contributed to address this question.

I'm still wondering whether you are after a formal, scientific and
mathematical definition of CPU capacity demands.
Because in that case it's worth stressing that this is not the aim of
the proposed interface.

If you have such detailed information you are probably better
positioned to go for a different solution, perhaps using DEADLINE.
If instead you are dealing with FAIR tasks but still find the
(completely application-context transparent) in-kernel utilization
tracking mechanism insufficient, then you can give value to any kind
of user-space input about tasks' requirements in each and every
instant.

Notice that these requirements are not set by tasks themselves;
instead they come from the run-time's knowledge.
Thus, the main point is not "how to precisely measure CPU demands" but
how to feed additional and useful _context sensitive_ information from
user-space to kernel-space.

> >> I gather that there is some experience with the current EAS implementation
> >> there, so I wonder how this work is related to that.
> >
> > You right. We started developing a task boosting strategy a couple of
> > years ago. The first implementation we did is what is currently in use
> > by the EAS version in used on Pixel smartphones.
> >
> > Since the beginning our attitude has always been "mainline first".
> > However, we found it extremely valuable to proof both interface's
> > design and feature's benefits on real devices. That's why we keep
> > backporting these bits on different Android kernels.
> >
> > Google, which primary representatives are in CC, is also quite focused
> > on using mainline solutions for their current and future solutions.
> > That's why, after the release of the Pixel devices end of last year,
> > we refreshed and posted the proposal on LKML [1] and collected a first
> > run of valuable feedbacks at LCP [2].
> 
> Thanks for the info, but my question was more about how it was related
> from the technical angle.  IOW, there surely is some experience
> related to how user space can deal with energy problems and I would
> expect that experience to be an important factor in designing a kernel
> interface for that user space, so I wonder if any particular needs of
> the Android user space are addressed here.

We are not addressing specific needs of the Android user-space,
although we used Android as our main design and testing vehicle.
Still, the concepts covered by this proposal aim to be suitable for a
better integration of any "informed run-time" running on top of the
Linux kernel.

> I'm not intimately familiar with Android, so I guess I would like to
> be educated somewhat on that. :-)

Android is just one of such possible run-times, and a notable
representative of the mobile world.

ChromeOS is another notable potential user, which is mainly
representative of the laptops/clamshell world.

Finally, every "container manager", mainly used in the server domain,
can potentially benefit from the proposed interface (e.g.
Kubernetes).

The point here is that we have many different instances of user-space
run-times which know a lot more about the "user-space contexts" than
what we can aim to figure out by just working in kernel-space.

What we propose is a simple, best-effort and generic interface to feed
some of this information to kernel-space, thus supporting and
integrating already available policies and mechanisms.

> > This posting is an expression of the feedbacks collected so far and
> > the main goal for us are:
> > 1) validate once more the soundness of a scheduler-driven run-time
> >    power-performance control which is based on information collected
> >    from informed run-time
> > 2) get an agreement on whether the current interface can be considered
> >    sufficiently "mainline friendly" to have a chance to get merged
> > 3) rework/refactor what is required if point 2 is not (yet) satisfied
> 
> My definition of "mainline friendly" may be different from a someone
> else's one, but I usually want to know two things:
>  1. What problem exactly is at hand.

Feed "context aware" information about task requirements from
"informed run-times" to kernel-space, to integrate with and improve
existing decision policies for OPP selection and task placement.


>  2. What alternative ways of addressing it have been considered and

We initially considered and evaluated what could be achieved by
just using existing APIs.
For example, we considered different combinations of:

- tuning task affinity: which sounds too much like scheduling from
  user-space and has no bias on OPP selection.

- tuning task priorities: which is a concept mainly devoted to
  partitioning the available bandwidth among the RUNNABLE tasks within
  the same CPU.

- tuning the 'cpusets' and/or 'cpu' controllers: which can be used to
  bias task placement, but still sounds like scheduling from
  user-space and is missing the bias on OPP selection.

None of these interfaces was completely satisfying, mainly because
using them this way would abuse them for a scope they were not
designed for.

Since the main goals are to bias OPP selection and task placement
based on application context, what we _initially_ identified was a
new CGroup-based interface to tag tasks with a "boost" value.
That proposal [1] has been considered not suitable for a proper
kernel integration and thus, discussing with PeterZ, Tejun and PaulT,
we identified a different proposal [2], which is what this series
implements.


> why the particular one proposed has been chosen over the other ones.

The current proposal has been chosen because:

1) it satisfies the main goal of having a simple interface which
   allows an "informed run-time" (like Android, but not limited to it)
   to feed "context aware" information related to user-space
   applications.

2) it allows this information to be used to bias existing policies for
   both "OPP selection" (presented in this series) as well as "task
   placement" (as an extension on top of this series).

3) it extends the existing CPU controller, which is already devoted to
   controlling the available CPU bandwidth, thus allowing for a
   consistent view on how this resource is allocated to tasks.

4) it does not enforce by default any new/different behaviors (for
   example on OPP selection) but just opens possibilities for finer
   tuning whenever necessary.

5) it has almost negligible run-time overhead, mainly defined by the
   complexity of a couple of RBTree operations per task
   wakeup/suspend.

> At the moment I don't feel like I have enough information in both aspects.

Hope the previous points cast some light on both aspects.

> For example, if you said "Android wants to do XYZ because of ABC and
> that's how we want to make that possible, and it also could be done in
> the other GHJ ways, but they are not attractive and here's why etc"
> that would help quite a bit from my POV.

The main issue with the other solutions we evaluated so far is that
they are missing a clean and simple interface to express "context
awareness" at the task-group level.

CGroups is the Linux framework devoted to the collection and tracking
of task groups' properties. What we propose leverages this concept by
extending it just as much as required to support the dual goal of
biasing "OPP selection" and "task placement", without really
requiring to re-implement these concepts in user-space.

Do you see other possible solutions?

> > It's worth to notice that these bits are completely independent from
> > EAS. OPP biasing (i.e. capping/boosting) is a feature which stand by
> > itself and it can be quite useful in many different scenarios where
> > EAS is not used at all. A simple example is making schedutil to behave
> > concurrently like the powersave governor for certain tasks and the
> > performance governor for other tasks.
> 
> That's fine in theory, but honestly an interface like this will be a
> maintenance burden and adding it just because it may be useful to
> somebody sounds not serious enough.

Actually, it is already useful to "someone". Google is using something
similar on Pixel devices, and in the future it will likely be adopted
by other smartphones.

Here we are just trying to push it mainline to make it available to
all the other potential clients I've described before as well.

> IOW, I'd like to be able to say "This is going to be used by user
> space X to do A and that's how etc" is somebody asks me about that
> which honestly I can't at this point.

In that case, again I think we have a strong case for "this is going
to be used by".

> > As a final remark, this series is going to be a discussion topic in
> > the upcoming OSPM summit [3]. It would be nice if we can get there
> > with a sufficient knowledge of the main goals and the current status.
> 
> I'm not sure what you mean here, sorry.

Just that I like this discussion and I would like to get some sort of
initial agreement at least on basic concepts, requirements and
use-cases before OSPM.

That would allow us to be more active on the technical-details side
during the summit and, hopefully, come to the definition of a roadmap
detailing the steps required to get a suitable interface merged,
whether it is the one proposed by this series or another achieving the

> > However, please let's keep discussing here about all the possible
> > concerns which can be raised about this proposal.
> 
> OK
> 
> Thanks,
> Rafael

[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-16 11:16                 ` Juri Lelli
@ 2017-03-16 12:27                   ` Patrick Bellasi
  2017-03-16 12:44                     ` Juri Lelli
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-16 12:27 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Joel Fernandes, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On 16-Mar 11:16, Juri Lelli wrote:
> On 15/03/17 16:40, Joel Fernandes wrote:
> > On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > [..]
> > >
> > >> > However, trying to quickly summarize how that would work (for who is
> > >> > already somewhat familiar with reclaiming bits):
> > >> >
> > >> >  - a task utilization contribution is accounted for (at rq level) as
> > >> >    soon as it wakes up for the first time in a new period
> > >> >  - its contribution is then removed after the 0lag time (or when the
> > >> >    task gets throttled)
> > >> >  - frequency transitions are triggered accordingly
> > >> >
> > >> > So, I don't see why triggering a go down request after the 0lag time
> > >> > expired and quickly reacting to tasks waking up would have create
> > >> > problems in your case?
> > >>
> > >> In my experience, the 'reacting to tasks' bit doesn't work very well.
> > >
> > > Humm.. but in this case we won't be 'reacting', we will be
> > > 'anticipating' tasks' needs, right?
> > 
> > Are you saying we will start ramping frequency before the next
> > activation so that we're ready for it?
> > 
> 
> I'm saying that there is no need to ramp, simply select the frequency
> that is needed for a task (or a set of them).
> 
> > If not, it sounds like it will only make the frequency request on the
> > next activation when the Active bandwidth increases due to the task
> > waking up. By then task has already started to run, right?
> > 
> 
> When the task is enqueued back we select the frequency considering its
> bandwidth request (and the bandwidth/utilization of the others). So,
> when it actually starts running it will already have enough capacity to
> finish in time.

Here we are factoring out the time required to actually switch to the
required OPP. I think Joel was referring to this time.

That time cannot really be eliminated other than by having faster
OPP-switching HW support. Still, jumping straight to the "optimal"
OPP instead of ramping up is a big improvement.


-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-16 12:27                   ` Patrick Bellasi
@ 2017-03-16 12:44                     ` Juri Lelli
  2017-03-16 16:58                       ` Joel Fernandes
  0 siblings, 1 reply; 66+ messages in thread
From: Juri Lelli @ 2017-03-16 12:44 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Joel Fernandes, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On 16/03/17 12:27, Patrick Bellasi wrote:
> On 16-Mar 11:16, Juri Lelli wrote:
> > On 15/03/17 16:40, Joel Fernandes wrote:
> > > On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > > [..]
> > > >
> > > >> > However, trying to quickly summarize how that would work (for who is
> > > >> > already somewhat familiar with reclaiming bits):
> > > >> >
> > > >> >  - a task utilization contribution is accounted for (at rq level) as
> > > >> >    soon as it wakes up for the first time in a new period
> > > >> >  - its contribution is then removed after the 0lag time (or when the
> > > >> >    task gets throttled)
> > > >> >  - frequency transitions are triggered accordingly
> > > >> >
> > > >> > So, I don't see why triggering a go down request after the 0lag time
> > > >> > expired and quickly reacting to tasks waking up would have create
> > > >> > problems in your case?
> > > >>
> > > >> In my experience, the 'reacting to tasks' bit doesn't work very well.
> > > >
> > > > Humm.. but in this case we won't be 'reacting', we will be
> > > > 'anticipating' tasks' needs, right?
> > > 
> > > Are you saying we will start ramping frequency before the next
> > > activation so that we're ready for it?
> > > 
> > 
> > I'm saying that there is no need to ramp, simply select the frequency
> > that is needed for a task (or a set of them).
> > 
> > > If not, it sounds like it will only make the frequency request on the
> > > next activation when the Active bandwidth increases due to the task
> > > waking up. By then task has already started to run, right?
> > > 
> > 
> > When the task is enqueued back we select the frequency considering its
> > bandwidth request (and the bandwidth/utilization of the others). So,
> > when it actually starts running it will already have enough capacity to
> > finish in time.
> 
> Here we are factoring out the time required to actually switch to the
> required OPP. I think Joel was referring to this time.
> 

Right. But, this is an HW limitation. It seems a problem that every
scheduler driven decision will have to take into account. So, doesn't
make more sense to let the driver (or the governor shim layer) introduce
some sort of hysteresis to frequency changes if needed?

> That time cannot really be eliminated other than by having faster
> OPP-switching HW support. Still, jumping straight to the "optimal"
> OPP instead of ramping up is a big improvement.
> 
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi


* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-16 12:44                     ` Juri Lelli
@ 2017-03-16 16:58                       ` Joel Fernandes
  2017-03-16 17:17                         ` Juri Lelli
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Fernandes @ 2017-03-16 16:58 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, Linux PM, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On Thu, Mar 16, 2017 at 5:44 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> On 16/03/17 12:27, Patrick Bellasi wrote:
>> On 16-Mar 11:16, Juri Lelli wrote:
>> > On 15/03/17 16:40, Joel Fernandes wrote:
>> > > On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
>> > > [..]
>> > > >
>> > > >> > However, trying to quickly summarize how that would work (for who is
>> > > >> > already somewhat familiar with reclaiming bits):
>> > > >> >
>> > > >> >  - a task utilization contribution is accounted for (at rq level) as
>> > > >> >    soon as it wakes up for the first time in a new period
>> > > >> >  - its contribution is then removed after the 0lag time (or when the
>> > > >> >    task gets throttled)
>> > > >> >  - frequency transitions are triggered accordingly
>> > > >> >
>> > > >> > So, I don't see why triggering a go down request after the 0lag time
>> > > >> > expired and quickly reacting to tasks waking up would have create
>> > > >> > problems in your case?
>> > > >>
>> > > >> In my experience, the 'reacting to tasks' bit doesn't work very well.
>> > > >
>> > > > Humm.. but in this case we won't be 'reacting', we will be
>> > > > 'anticipating' tasks' needs, right?
>> > >
>> > > Are you saying we will start ramping frequency before the next
>> > > activation so that we're ready for it?
>> > >
>> >
>> > I'm saying that there is no need to ramp, simply select the frequency
>> > that is needed for a task (or a set of them).
>> >
>> > > If not, it sounds like it will only make the frequency request on the
>> > > next activation when the Active bandwidth increases due to the task
>> > > waking up. By then task has already started to run, right?
>> > >
>> >
>> > When the task is enqueued back we select the frequency considering its
>> > bandwidth request (and the bandwidth/utilization of the others). So,
>> > when it actually starts running it will already have enough capacity to
>> > finish in time.
>>
>> Here we are factoring out the time required to actually switch to the
>> required OPP. I think Joel was referring to this time.
>>

Yes, that's what I meant.

>
> Right. But, this is an HW limitation. It seems a problem that every
> scheduler driven decision will have to take into account. So, doesn't
> make more sense to let the driver (or the governor shim layer) introduce
> some sort of hysteresis to frequency changes if needed?

The problem, IMO, which hysteresis in the governor will not help with
is this: what if you had a DL task that does not wake up for several
periods and then wakes up? For that wake up, we would still be subject
to the HW limitation of the time taken to switch to the needed OPP. Right?

>> That time cannot really be eliminated other than by having faster
>> OPP-switching HW support. Still, jumping straight to the "optimal"
>> OPP instead of ramping up is a big improvement.

Yes I think so.

Thanks,
Joel


* Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
  2017-03-16 16:58                       ` Joel Fernandes
@ 2017-03-16 17:17                         ` Juri Lelli
  0 siblings, 0 replies; 66+ messages in thread
From: Juri Lelli @ 2017-03-16 17:17 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, Linux PM, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Andres Oportus

On 16/03/17 09:58, Joel Fernandes wrote:
> On Thu, Mar 16, 2017 at 5:44 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > On 16/03/17 12:27, Patrick Bellasi wrote:
> >> On 16-Mar 11:16, Juri Lelli wrote:
> >> > On 15/03/17 16:40, Joel Fernandes wrote:
> >> > > On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> >> > > [..]
> >> > > >
> >> > > >> > However, trying to quickly summarize how that would work (for who is
> >> > > >> > already somewhat familiar with reclaiming bits):
> >> > > >> >
> >> > > >> >  - a task utilization contribution is accounted for (at rq level) as
> >> > > >> >    soon as it wakes up for the first time in a new period
> >> > > >> >  - its contribution is then removed after the 0lag time (or when the
> >> > > >> >    task gets throttled)
> >> > > >> >  - frequency transitions are triggered accordingly
> >> > > >> >
> >> > > >> > So, I don't see why triggering a go down request after the 0lag time
> >> > > >> > expired and quickly reacting to tasks waking up would have create
> >> > > >> > problems in your case?
> >> > > >>
> >> > > >> In my experience, the 'reacting to tasks' bit doesn't work very well.
> >> > > >
> >> > > > Humm.. but in this case we won't be 'reacting', we will be
> >> > > > 'anticipating' tasks' needs, right?
> >> > >
> >> > > Are you saying we will start ramping frequency before the next
> >> > > activation so that we're ready for it?
> >> > >
> >> >
> >> > I'm saying that there is no need to ramp, simply select the frequency
> >> > that is needed for a task (or a set of them).
> >> >
> >> > > If not, it sounds like it will only make the frequency request on the
> >> > > next activation when the Active bandwidth increases due to the task
> >> > > waking up. By then task has already started to run, right?
> >> > >
> >> >
> >> > When the task is enqueued back we select the frequency considering its
> >> > bandwidth request (and the bandwidth/utilization of the others). So,
> >> > when it actually starts running it will already have enough capacity to
> >> > finish in time.
> >>
> >> Here we are factoring out the time required to actually switch to the
> >> required OPP. I think Joel was referring to this time.
> >>
> 
> Yes, that's what I meant.
> 
> >
> > Right. But, this is an HW limitation. It seems a problem that every
> > scheduler driven decision will have to take into account. So, doesn't
> > make more sense to let the driver (or the governor shim layer) introduce
> > some sort of hysteresis to frequency changes if needed?
> 
> The problem IMO which Hysterisis in the governor will not help is what
> if you had a DL task that is not waking up for several periods and
> then wakes up, then for that wake up, we would still be subject to the
> HW limitation of time taken to switch to needed OPP. Right?
> 

True, but in this case the problem is that you cannot really predict the
future anyway. So, if your HW is so slow to react that it always causes
latency problems then I guess you'll be forced to statically raise your
min_freq value to cope with that HW limitation, independently from
scheduling policies/heuristics?

OTOH, hysteresis, when properly tuned, should cover the 'normal' cases.
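The governor-side hysteresis mentioned above could look something like
this toy model (names and numbers are illustrative, this is not
schedutil code): frequency increases apply immediately, while
decreases are deferred until the lower demand has been sustained for a
hold time.

```python
class FreqHysteresis:
    """Toy model of governor-side hysteresis: ramp up immediately,
    ramp down only after `hold_ns` of sustained lower demand."""

    def __init__(self, hold_ns):
        self.hold_ns = hold_ns
        self.cur_freq = 0
        self.lower_since = None  # timestamp of first lower request

    def update(self, now_ns, requested_freq):
        if requested_freq >= self.cur_freq:
            # Going up (or staying): apply immediately.
            self.cur_freq = requested_freq
            self.lower_since = None
        else:
            # Going down: wait until the lower demand persisted long enough.
            if self.lower_since is None:
                self.lower_since = now_ns
            if now_ns - self.lower_since >= self.hold_ns:
                self.cur_freq = requested_freq
                self.lower_since = None
        return self.cur_freq

h = FreqHysteresis(hold_ns=10_000_000)  # 10ms hold, illustrative
print(h.update(0, 1_000_000))           # ramps up immediately -> 1000000
print(h.update(1_000_000, 500_000))     # drop requested, held  -> 1000000
print(h.update(12_000_000, 500_000))    # hold expired          -> 500000
```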

> >> That time cannot really be eliminated but from having faster OOP
> >> swiching HW support. Still, jumping strating to the "optimal" OPP
> >> instead of rumping up is a big improvement.
> 
> Yes I think so.
> 
> Thanks,
> Joel


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
                   ` (5 preceding siblings ...)
  2017-03-15 11:41 ` [RFC v3 0/5] Add capacity capping support to the CPU controller Rafael J. Wysocki
@ 2017-03-20 14:51 ` Tejun Heo
  2017-03-20 17:22   ` Patrick Bellasi
  6 siblings, 1 reply; 66+ messages in thread
From: Tejun Heo @ 2017-03-20 14:51 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Morten Rasmussen, Dietmar Eggemann

Hello, Patrick.

On Tue, Feb 28, 2017 at 02:38:37PM +0000, Patrick Bellasi wrote:
>  a) Boosting of important tasks, by enforcing a minimum capacity in the
>     CPUs where they are enqueued for execution.
>  b) Capping of background tasks, by enforcing a maximum capacity.
>  c) Containment of OPPs for RT tasks which cannot easily be switched to
>     the usage of the DL class, but still don't need to run at the maximum
>     frequency.

As this is something completely new, I think it'd be a great idea to
give a couple concerete examples in the head message to help people
understand what it's for.

Thanks.

-- 
tejun


* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-02-28 14:38 ` [RFC v3 1/5] sched/core: add capacity constraints to " Patrick Bellasi
  2017-03-13 10:46   ` Joel Fernandes (Google)
@ 2017-03-20 17:15   ` Tejun Heo
  2017-03-20 17:36     ` Tejun Heo
  2017-03-20 18:08     ` Patrick Bellasi
  1 sibling, 2 replies; 66+ messages in thread
From: Tejun Heo @ 2017-03-20 17:15 UTC (permalink / raw)
  To: Patrick Bellasi; +Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra

Hello,

On Tue, Feb 28, 2017 at 02:38:38PM +0000, Patrick Bellasi wrote:
> This patch extends the CPU controller by adding a couple of new
> attributes, capacity_min and capacity_max, which can be used to enforce
> bandwidth boosting and capping. More specifically:
> 
> - capacity_min: defines the minimum capacity which should be granted
>                 (by schedutil) when a task in this group is running,
>                 i.e. the task will run at least at that capacity
> 
> - capacity_max: defines the maximum capacity which can be granted
>                 (by schedutil) when a task in this group is running,
>                 i.e. the task can run up to that capacity

cpu.capacity.min and cpu.capacity.max are the more conventional names.
I'm not sure about the name capacity as it doesn't encode what it does
and is difficult to tell apart from cpu bandwidth limits.  I think
it'd be better to represent what it controls more explicitly.

> These attributes:
> a) are tunable at all hierarchy levels, i.e. root group too

This usually is problematic because there should be a non-cgroup way
of configuring the feature in case cgroup isn't configured or used,
and it becomes awkward to have two separate mechanisms configuring the
same thing.  Maybe the feature is cgroup specific enough that it makes
sense here but this needs more explanation / justification.

> b) allow to create subgroups of tasks which are not violating the
>    capacity constraints defined by the parent group.
>    Thus, tasks on a subgroup can only be more boosted and/or more

For both limits and protections, the parent caps the maximum the
children can get.  At least that's what memcg does for memory.low.
Doing that makes sense for memcg because for memory the parent can
still do protections regardless of what its children are doing and it
makes delegation safe by default.

I understand why you would want a property like capacity to be the
other direction as that way you get more specific as you walk down the
tree for both limits and protections; however, I think we need to
think a bit more about it and ensure that the resulting interface
isn't confusing.  Would it work for capacity to behave the other
direction - ie. a parent's min restricting the highest min that its
descendants can get?  It's completely fine if that's weird.
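The two directions can be contrasted with a small sketch (purely
illustrative, not the semantics actually implemented by the series):

```python
def min_as_floor(parent_min, child_min):
    """One direction: a child's request can only raise the floor,
    getting more specific (more boosted) as you walk down the tree."""
    return max(parent_min, child_min)

def min_as_ceiling(parent_min, child_min):
    """memcg-like direction: the parent's min caps the highest min
    (protection) any of its descendants can obtain."""
    return min(parent_min, child_min)

print(min_as_floor(512, 768))    # 768: the child gets more boost
print(min_as_ceiling(512, 768))  # 512: the parent caps the protection
```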

Thanks.

-- 
tejun


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-20 14:51 ` Tejun Heo
@ 2017-03-20 17:22   ` Patrick Bellasi
  2017-04-10  7:36     ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-20 17:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Morten Rasmussen, Dietmar Eggemann

On 20-Mar 10:51, Tejun Heo wrote:
> Hello, Patrick.

Hi Tejun,

> On Tue, Feb 28, 2017 at 02:38:37PM +0000, Patrick Bellasi wrote:
> >  a) Boosting of important tasks, by enforcing a minimum capacity in the
> >     CPUs where they are enqueued for execution.
> >  b) Capping of background tasks, by enforcing a maximum capacity.
> >  c) Containment of OPPs for RT tasks which cannot easily be switched to
> >     the usage of the DL class, but still don't need to run at the maximum
> >     frequency.
> 
> As this is something completely new, I think it'd be a great idea to
> give a couple concerete examples in the head message to help people
> understand what it's for.

Right, Rafael also asked for a similar better explanation, specifically:
 1. What problem exactly is at hand
 2. What alternative ways of addressing it have been considered
 3. Why the particular one proposed has been chosen over the other ones

I've addressed all these points in one of my previous response in this
thread, you can find here:
   https://lkml.org/lkml/2017/3/16/300

Hereafter are some other (hopefully useful) examples.


A) Boosting of important tasks
==============================

The Android GFX rendering pipeline is composed of a set of tasks which
are relatively small; let's say they run for a few [ms] every 16 [ms].
The overall utilization they generate on the CPU where they are
running is usually below 40-50%.

These tasks are per-application, meaning that every application has
its own set of tasks which constitute the rendering pipeline.

At any moment, there is usually only one application which is the
main one impacting user experience: the one in front of the user's
screen.

Given such an example scenario, currently:

1) the CPUFreq governor selects the OPP based on the actual CPU demand
   of the workload. This is a policy which aims at reducing the power
   consumption while still meeting task requirements.
   In this scenario it would pick a mid-range frequency.

   However, for certain important tasks, such as those part of the GFX
   pipeline of the current application, it can still be beneficial to
   complete them faster than what would normally happen.
   IOW: it is acceptable to trade off energy consumption for a better
   reactivity of the system.

2) scheduler signals are used to drive some OPP selection, e.g. PELT
   for CFS tasks.

   However, these signals are usually subject to a dynamic which can
   be relatively slow in building up the information required to
   select the proper frequency.
   This can impact the performance of important tasks, at least
   during their initial activation.

The proposed patch allows setting a minimum capacity for a group of
tasks which has to be (possibly) granted by the system while these
tasks are RUNNABLE.

Ultimately, this allows "informed run-times" to inform core kernel
components, like the scheduler and CPUFreq, about task requirements.
This information can be used to:

a) Bias OPP selection.
   Thus granting that certain critical tasks always run at least at a
   specified frequency.

b) Bias task placement, which requires an additional extension not
   yet posted, to keep things simple.
   This allows heterogeneous systems, where different CPUs have
   different capacities, to schedule important tasks on more capable
   CPUs.
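The OPP-selection bias in (a) can be sketched as a clamp applied on
the estimated CPU utilization before it is translated into a frequency
request (function and parameter names are illustrative, not the actual
code of the series):

```python
def clamped_capacity(cpu_util, cap_min, cap_max, scale=1024):
    """Clamp the estimated CPU utilization into the [cap_min, cap_max]
    range aggregated from the RUNNABLE tasks' groups; the result is
    what the governor would translate into a frequency request.
    Capacities are expressed on the usual 0..1024 scale."""
    return max(cap_min, min(cpu_util, cap_max))

# A mostly idle CPU (util 200/1024) running a boosted task with
# capacity_min=512 is still requested at least half of the maximum
# capacity, i.e. a higher OPP than its raw utilization suggests:
print(clamped_capacity(200, cap_min=512, cap_max=1024))  # 512
```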


Another interesting example of tasks which can benefit from this
boosting interface are GPU computation workloads. These workloads
usually have a CPU-side control thread, which in general generates
quite a small utilization. That small utilization leads to the
selection of a lower frequency on the CPU side.

However, on certain systems a reduced frequency on the CPU side also
affects the performance of the GPU-side computation.
In these cases it can be definitely beneficial to force-run these
small tasks at a higher frequency to optimize the performance of
off-loaded computations. The proposed interface allows bumping the
frequencies only when these tasks are RUNNABLE, without requiring to
set a minimum system-wide frequency constraint.



B) Capping of background tasks
==============================

In the same Android systems, when an application is not in the
foreground, we can be interested in limiting the CPU resources it
consumes.

The throttling mechanism provided by the CPU bandwidth controller is a
possible solution, which enforces bandwidth by throttling the tasks
within a configured period.
However, for certain use-cases it can be preferable to:
- never suspend tasks, but instead just keep running them at a lower
  frequency.
- keep running these tasks at higher frequencies when they happen to
  be co-scheduled with tasks without capacity limitations.

Throttling can also be a non-optimal solution for workloads
which have very small periods (e.g. 16ms), in which case:

a) using a longer cfs_period_us will produce long suspensions of the
   tasks, which can thus experience inconsistent behaviors.

b) using a smaller cfs_period_us will increase the control overhead.


C) Containment of OPPs for RT tasks
===================================

This point is conceptually similar to the previous one, but it
focuses mainly on RT tasks, to improve how these tasks are currently
managed by the schedutil governor.

The current schedutil implementation enforces the selection of the
maximum OPP every time an RT task is RUNNABLE. Such a policy can be
overkill, especially for some mobile/embedded use cases, as I describe
in more detail in this other thread, where experimental results are
also reported:
   https://lkml.org/lkml/2017/3/17/214

The proposed solution is generic enough to naturally solve these kinds
of corner cases as well, thus improving the overall Linux kernel
offering in terms of the "application specific" tunings which are
possible when "informed run-times" are available in user-space.


> Thanks.
> 
> -- 
> tejun

Hope this can help in casting some more light in the overall goal for
this proposal.


-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-20 17:15   ` Tejun Heo
@ 2017-03-20 17:36     ` Tejun Heo
  2017-03-20 18:08     ` Patrick Bellasi
  1 sibling, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2017-03-20 17:36 UTC (permalink / raw)
  To: Patrick Bellasi; +Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra

On Mon, Mar 20, 2017 at 01:15:11PM -0400, Tejun Heo wrote:
> > a) are tunable at all hierarchy levels, i.e. root group too
> 
> This usually is problematic because there should be a non-cgroup way
> of configuring the feature in case cgroup isn't configured or used,
> and it becomes awkward to have two separate mechanisms configuring the
> same thing.  Maybe the feature is cgroup specific enough that it makes
> sense here but this needs more explanation / justification.

A related issue here is that what the non-cgroup interface and its
interaction with cgroup should be.  In the long term, I think it's
better to have a generic non-cgroup interface for these new features,
and we've gotten it wrong, or at least inconsistent, across different
settings - most don't affect API accessible settings and just confine
the configuration requested by the application inside the cgroup
constraints; however, cpuset does it the other way and overwrites
configurations set by individual applications.

If we agree that exposing this only through cgroup is fine, this isn't
a concern, but, given that this is a thread property and can obviously
be useful outside cgroups, that seems debatable at the very least.

Thanks.

-- 
tejun


* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-20 17:15   ` Tejun Heo
  2017-03-20 17:36     ` Tejun Heo
@ 2017-03-20 18:08     ` Patrick Bellasi
  2017-03-23  0:28       ` Joel Fernandes (Google)
  2017-03-30 21:15       ` Paul Turner
  1 sibling, 2 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-20 18:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra

On 20-Mar 13:15, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 28, 2017 at 02:38:38PM +0000, Patrick Bellasi wrote:
> > This patch extends the CPU controller by adding a couple of new
> > attributes, capacity_min and capacity_max, which can be used to enforce
> > bandwidth boosting and capping. More specifically:
> > 
> > - capacity_min: defines the minimum capacity which should be granted
> >                 (by schedutil) when a task in this group is running,
> >                 i.e. the task will run at least at that capacity
> > 
> > - capacity_max: defines the maximum capacity which can be granted
> >                 (by schedutil) when a task in this group is running,
> >                 i.e. the task can run up to that capacity
> 
> cpu.capacity.min and cpu.capacity.max are the more conventional names.

Ok, should be an easy renaming.

> I'm not sure about the name capacity as it doesn't encode what it does
> and is difficult to tell apart from cpu bandwidth limits.  I think
> it'd be better to represent what it controls more explicitly.

In scheduler jargon, capacity represents the amount of computation
that a CPU can provide, and it's usually defined to be 1024 for the
biggest CPU (on non-SMP systems) running at the highest OPP (i.e.
maximum frequency).

It's true that it kind of overlaps with the concept of "bandwidth".
However, the main difference here is that "bandwidth" is not frequency
(and architecture) scaled.
Thus, for example, assuming we have only one CPU with these two OPPs:

   OPP | Frequency | Capacity
     1 |    500MHz |      512
     2 |      1GHz |     1024

a task running 60% of the time on that CPU while it is configured at
500MHz is, from the bandwidth standpoint, using 60% of the bandwidth
but, from the capacity standpoint, only 30% of the available capacity.

IOW, bandwidth is purely time-based, while capacity factors in both
frequency and architectural differences.
Thus, while a "bandwidth" constraint limits the amount of time a task
can use a CPU, independently of the "actual computation" performed,
the new "capacity" constraints let us enforce how much "actual
computation" a task can perform per unit of time.
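For illustration, the bandwidth-vs-capacity arithmetic above can be
sketched as follows (a toy model only: MAX_CAPACITY follows the 1024
convention, and the OPP capacities come from the table above):

```python
# Sketch of the bandwidth-vs-capacity arithmetic. MAX_CAPACITY follows
# the 1024 convention; the OPP capacities come from the table above.

MAX_CAPACITY = 1024  # biggest CPU at its highest OPP

def bandwidth_pct(runtime, period):
    """Bandwidth is purely temporal: fraction of time spent running."""
    return 100.0 * runtime / period

def capacity_pct(runtime, period, opp_capacity):
    """Capacity additionally scales by the OPP the CPU ran at."""
    return bandwidth_pct(runtime, period) * opp_capacity / MAX_CAPACITY

# A task running 60% of the time while the CPU sits at the 500MHz OPP:
print(bandwidth_pct(6, 10))      # 60.0 -> 60% bandwidth
print(capacity_pct(6, 10, 512))  # 30.0 -> 30% of the available capacity
```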

> > These attributes:
> > a) are tunable at all hierarchy levels, i.e. root group too
> 
> This usually is problematic because there should be a non-cgroup way
> of configuring the feature in case cgroup isn't configured or used,
> and it becomes awkward to have two separate mechanisms configuring the
> same thing.  Maybe the feature is cgroup specific enough that it makes
> sense here but this needs more explanation / justification.

In the previous proposal I used to expose global tunables under
procfs, e.g.:

 /proc/sys/kernel/sched_capacity_min
 /proc/sys/kernel/sched_capacity_max

which can be used to define tunable root constraints when CGroups are
not available, and which become read-only when CGroups are in use.

Can this be eventually an acceptable option?

In any case I think that this feature will be mainly targeting CGroup
based systems. Indeed, one of the main goals is to collect
"application specific" information from "informed run-times". Being
"application specific" means that we need a way to classify
applications depending on the runtime context... and that capability
in Linux is ultimately provided via the CGroup interface.

> > b) allow to create subgroups of tasks which are not violating the
> >    capacity constraints defined by the parent group.
> >    Thus, tasks on a subgroup can only be more boosted and/or more
> 
> For both limits and protections, the parent caps the maximum the
> children can get.  At least that's what memcg does for memory.low.
> Doing that makes sense for memcg because for memory the parent can
> still do protections regardless of what its children are doing and it
> makes delegation safe by default.

Just to be more clear, the current proposal enforces:

- capacity_max_child <= capacity_max_parent

  Since, if a task is constrained to get only up to a certain amount
  of capacity, then its children cannot use more than that... they
  can only be further constrained.

- capacity_min_child >= capacity_min_parent

  Since, if a task has been boosted to run at least that fast, then
  its children cannot be constrained to go slower without eventually
  impacting the parent's performance.
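A minimal sketch of these two delegation rules (the class and its
names are illustrative only, not the proposed kernel implementation):

```python
# Illustrative model of the two delegation rules above (not the kernel
# implementation): children may only be further capped (capacity_max)
# or further boosted (capacity_min) relative to their parent.

class Group:
    def __init__(self, cap_min, cap_max, parent=None):
        if not 0 <= cap_min <= cap_max <= 1024:
            raise ValueError("need 0 <= capacity_min <= capacity_max <= 1024")
        if parent is not None and cap_max > parent.cap_max:
            raise ValueError("capacity_max_child <= capacity_max_parent")
        if parent is not None and cap_min < parent.cap_min:
            raise ValueError("capacity_min_child >= capacity_min_parent")
        self.cap_min, self.cap_max, self.parent = cap_min, cap_max, parent

root = Group(cap_min=256, cap_max=768)
child = Group(cap_min=512, cap_max=640, parent=root)  # ok: more constrained
# Group(cap_min=128, cap_max=640, parent=root) would raise:
# the child would be less boosted than its parent allows.
```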

> I understand why you would want a property like capacity to be the
> other direction as that way you get more specific as you walk down the
> tree for both limits and protections;

Right, the protection schema is defined in such a way as to never
affect the parent's constraints.

> however, I think we need to
> think a bit more about it and ensure that the resulting interface
> isn't confusing.

Sure.

> Would it work for capacity to behave the other
> direction - ie. a parent's min restricting the highest min that its
> descendants can get?  It's completely fine if that's weird.

I thought about that possibility, but it did not convince me from the
use-case standpoint, at least for the ones I've considered.

The reason is that capacity_min is used to implement a concept of
"boosting" where, let's say, we want to "run a task faster than a
minimum frequency". Such a constraint is typically defined because we
know that this task, and likely all its descendant threads, needs at
least that capacity level to perform according to expectations.

In that case, "refining down the hierarchy" may require boosting some
threads further, but likely not less.

Does this make sense?

To me this seems to match quite well at least Android/ChromeOS
specific use-cases. I'm not sure if there can be other different
use-cases in the domain for example of managed containers.


> Thanks.
> 
> -- 
> tejun

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-16  3:15       ` Joel Fernandes
@ 2017-03-20 22:51         ` Rafael J. Wysocki
  2017-03-21 11:01           ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 22:51 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Rafael J. Wysocki, Patrick Bellasi, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux PM, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Rafael J . Wysocki, Paul Turner, Vincent Guittot,
	John Stultz, Todd Kjos, Tim Murray, Andres Oportus, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann

On Thu, Mar 16, 2017 at 4:15 AM, Joel Fernandes <joelaf@google.com> wrote:
> Hi Rafael,

Hi,

> On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
>>>> Do you have any practical examples of that, like for example what exactly
>>>> Android is going to use this for?
>>>
>>> In general, every "informed run-time" usually know quite a lot about
>>> tasks requirements and how they impact the user experience.
>>>
>>> In Android for example tasks are classified depending on their _current_
>>> role. We can distinguish for example between:
>>>
>>> - TOP_APP:    which are tasks currently affecting the UI, i.e. part of
>>>               the app currently in foreground
>>> - BACKGROUND: which are tasks not directly impacting the user
>>>               experience
>>>
>>> Given these information it could make sense to adopt different
>>> service/optimization policy for different tasks.
>>> For example, we can be interested in
>>> giving maximum responsiveness to TOP_APP tasks while we still want to
>>> be able to save as much energy as possible for the BACKGROUND tasks.
>>>
>>> That's where the proposal in this series (partially) comes on hand.
>>
>> A question: Does "responsiveness" translate directly to "capacity" somehow?
>>
>> Moreover, how exactly is "responsiveness" defined?
>
> Responsiveness is basically how quickly the UI is responding to user
> interaction after doing its computation, application-logic and
> rendering. Android apps have 2 important threads, the main thread (or
> UI thread) which does all the work and computation for the app, and a
> Render thread which does the rendering and submission of frames to
> display pipeline for further composition and display.
>
> We wish to bias towards performance than energy for this work since
> this front facing to the user and we don't care about much about
> energy for these tasks at this point, what's most critical is
> completion as quickly as possible so the user experience doesn't
> suffer from a performance issue that is noticeable.
>
> One metric to define this is "Jank" where we drop frames and aren't
> able to render on time. One of the reasons this can happen because the
> main thread (UI thread) took longer than expected for some
> computation. Whatever the interface - we'd just like to bias the
> scheduling and frequency guidance to be more concerned with
> performance and less with energy. And use this information for both
> frequency selection and task placement. 'What we need' is also app
> dependent since every app has its own main thread and is free to
> compute whatever it needs. So Android can't estimate this - but we do
> know that this app is user facing so in broad terms the interface is
> used to say please don't sacrifice performance for these top-apps -
> without accurately defining what these performance needs really are
> because we don't know it.
> For YouTube app for example, the complexity of the video decoding and
> the frame rate are very variable depending on the encoding scheme and
> the video being played. The flushing of the frames through the display
> pipeline is also variable (frame rate depends on the video being
> decoded), so this work is variable and we can't say for sure in
> definitive terms how much capacity we need.
>
> What we can do is with Patrick's work, we can take the worst case
> based on measurements and specify say we need atleast this much
> capacity regardless of what load-tracking thinks we need and then we
> can scale frequency accordingly. This is the usecase for the minimum
> capacity in his clamping patch. This is still not perfect in terms of
> defining something accurately because - we don't even know how much we
> need, but atleast in broad terms we have some way of telling the
> governor to maintain atleast X capacity.

First off, it all seems to depend a good deal on what your
expectations regarding the in-kernel performance scaling are.

You seem to be expecting it to decide whether or not to sacrifice some
performance for energy savings, but it can't do that really, simply
because it has no guidance on that.  It doesn't know how much
performance (or capacity) it can trade for a given amount of energy,
for example.

What it can do and what I expect it to be doing is to avoid
maintaining excess capacity (maintaining capacity is expensive in
general and a clear waste if the capacity is not actually used).

For instance, if you take the schedutil governor, it doesn't do
anything really fancy.  It just attempts to set a frequency sufficient
to run the given workload without slowing it down artificially, but
not much higher than that, and that's not based on any arcane
energy-vs-performance considerations.  It's based on an (arguably
vague) idea about how fast should be sufficient.

So if you want to say "please don't sacrifice performance for these
top-apps" to it, chances are it will not understand what you are
asking it for. :-)

It only may take the minimum capacity limit for a task as a correction
to its idea about how fast is sufficient in this particular case (and
energy doesn't even enter the picture at this point).  Now, of course,
its idea about what should be sufficient may be entirely incorrect for
some reason, but then the question really is: why?  And whether or not
it can be fixed without supplying corrections from user space in a
very direct way.

What you are saying generally indicates that you see under-provisioned
tasks and that's rather not because the kernel tries to sacrifice
performance for energy.  Maybe the CPU utilization is under-estimated
by schedutil or the scheduler doesn't give enough time to these
particular tasks for some reason.  In any case, having a way to set a
limit from user space may allow you to work around these issues quite
bluntly and is not a solution.  And even if the underlying problems
are solved, the user space interface will stay there and will have to
be maintained going forward.

Also when you set a minimum frequency limit from user space, you may
easily over-provision the task and that would defeat the purpose of
what the kernel tries to achieve.

> For the clamping of maximum capacity, there are usecases like
> background tasks like Patrick said, but also usecases where we don't
> want to run at max frequency even though load-tracking thinks that we
> need to. For example, there are case where for foreground camera
> tasks, where we want to provide sustainable performance without
> entering thermal throttling, so the capping will help there.

Fair enough.

To me, that case is more compelling than the previous one, but again
I'm not sure if the ability to set a specific capacity limit may fit
the bill entirely.  You need to know what limit to set in the first
place (and that may depend on multiple factors in principle) and then
you may need to adjust it over time and so on.

>>> What we propose is a "standard" interface to collect sensible
>>> information from "informed run-times" which can be used to:
>>>
>>> a) classify tasks according to the main optimization goals:
>>>    performance boosting vs energy saving
>>>
>>> b) support a more dynamic tuning of kernel side behaviors, mainly
>>>    OPPs selection and tasks placement
>>>
>>> Regarding this last point, this series specifically represents a
>>> proposal for the integration with schedutil. The main usages we are
>>> looking for in Android are:
>>>
>>> a) Boosting the OPP selected for certain critical tasks, with the goal
>>>    to speed-up their completion regardless of (potential) energy impacts.
>>>    A kind-of "race-to-idle" policy for certain tasks.
>>
>> It looks like this could be addressed by adding a "this task should
>> race to idle" flag too.
>
> But he said 'kind-of' race-to-idle. Racing to idle all the time for
> ex. at max frequency will be wasteful of energy so although we don't
> care about energy much for top-apps, we do care a bit.

You actually don't know whether or not it will be wasteful and there
may even be differences from workload to workload on the same system
in that respect.

>>
>>> b) Capping the OPP selection for certain non critical tasks, which is
>>>    a major concerns especially for RT tasks in mobile context, but
>>>    it also apply to FAIR tasks representing background activities.
>>
>> Well, is the information on how much CPU capacity assign to those
>> tasks really there in user space?  What's the source of it if so?
>
> I believe this is just a matter of tuning and modeling for what is
> needed. For ex. to prevent thermal throttling as I mentioned and also
> to ensure background activities aren't running at highest frequency
> and consuming excessive energy (since racing to idle at higher
> frequency is more expensive energy than running slower to idle since
> we run at higher voltages at higher frequency and the slow of the
> perf/W curve is steeper - p = c * V^2 * F. So the V component being
> higher just drains more power quadratic-ally which is of no use to
> background tasks - infact in some tests, we're just as happy with
> setting them at much lower frequencies than what load-tracking thinks
> is needed.

As I said, I actually can see a need to go lower than what performance
scaling thinks, because the way it tries to estimate the sufficient
capacity is by checking how much utilization is there for the
currently provided capacity and adjusting if necessary.  OTOH, there
are applications aggressive enough to be able to utilize *any*
capacity provided to them.
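The quoted P = C * V^2 * F argument can be made concrete with a small
sketch (all constants below are made up for illustration, and
static/leakage power, which a race-to-idle strategy helps amortize, is
ignored):

```python
# Sketch of the quoted P = C * V^2 * F argument. All numbers are made
# up for illustration; static/leakage power is ignored, which is the
# part a race-to-idle strategy would help amortize.

def energy_joules(c, volts, freq_hz, work_cycles):
    # E = P * t with P = C * V^2 * f and t = cycles / f, so the
    # frequency cancels and dynamic energy scales with V^2.
    power_w = c * volts ** 2 * freq_hz
    time_s = work_cycles / freq_hz
    return power_w * time_s

C = 1e-9      # switched capacitance (arbitrary)
WORK = 1e9    # cycles of background work to retire

fast = energy_joules(C, 1.10, 2.0e9, WORK)  # race to idle: 2GHz @ 1.10V
slow = energy_joules(C, 0.80, 1.0e9, WORK)  # run slower:   1GHz @ 0.80V

# Same work, but the high-voltage OPP costs (1.10/0.80)^2 ~ 1.9x more
# dynamic energy -- the reason capping background tasks can pay off.
assert fast > slow
```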

>>>> I gather that there is some experience with the current EAS implementation
>>>> there, so I wonder how this work is related to that.
>>>
>>> You right. We started developing a task boosting strategy a couple of
>>> years ago. The first implementation we did is what is currently in use
>>> by the EAS version in used on Pixel smartphones.
>>>
>>> Since the beginning our attitude has always been "mainline first".
>>> However, we found it extremely valuable to proof both interface's
>>> design and feature's benefits on real devices. That's why we keep
>>> backporting these bits on different Android kernels.
>>>
>>> Google, which primary representatives are in CC, is also quite focused
>>> on using mainline solutions for their current and future solutions.
>>> That's why, after the release of the Pixel devices end of last year,
>>> we refreshed and posted the proposal on LKML [1] and collected a first
>>> run of valuable feedbacks at LCP [2].
>>
>> Thanks for the info, but my question was more about how it was related
>> from the technical angle.  IOW, there surely is some experience
>> related to how user space can deal with energy problems and I would
>> expect that experience to be an important factor in designing a kernel
>> interface for that user space, so I wonder if any particular needs of
>> the Android user space are addressed here.
>>
>> I'm not intimately familiar with Android, so I guess I would like to
>> be educated somewhat on that. :-)
>
> Hope this sheds some light into the Android side of things a bit.

Yes, it does, thanks!

Best regards,
Rafael


* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-20 22:51         ` Rafael J. Wysocki
@ 2017-03-21 11:01           ` Patrick Bellasi
  2017-03-24 23:52             ` Rafael J. Wysocki
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-21 11:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Joel Fernandes, Rafael J. Wysocki, Linux Kernel Mailing List,
	Linux PM, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Paul Turner, Jonathan Corbet,
	Vincent Guittot, John Stultz, Todd Kjos, Tim Murray,
	Andres Oportus, Juri Lelli, Morten Rasmussen, Dietmar Eggemann

On 20-Mar 23:51, Rafael J. Wysocki wrote:
> On Thu, Mar 16, 2017 at 4:15 AM, Joel Fernandes <joelaf@google.com> wrote:
> > Hi Rafael,
> 
> Hi,
> 
> > On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
> >>>> Do you have any practical examples of that, like for example what exactly
> >>>> Android is going to use this for?
> >>>
> >>> In general, every "informed run-time" usually know quite a lot about
> >>> tasks requirements and how they impact the user experience.
> >>>
> >>> In Android for example tasks are classified depending on their _current_
> >>> role. We can distinguish for example between:
> >>>
> >>> - TOP_APP:    which are tasks currently affecting the UI, i.e. part of
> >>>               the app currently in foreground
> >>> - BACKGROUND: which are tasks not directly impacting the user
> >>>               experience
> >>>
> >>> Given these information it could make sense to adopt different
> >>> service/optimization policy for different tasks.
> >>> For example, we can be interested in
> >>> giving maximum responsiveness to TOP_APP tasks while we still want to
> >>> be able to save as much energy as possible for the BACKGROUND tasks.
> >>>
> >>> That's where the proposal in this series (partially) comes on hand.
> >>
> >> A question: Does "responsiveness" translate directly to "capacity" somehow?
> >>
> >> Moreover, how exactly is "responsiveness" defined?
> >
> > Responsiveness is basically how quickly the UI is responding to user
> > interaction after doing its computation, application-logic and
> > rendering. Android apps have 2 important threads, the main thread (or
> > UI thread) which does all the work and computation for the app, and a
> > Render thread which does the rendering and submission of frames to
> > display pipeline for further composition and display.
> >
> > We wish to bias towards performance than energy for this work since
> > this front facing to the user and we don't care about much about
> > energy for these tasks at this point, what's most critical is
> > completion as quickly as possible so the user experience doesn't
> > suffer from a performance issue that is noticeable.
> >
> > One metric to define this is "Jank" where we drop frames and aren't
> > able to render on time. One of the reasons this can happen because the
> > main thread (UI thread) took longer than expected for some
> > computation. Whatever the interface - we'd just like to bias the
> > scheduling and frequency guidance to be more concerned with
> > performance and less with energy. And use this information for both
> > frequency selection and task placement. 'What we need' is also app
> > dependent since every app has its own main thread and is free to
> > compute whatever it needs. So Android can't estimate this - but we do
> > know that this app is user facing so in broad terms the interface is
> > used to say please don't sacrifice performance for these top-apps -
> > without accurately defining what these performance needs really are
> > because we don't know it.
> > For YouTube app for example, the complexity of the video decoding and
> > the frame rate are very variable depending on the encoding scheme and
> > the video being played. The flushing of the frames through the display
> > pipeline is also variable (frame rate depends on the video being
> > decoded), so this work is variable and we can't say for sure in
> > definitive terms how much capacity we need.
> >
> > What we can do is with Patrick's work, we can take the worst case
> > based on measurements and specify say we need atleast this much
> > capacity regardless of what load-tracking thinks we need and then we
> > can scale frequency accordingly. This is the usecase for the minimum
> > capacity in his clamping patch. This is still not perfect in terms of
> > defining something accurately because - we don't even know how much we
> > need, but atleast in broad terms we have some way of telling the
> > governor to maintain atleast X capacity.
> 
> First off, it all seems to depend a good deal on what your
> expectations regarding the in-kernel performance scaling are.
> 
> You seem to be expecting it to decide whether or not to sacrifice some
> performance for energy savings, but it can't do that really, simply
> because it has no guidance on that.  It doesn't know how much
> performance (or capacity) it can trade for a given amount of energy,
> for example.

That's true, right now. But at ARM we have been working for a couple
of years to refine the concept of an energy model which improves the
scheduler's knowledge of the energy-vs-performance trade-off.

> What it can do and what I expect it to be doing is to avoid
> maintaining excess capacity (maintaining capacity is expensive in
> general and a clear waste if the capacity is not actually used).
> 
> For instance, if you take the schedutil governor, it doesn't do
> anything really fancy.  It just attempts to set a frequency sufficient
> to run the given workload without slowing it down artificially, but
> not much higher than that, and that's not based on any arcane
> energy-vs-performance considerations.  It's based on an (arguably
> vague) idea about how fast should be sufficient.
> 
> So if you want to say "please don't sacrifice performance for these
> top-apps" to it, chances are it will not understand what you are
> asking it for. :-)

Actually, this series provides the foundation bits of a more complete
solution, already in use on Pixel phones.

While this proposal focuses just on "OPP biasing", some additional
bits (not yet posted to keep things simple) exploit the Energy Model
information to provide support for "task placement biasing".

Those bits also address the question:

   how much energy do I want to sacrifice to get a certain speedup?

> It only may take the minimum capacity limit for a task as a correction
> to its idea about how fast is sufficient in this particular case (and
> energy doesn't even enter the picture at this point).  Now, of course,
> its idea about what should be sufficient may be entirely incorrect for
> some reason, but then the question really is: why?  And whether or not
> it can be fixed without supplying corrections from user space in a
> very direct way.

- Why is the estimation incorrect?

Because, looking at CFS tasks for example, PELT is a "running
estimator": its view of how much capacity a task needs changes
continuously over time. In short, it is missing an aggregation and
consolidation mechanism which would allow it to better exploit
information on a task's past activations.
We have a proposal to possibly fix that and we will post it soon.
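The "running estimator" behaviour can be illustrated with a toy
geometric-decay filter (a sketch only: the decay factor, the 1-ms
step, and the 1024 scale are illustrative, not the kernel's actual
PELT arithmetic):

```python
# Toy geometric-decay estimator to illustrate the "running estimator"
# point above. Not the kernel's PELT code: Y and the 1-ms step are
# illustrative (real PELT halves a contribution roughly every 32ms).

Y = 0.978  # per-millisecond decay factor

def update(util, running, ms=1):
    """Advance the estimate by `ms` milliseconds of running (or idling)."""
    for _ in range(ms):
        util = util * Y + (1024 if running else 0) * (1 - Y)
    return util

util = update(0.0, running=True, ms=100)    # a long busy burst: util rises
peak = util
util = update(util, running=False, ms=200)  # then a sleep: util decays

# After the sleep, the estimator has largely forgotten the burst, even
# though the task may need the same capacity on its next activation --
# hence the case for a user-space-supplied "safety margin".
assert util < peak / 4
```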

However, it can still be that for a certain task you want to add some
"safety margin" to accommodate possible workload variations.
That's required even if you have perfect knowledge of a task's
requirements, built entirely in kernel space from past activations.
If your task is that important, you don't want to give it "just
enough": you need to know how much more to give it, and this
information can come only from user space, where someone with more
information can use a properly defined API to feed it to the
scheduler through a per-task interface.

- Can it be fixed without corrections from user-space?

Not completely, more details hereafter.

> What you are saying generally indicates that you see under-provisioned
> tasks and that's rather not because the kernel tries to sacrifice
> performance for energy.  Maybe the CPU utilization is under-estimated
> by schedutil or the scheduler doesn't give enough time to these
> particular tasks for some reason.  In any case, having a way to set a
> limit from user space may allow you to work around these issues quite
> bluntly and is not a solution.  And even if the underlying problems
> are solved, the user space interface will stay there and will have to
> be maintained going forward.

I don't agree on that point, mainly because I don't see this as a
workaround. In your view it seems that everything can be solved
entirely in kernel space. In my view, instead, what we are after is a
properly defined interface where kernel space and user space can
potentially close a control loop in which:
a) user space, which has much more a-priori information about task
   requirements, can feed some constraints to kernel space;
b) kernel space, which has optimized and efficient mechanisms, enforces
   these constraints on a per-task basis.

After all, this is not a new concept in OS design; we already have
different interfaces which allow tuning scheduler behavior on a
per-task basis. What we are missing right now is a similar _per-task
interface_ to bias OPP selection, and a slightly improved/alternative
way to bias task placement _without_ making scheduling decisions in
user space.
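As a point of comparison, a few of those existing per-task bias knobs
as exposed through Python's os module (illustrative only; the sched_*
calls are Linux-specific):

```python
import os

# Existing per-task (per-process here) bias knobs, for comparison with
# the proposed per-task capacity interface. Illustrative only; the
# sched_* calls below are Linux-specific.

print(os.nice(0))                # nice value: biases the CFS load weight
print(os.sched_getscheduler(0))  # scheduling policy (0 == SCHED_OTHER)
print(os.sched_getaffinity(0))   # allowed-CPU mask for this process
```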

Here is a graphical representation of these concepts:

      +-------------+    +-------------+  +-------------+
      | App1 Tasks  ++   | App2 Tasks  ++ | App3 Tasks  ++
      |             ||   |             || |             ||
      +--------------|   +--------------| +--------------|
       +-------------+    +-------------+  +-------------+
                |               |              |
  +----------------------------------------------------------+
  |                                                          |
  |      +--------------------------------------------+      |
  |      |  +-------------------------------------+   |      |
  |      |  |      Run-Time Optimized Services    |   |      |
  |      |  |        (e.g. execution model)       |   |      |
  |      |  +-------------------------------------+   |      |
  |      |                                            |      |
  |      |     Informed Run-Time Resource Manager     |      |
  |      |   (Android, ChromeOS, Kubernetes, etc..)   |      |
  |      +------------------------------------------^-+      |
  |        |                                        |        |
  |        |Constraints                             |        |
  |        |(OPP and Task Placement biasing)        |        |
  |        |                                        |        |
  |        |                             Monitoring |        |
  |      +-v------------------------------------------+      |
  |      |               Linux Kernel                 |      |
  |      |         (Scheduler, schedutil, ...)        |      |
  |      +--------------------------------------------+      |
  |                                                          |
  | Closed control and optimization loop                     |
  +----------------------------------------------------------+

What is important to notice is that there is a middleware layer
between the kernel and the applications. This is a special kind of
user space to which it is still safe for the kernel to delegate some
"decisions".

> Also when you set a minimum frequency limit from user space, you may
> easily over-provision the task and that would defeat the purpose of
> what the kernel tries to achieve.

No, if an "informed user-space" wants to over-provision a task, it's
because it has already decided that it makes sense to limit the
kernel's energy optimization for that specific class of tasks.
It is not necessarily the kernel's business to know why; it is just
required to do its best within the provided constraints.

> > For the clamping of maximum capacity, there are usecases like
> > background tasks like Patrick said, but also usecases where we don't
> > want to run at max frequency even though load-tracking thinks that we
> > need to. For example, there are case where for foreground camera
> > tasks, where we want to provide sustainable performance without
> > entering thermal throttling, so the capping will help there.
> 
> Fair enough.
> 
> To me, that case is more compelling than the previous one, but again
> I'm not sure if the ability to set a specific capacity limit may fit
> the bill entirely.  You need to know what limit to set in the first
> place (and that may depend on multiple factors in principle) and then
> you may need to adjust it over time and so on.

Exactly. And again, the informed run-time knows which limits to set,
on which tasks, and when to change/update/tune them.

> >>> What we propose is a "standard" interface to collect sensible
> >>> information from "informed run-times" which can be used to:
> >>>
> >>> a) classify tasks according to the main optimization goals:
> >>>    performance boosting vs energy saving
> >>>
> >>> b) support a more dynamic tuning of kernel side behaviors, mainly
> >>>    OPPs selection and tasks placement
> >>>
> >>> Regarding this last point, this series specifically represents a
> >>> proposal for the integration with schedutil. The main usages we are
> >>> looking for in Android are:
> >>>
> >>> a) Boosting the OPP selected for certain critical tasks, with the goal
> >>>    to speed-up their completion regardless of (potential) energy impacts.
> >>>    A kind-of "race-to-idle" policy for certain tasks.
> >>
> >> It looks like this could be addressed by adding a "this task should
> >> race to idle" flag too.
> >
> > But he said 'kind-of' race-to-idle. Racing to idle all the time for
> > ex. at max frequency will be wasteful of energy so although we don't
> > care about energy much for top-apps, we do care a bit.
> 
> You actually don't know whether or not it will be wasteful and there
> may even be differences from workload to workload on the same system
> in that respect.

The workload dependencies are handled by the "informed run-time";
that's why what we are proposing is a per-task interface.
Moreover, notice that most of the optimizations can still be targeted
at services provided by the "informed run-time". Thus the dependencies
on the actual applications are somewhat limited, and they can still be
factored in through properly defined interfaces exposed by the
"informed run-time".

> >>> b) Capping the OPP selection for certain non critical tasks, which is
> >>>    a major concerns especially for RT tasks in mobile context, but
> >>>    it also apply to FAIR tasks representing background activities.
> >>
> >> Well, is the information on how much CPU capacity assign to those
> >> tasks really there in user space?  What's the source of it if so?
> >
> > I believe this is just a matter of tuning and modeling for what is
> > needed. For ex. to prevent thermal throttling as I mentioned, and also
> > to ensure background activities aren't running at the highest frequency
> > and consuming excessive energy (since racing to idle at a higher
> > frequency is more expensive in energy than running slower to idle: we
> > run at higher voltages at higher frequencies, and the slope of the
> > perf/W curve is steeper - p = c * V^2 * F. So the V component being
> > higher just drains more power, quadratically, which is of no use to
> > background tasks - in fact in some tests, we're just as happy with
> > setting them at much lower frequencies than what load-tracking thinks
> > is needed).
> 
> As I said, I actually can see a need to go lower than what performance
> scaling thinks, because the way it tries to estimate the sufficient
> capacity is by checking how much utilization is there for the
> currently provided capacity and adjusting if necessary.  OTOH, there
> are applications aggressive enough to be able to utilize *any*
> capacity provided to them.

Here you are not considering the control role exercised by the
middleware layer. Apps cannot really do whatever they want; they get
only what the "informed run-time" considers sufficient for them.

IOW, they live in a "managed user-space".
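As an aside, the power/energy point quoted above can be sketched numerically. The figures below (capacitance, voltages and frequencies alike) are made-up illustrative values, not measurements from any real SoC:

```python
# Dynamic power is P = C * V^2 * f.  Completing W cycles at frequency f
# takes W / f seconds, so the energy to complete a fixed amount of work
# is E = P * (W / f) = C * V^2 * W: it depends on V^2 alone.  Since
# higher OPPs require higher voltages, racing to idle at the top OPP
# costs more energy for the same amount of work.

def energy_to_complete(cycles, volts, capacitance=1e-9):
    """Energy (in joules) to retire `cycles` at supply voltage `volts`."""
    return capacitance * volts ** 2 * cycles

WORK = 1e9  # cycles of background work (made-up figure)

# Made-up OPP table entries: (frequency in Hz, required voltage in V)
low_opp = (0.5e9, 0.8)
high_opp = (2.0e9, 1.2)

e_low = energy_to_complete(WORK, low_opp[1])
e_high = energy_to_complete(WORK, high_opp[1])

# The high OPP finishes 4x sooner, but burns roughly (1.2 / 0.8)^2 =
# 2.25x the energy for exactly the same work - which is why capping the
# OPP of background tasks saves energy without any deadline to miss.
```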

> >>>> I gather that there is some experience with the current EAS implementation
> >>>> there, so I wonder how this work is related to that.
> >>>
> >>> You are right. We started developing a task boosting strategy a couple
> >>> of years ago. The first implementation we did is what is currently in
> >>> use by the EAS version running on Pixel smartphones.
> >>>
> >>> Since the beginning our attitude has always been "mainline first".
> >>> However, we found it extremely valuable to prove both the interface's
> >>> design and the features' benefits on real devices. That's why we keep
> >>> backporting these bits to different Android kernels.
> >>>
> >>> Google, whose primary representatives are in CC, is also quite focused
> >>> on using mainline solutions for their current and future products.
> >>> That's why, after the release of the Pixel devices at the end of last
> >>> year, we refreshed and posted the proposal on LKML [1] and collected a
> >>> first round of valuable feedback at LPC [2].
> >>
> >> Thanks for the info, but my question was more about how it was related
> >> from the technical angle.  IOW, there surely is some experience
> >> related to how user space can deal with energy problems and I would
> >> expect that experience to be an important factor in designing a kernel
> >> interface for that user space, so I wonder if any particular needs of
> >> the Android user space are addressed here.
> >>
> >> I'm not intimately familiar with Android, so I guess I would like to
> >> be educated somewhat on that. :-)
> >
> > Hope this sheds some light into the Android side of things a bit.
> 
> Yes, it does, thanks!

Interesting discussion, thanks! ;-)

> Best regards,
> Rafael

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-20 18:08     ` Patrick Bellasi
@ 2017-03-23  0:28       ` Joel Fernandes (Google)
  2017-03-23 10:32         ` Patrick Bellasi
  2017-03-30 21:15       ` Paul Turner
  1 sibling, 1 reply; 66+ messages in thread
From: Joel Fernandes (Google) @ 2017-03-23  0:28 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, Linux Kernel Mailing List, linux-pm, Ingo Molnar,
	Peter Zijlstra

Hi,

On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +0000, Patrick Bellasi wrote:
[..]
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>

But then we would lose out on being able to attach capacity
constraints to specific tasks or groups of tasks?

> which can be used to define tunable root constraints when CGroups are
> not available, and become RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.

I think the concern raised is more about whether CGroups is the right
interface to use for attaching capacity constraints to tasks or groups
of tasks, or whether there is a better way to attach such constraints.

I am actually looking at a workload where it's desirable to attach such
constraints to only one thread or task. In this case it would be a bit
of overkill to use CGroups to attach such a property just for one task
with specific constraints, and it would be beneficial if, along with the
CGroup interface, there were also an interface to attach it to
individual tasks. The other advantage of such an interface is that we
don't have to create a separate CGroup for every new constraint limit
and can have several tasks with different unique constraints.

Regards,
Joel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-23  0:28       ` Joel Fernandes (Google)
@ 2017-03-23 10:32         ` Patrick Bellasi
  2017-03-23 16:01           ` Tejun Heo
  2017-03-24  7:02           ` Joel Fernandes (Google)
  0 siblings, 2 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-23 10:32 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Tejun Heo, Linux Kernel Mailing List, linux-pm, Ingo Molnar,
	Peter Zijlstra

On 22-Mar 17:28, Joel Fernandes (Google) wrote:
> Hi,
> 
> On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > On 20-Mar 13:15, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Tue, Feb 28, 2017 at 02:38:38PM +0000, Patrick Bellasi wrote:
> [..]
> >> > These attributes:
> >> > a) are tunable at all hierarchy levels, i.e. root group too
> >>
> >> This usually is problematic because there should be a non-cgroup way
> >> of configuring the feature in case cgroup isn't configured or used,
> >> and it becomes awkward to have two separate mechanisms configuring the
> >> same thing.  Maybe the feature is cgroup specific enough that it makes
> >> sense here but this needs more explanation / justification.
> >
> > In the previous proposal I used to expose global tunables under
> > procfs, e.g.:
> >
> >  /proc/sys/kernel/sched_capacity_min
> >  /proc/sys/kernel/sched_capacity_max
> >
> 
> But then we would lose out on being able to attach capacity
> constraints to specific tasks or groups of tasks?

Yes, right. If CGroups are not available then you cannot specify
per-task constraints; this is just a system-wide global tunable.

The question is: does this overall proposal make sense outside the
scope of task group classification? (more on that afterwards)

> > which can be used to define tunable root constraints when CGroups are
> > not available, and become RO when CGroups are.
> >
> > Can this be eventually an acceptable option?
> >
> > In any case I think that this feature will be mainly targeting CGroup
> > based systems. Indeed, one of the main goals is to collect
> > "application specific" information from "informed run-times". Being
> > "application specific" means that we need a way to classify
> > applications depending on the runtime context... and that capability
> > in Linux is ultimately provided via the CGroup interface.
> 
> I think the concern raised is more about whether CGroups is the right
> interface to use for attaching capacity constraints to tasks or groups
> of tasks, or whether there is a better way to attach such constraints.

Notice that CGroups-based classification makes it easy to enforce
the concept of "delegation containment". I think this feature would
be nice to have whatever interface we choose.

However, we could potentially define a proper per-task API; are you
thinking of something specific?

> I am actually looking at a workload where it's desirable to attach such
> constraints to only one thread or task. In this case it would be a bit
> of overkill to use CGroups to attach such a property just for one task
> with specific constraints

Well, perhaps it depends on how and when CGroups are created.
If we think about using a proper "Organize Once and Control" model
(i.e. every app gets its own CGroup at creation time and lives there
for the rest of its lifetime, while the user-space run-time
eventually tunes the assigned resources), then run-time overheads
should not be a major concern.

AFAIK, CGroups' main overheads are associated with task migration and
tuning. Tuning will not be an issue for the kind of actions required
by capacity clamping. Regarding migrations, they should be avoided as
much as possible when CGroups are in use.

> and it would be beneficial that along with the
> CGroup interface, there's also an interface to attach it to individual
> tasks.

IMO a dual interface to do the same things would be at least confusing
and also more complicated to maintain.

> The other advantage of such interface is we don't have to
> create a separate CGroup for every new constraint limit and can have
> several tasks with different unique constraints.

That's still possible using CGroups, and IMO it will not be the "most
common case".
Don't you think that, in general, we will need to set constraints at
the application level, i.e. for groups of tasks?

As a general rule we should probably go for an interface which makes
the most common case easy.

> Regards,
> Joel

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-23 10:32         ` Patrick Bellasi
@ 2017-03-23 16:01           ` Tejun Heo
  2017-03-23 18:15             ` Patrick Bellasi
  2017-03-24  7:02           ` Joel Fernandes (Google)
  1 sibling, 1 reply; 66+ messages in thread
From: Tejun Heo @ 2017-03-23 16:01 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra

Hello,

On Thu, Mar 23, 2017 at 10:32:54AM +0000, Patrick Bellasi wrote:
> > But then we would lose out on being able to attach capacity
> > constraints to specific tasks or groups of tasks?
> 
> Yes, right. If CGroups are not available then you cannot specify
> per-task constraints; this is just a system-wide global tunable.
> 
> The question is: does this overall proposal make sense outside the
> scope of task group classification? (more on that afterwards)

I think it does, given that it's a per-thread property which requires
internal application knowledge to tune.

> > I think the concern raised is more about whether CGroups is the right
> > interface to use for attaching capacity constraints to tasks or groups
> > of tasks, or whether there is a better way to attach such constraints.
> 
> Notice that CGroups-based classification makes it easy to enforce
> the concept of "delegation containment". I think this feature would
> be nice to have whatever interface we choose.
> 
> However, we could potentially define a proper per-task API; are you
> thinking of something specific?

I don't think the overall outcome was too good when we used cgroup as
the direct way of configuring certain attributes - it either excludes
the possibility of easily accessible API from application side or
conflicts with the attributes set through such API.  It's a lot
clearer when cgroup just sets what's allowed under the hierarchy.

This is also in line with the aspect that cgroup for the most part is
a scoping mechanism - it's the most straight-forward to implement and
use when the behavior inside cgroup matches a system without cgroup,
just scoped.  It shows up here too.  If you take out the cgroup part,
you're left with an interface which is hardly useful.  cgroup isn't
scoping the global system here.  It's becoming the primary interface
for this feature which most likely isn't a good sign.

So, my suggestion is to implement it as a per-task API.  If the
feature calls for scoped restrictions, we definitely can add cgroup
support for that but I'm really not convinced about using cgroup as
the primary interface for this.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-23 16:01           ` Tejun Heo
@ 2017-03-23 18:15             ` Patrick Bellasi
  2017-03-23 18:39               ` Tejun Heo
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-03-23 18:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra

On 23-Mar 12:01, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Thu, Mar 23, 2017 at 10:32:54AM +0000, Patrick Bellasi wrote:
> > > But then we would lose out on being able to attach capacity
> > > constraints to specific tasks or groups of tasks?
> > 
> > Yes, right. If CGroups are not available then you cannot specify
> > per-task constraints; this is just a system-wide global tunable.
> > 
> > The question is: does this overall proposal make sense outside the
> > scope of task group classification? (more on that afterwards)
> 
> I think it does, given that it's a per-thread property which requires
> internal application knowledge to tune.

Yes and no... perhaps I'm biased towards some specific usage scenarios,
but where I find this interface most useful is not when apps tune
themselves but rather when an "external actor" (which I usually call an
"informed run-time") controls these apps.

> > > I think the concern raised is more about whether CGroups is the right
> > > interface to use for attaching capacity constraints to tasks or groups
> > > of tasks, or whether there is a better way to attach such constraints.
> > 
> > Notice that CGroups-based classification makes it easy to enforce
> > the concept of "delegation containment". I think this feature would
> > be nice to have whatever interface we choose.
> > 
> > However, we could potentially define a proper per-task API; are you
> > thinking of something specific?
> 
> I don't think the overall outcome was too good when we used cgroup as
> the direct way of configuring certain attributes - it either excludes
> the possibility of easily accessible API from application side or

That's actually one of the main point: does it make sense to expose
such an API to applications at all?

What we are after is a properly defined interface where kernel-space and
user-space can potentially close this control loop:

a) a "privileged" user-space, which has much more a-priori information about
   task requirements, can feed some constraints to kernel-space

b) kernel-space, which has optimized and efficient mechanisms,
   enforces these constraints on a per-task basis

Here is a graphical representation of these concepts:

      +-------------+    +-------------+  +-------------+
      | App1 Tasks  ++   | App2 Tasks  ++ | App3 Tasks  ++
      |             ||   |             || |             ||
      +--------------|   +--------------| +--------------|
       +-------------+    +-------------+  +-------------+
                |               |              |
  +----------------------------------------------------------+
  |                                                          |
  |      +--------------------------------------------+      |
  |      |  +-------------------------------------+   |      |
  |      |  |      Run-Time Optimized Services    |   |      |
  |      |  |        (e.g. execution model)       |   |      |
  |      |  +-------------------------------------+   |      |
  |      |                                            |      |
  |      |     Informed Run-Time Resource Manager     |      |
  |      |  (Android, ChromeOS, Kubernetes, etc...)   |      |
  |      +------------------------------------------^-+      |
  |        |                                        |        |
  |        |Constraints                             |        |
  |        |(OPP and Task Placement biasing)        |        |
  |        |                                        |        |
  |        |                             Monitoring |        |
  |      +-v------------------------------------------+      |
  |      |               Linux Kernel                 |      |
  |      |         (Scheduler, schedutil, ...)        |      |
  |      +--------------------------------------------+      |
  |                                                          |
  | Closed control and optimization loop                     |
  +----------------------------------------------------------+

What is important to notice is that there is a middleware in between
the kernel and the applications. This is a special kind of user-space
where it is still safe for the kernel to delegate some "decisions".

The ultimate user of the proposed interface will be such a middleware,
not each and every application. That's why I think the "containment"
feature provided by CGroups is a good fit for this kind of design.

> conflicts with the attributes set through such API.

In this "run-time resource management" schema, generic applications do
not access the proposed API, which is reserved to the privileged
user-space.

Applications can eventually request better services from the middleware,
using a completely different and more abstract API, which can also be
domain specific.


> It's a lot clearer when cgroup just sets what's allowed under the hierarchy.
> This is also in line with the aspect that cgroup for the most part is
> a scoping mechanism - it's the most straight-forward to implement and
> use when the behavior inside cgroup matches a system without cgroup,
> just scoped.

I like this concept of "CGroups being a scoping mechanism" and I think it
perfectly matches this use-case as well...

>  It shows up here too.  If you take out the cgroup part,
> you're left with an interface which is hardly useful.  cgroup isn't
> scoping the global system here.

It is, indeed:

1) Applications never see CGroups.
   They use whatever resources are available, just as when CGroups are
   not in use.

2) When an "Informed Run-time Resource Manager" schema is used, the same
   applications are scoped in the sense that they become "managed
   applications".

   Managed applications are still completely unaware of the CGroup
   interface; they do not rely on that interface for what they have to
   do. However, in this scenario, there is a supervisor which knows how
   much an application can get at each and every instant.

> It's becoming the primary interface
> for this feature which most likely isn't a good sign.

It's a primary interface yes, but not for apps, only for an (optional)
run-time resource manager.

What we want to enable with this interface is exactly the possibility for a
privileged user-space entity to "scope" different applications.

Described like that, one could argue that we could still implement this
model using a custom per-task API. However, this proposal is about
"tuning/partitioning" a resource which is already (I would say only)
controllable using the CPU controller.
That's also why the proposed interface has now been defined as an
extension of the CPU controller, in such a way as to keep a consistent
view.

This controller is already used by run-times like Android to "scope"
apps by constraining the amount of CPU resources they are getting.
Is that not a legitimate usage of the cpu controller?

What we are doing here is just extending it a bit in such a way that, while:

  {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU

we can also use:

  capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
                     during that time.

> So, my suggestion is to implement it as a per-task API.  If the
> feature calls for scoped restrictions, we definitely can add cgroup
> support for that but I'm really not convinced about using cgroup as
> the primary interface for this.

Given this viewpoint, I can definitely see a "scoped restrictions"
usage, as well as the idea that this can be a unique and primary
interface. Again, it would not be exposed generically to apps but would
target a proper integration of user-space run-time resource managers.

I hope this helped to clarify the scope better.  Do you still see the
CGroup API as not the best fit for such a usage?

> Thanks.
> 
> --
> tejun

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-23 18:15             ` Patrick Bellasi
@ 2017-03-23 18:39               ` Tejun Heo
  2017-03-24  6:37                 ` Joel Fernandes (Google)
  2017-03-30 21:13                 ` Paul Turner
  0 siblings, 2 replies; 66+ messages in thread
From: Tejun Heo @ 2017-03-23 18:39 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra

Hello, Patrick.

On Thu, Mar 23, 2017 at 06:15:33PM +0000, Patrick Bellasi wrote:
> What is important to notice is that there is a middleware in between
> the kernel and the applications. This is a special kind of user-space
> where it is still safe for the kernel to delegate some "decisions".
> 
> The ultimate user of the proposed interface will be such a middleware,
> not each and every application. That's why I think the "containment"
> feature provided by CGroups is a good fit for this kind of design.

cgroup isn't required for this type of use.  We've always had this
sort of usage in combination with mechanisms to restrict what
non-priv applications can do.  The usage is perfectly valid but
whether to use cgroup as the sole interface is a different issue.

Yes, cgroup interface can be used this way; however, it does exclude,
or at least makes pretty cumbersome, different use cases which can be
served by a regular API.  And that isn't the case when we approach it
from the other direction.

> I like this concept of "CGroups being a scoping mechanism" and I think it
> perfectly matches this use-case as well...
> 
> >  It shows up here too.  If you take out the cgroup part,
> > you're left with an interface which is hardly useful.  cgroup isn't
> > scoping the global system here.
> 
> It is, indeed:
> 
> 1) Applications never see CGroups.
>    They use whatever resources are available, just as when CGroups are
>    not in use.
> 
> 2) When an "Informed Run-time Resource Manager" schema is used, the same
>    applications are scoped in the sense that they become "managed
>    applications".
> 
>    Managed applications are still completely unaware of the CGroup
>    interface; they do not rely on that interface for what they have to
>    do. However, in this scenario, there is a supervisor which knows how
>    much an application can get at each and every instant.

But it isn't useful if you take cgroup out of the picture.  cgroup
isn't scoping a feature.  The feature is buried in the cgroup itself.
I don't think it's useful to argue over the fine semantics.  Please
see below.

> > It's becoming the primary interface
> > for this feature which most likely isn't a good sign.
> 
> It's a primary interface yes, but not for apps, only for an (optional)
> run-time resource manager.
> 
> What we want to enable with this interface is exactly the possibility for a
> privileged user-space entity to "scope" different applications.
> 
> Described like that, one could argue that we could still implement this
> model using a custom per-task API. However, this proposal is about
> "tuning/partitioning" a resource which is already (I would say only)
> controllable using the CPU controller.
> That's also why the proposed interface has now been defined as an
> extension of the CPU controller, in such a way as to keep a consistent
> view.
> 
> This controller is already used by run-times like Android to "scope"
> apps by constraining the amount of CPU resources they are getting.
> Is that not a legitimate usage of the cpu controller?
> 
> What we are doing here is just extending it a bit in such a way that, while:
> 
>   {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU
> 
> we can also use:
> 
>   capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
>                      during that time.

Yes, we do have bandwidth restriction as a cgroup only feature, which
is different from how we handle nice levels and weights.  Given the
nature of bandwidth limits, if necessary, it is straight-forward to
expose per-task interface.

capacity min/max isn't the same thing.  It isn't a limit on countable
units of a specific resource and that's why the interface you
suggested for .min is different.  It's restricting attribute set which
can be picked in the subhierarchy rather than controlling distribution
of atoms of the resource.

That's also why we're gonna have a problem if we later decide we need a
thread-based API for it.  Once we make cgroup the primary owner of the
attribute, it's not straightforward to add another owner.

> > So, my suggestion is to implement it as a per-task API.  If the
> > feature calls for scoped restrictions, we definitely can add cgroup
> > support for that but I'm really not convinced about using cgroup as
> > the primary interface for this.
> 
> Given this viewpoint, I can definitely see a "scoped restrictions"
> usage, as well as the idea that this can be a unique and primary
> interface. Again, it would not be exposed generically to apps but would
> target a proper integration of user-space run-time resource managers.
> 
> I hope this helped to clarify the scope better.  Do you still see the
> CGroup API as not the best fit for such a usage?

Yes, I still think so.  It'd be best to first figure out how the
attribute should be configured, inherited and restricted using the
normal APIs and then layer scoped restrictions on top with cgroup.
cgroup shouldn't be used as a way to bypass or get in the way of a
proper API.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-23 18:39               ` Tejun Heo
@ 2017-03-24  6:37                 ` Joel Fernandes (Google)
  2017-03-24 15:00                   ` Tejun Heo
  2017-03-30 21:13                 ` Paul Turner
  1 sibling, 1 reply; 66+ messages in thread
From: Joel Fernandes (Google) @ 2017-03-24  6:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Patrick Bellasi, Linux Kernel Mailing List, linux-pm,
	Ingo Molnar, Peter Zijlstra

Hi Tejun,

>> That's also why the proposed interface has now been defined as an
>> extension of the CPU controller, in such a way as to keep a consistent
>> view.
>>
>> This controller is already used by run-times like Android to "scope"
>> apps by constraining the amount of CPU resources they are getting.
>> Is that not a legitimate usage of the cpu controller?
>>
>> What we are doing here is just extending it a bit in such a way that, while:
>>
>>   {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU
>>
>> we can also use:
>>
>>   capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
>>                      during that time.
>
> Yes, we do have bandwidth restriction as a cgroup only feature, which
> is different from how we handle nice levels and weights.  Given the
> nature of bandwidth limits, if necessary, it is straight-forward to
> expose per-task interface.
>
> capacity min/max isn't the same thing.  It isn't a limit on countable
> units of a specific resource and that's why the interface you
> suggested for .min is different.  It's restricting attribute set which
> can be picked in the subhierarchy rather than controlling distribution
> of atoms of the resource.
>
> That's also why we're gonna have a problem if we later decide we need a
> thread-based API for it.  Once we make cgroup the primary owner of the
> attribute, it's not straightforward to add another owner.

Sorry, I don't immediately see why it is not straightforward to have a
per-task API later, once the CGroup interface is added. Would you mind
giving an example? That would help.

I can start with an example. Say you have a single-level hierarchy
(in Android terms, TOP-APP is the set of tasks that are user facing,
for which we'd like to enforce some capacity minimums; BACKGROUND, on
the other hand, is the opposite):

                   ROOT (min = 0, max = 1024)
                   /                         \
                  /                           \
              TOP-APP (min = 200, max = 1024)  BACKGROUND (min = 0, max = 500)

If, in the future, we want a per-task API to individually configure a
task with these limits, it seems it will be straightforward to
implement IMO.
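To illustrate how such a per-task addition could later compose with the hierarchy above (a sketch of my own, not something the current patches implement), the cgroup clamp would simply scope the per-task request:

```python
def effective_clamp(task_min, task_max, group_min, group_max):
    """Resolve a per-task clamp request against its cgroup's range.

    The cgroup acts purely as a scoping mechanism here: the task's own
    request is honoured only within [group_min, group_max]."""
    eff_min = max(group_min, min(task_min, group_max))
    eff_max = max(group_min, min(task_max, group_max))
    return eff_min, eff_max

# A TOP-APP task asking for min=400 within (min=200, max=1024) keeps it:
#   effective_clamp(400, 1024, 200, 1024) -> (400, 1024)
# A BACKGROUND task asking for min=800 within (min=0, max=500) is capped:
#   effective_clamp(800, 1024, 0, 500) -> (500, 500)
```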

As Patrick mentioned, all of the use cases needing this right now
involve an informed runtime placing a task in a group of tasks, without
needing to set attributes for each individual task. We are already
placing tasks in individual CGroups in Android based on the information
the runtime has, so adding the capacity constraints will make them fit
naturally, while leaving the door open for any future per-task API
additions IMO.

Thanks,

Joel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-23 10:32         ` Patrick Bellasi
  2017-03-23 16:01           ` Tejun Heo
@ 2017-03-24  7:02           ` Joel Fernandes (Google)
  1 sibling, 0 replies; 66+ messages in thread
From: Joel Fernandes (Google) @ 2017-03-24  7:02 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, Linux Kernel Mailing List, linux-pm, Ingo Molnar,
	Peter Zijlstra

Hi Patrick,

On Thu, Mar 23, 2017 at 3:32 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
[..]
>> > which can be used to define tunable root constraints when CGroups are
>> > not available, and become RO when CGroups are.
>> >
>> > Can this be eventually an acceptable option?
>> >
>> > In any case I think that this feature will be mainly targeting CGroup
>> > based systems. Indeed, one of the main goals is to collect
>> > "application specific" information from "informed run-times". Being
>> > "application specific" means that we need a way to classify
>> > applications depending on the runtime context... and that capability
>> > in Linux is ultimately provided via the CGroup interface.
>>
>> I think the concern raised is more about whether CGroups is the right
>> interface to use for attaching capacity constraints to task or groups
>> of tasks, or is there a better way to attach such constraints?
>
> Notice that CGroups-based classification makes it easy to enforce
> the concept of "delegation containment". I think this feature would
> be nice to have whatever interface we choose.
>
> However, we could potentially define a proper per-task API; are you
> thinking of something specific?
>

I was thinking: how about adding per-task constraints to the resource
limits API, if that makes sense? There's already RLIMIT_CPU and
RLIMIT_NICE. An informed runtime could then modify the limits of tasks
using prlimit.

>> The other advantage of such interface is we don't have to
>> create a separate CGroup for every new constraint limit and can have
>> several tasks with different unique constraints.
>
> That's still possible using CGroups and IMO it will not be the "most
> common case".
> Don't you think that in general we will need to set constraints at
> the application level, i.e. for groups of tasks?

Some applications could be a single task; also, not all tasks in an
application may need constraints, right?

> As a general rule we should probably go for an interface which makes
> the most common case easy.

I agree.

Thanks,
Joel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-24  6:37                 ` Joel Fernandes (Google)
@ 2017-03-24 15:00                   ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2017-03-24 15:00 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Patrick Bellasi, Linux Kernel Mailing List, linux-pm,
	Ingo Molnar, Peter Zijlstra

Hello,

On Thu, Mar 23, 2017 at 11:37:50PM -0700, Joel Fernandes (Google) wrote:
> > That's also why we're gonna have problem if we later decide we need a
> > thread based API for it.  Once we make cgroup the primary owner of the
> > attribute, it's not straight forward to add another owner.
> 
> > Sorry, I don't immediately see why it is not straightforward to have
> > a per-task API later once the CGroup interface is added? Maybe if you
> > don't mind giving an example, that will help?
> 
> I can start with an example, say you have a single level hierarchy
> (Top-app in Android terms is the set of tasks that are user facing and
> we'd like to enforce some capacity minimums, background on the other
> hand is the opposite):
> 
>                    ROOT (min = 0, max = 1024)
>                    /                         \
>                   /                           \
>               TOP-APP (min = 200, max = 1024)  BACKGROUND (min = 0, max = 500)
> 
> If, in the future, we want to have a per-task API to individually
> configure tasks with these limits, it seems it will be
> straightforward to implement IMO.

Ah, you're right.  I got fixated on controllers which control single
value attributes (the net ones).  Yeah, we can extend the same range
interface to thread level.  Sorry about the confusion.

Not necessarily specific to the API discussion, but in general a
lower level in the configuration hierarchy shouldn't be able to
restrict what the parents can do, so we need to think more about how
delegation should work.

> As Patrick mentioned, all of the use cases needing this right now
> involve an informed runtime placing a task in a group of tasks, not
> setting attributes for each individual task. We are already placing
> tasks in individual CGroups in Android based on the information the
> runtime has, so adding the capacity constraints will fit naturally
> while leaving the door open for any future per-task API additions IMO.

But the question is that while the use case can be served by cgroup
as the primary API, it doesn't have to be.  This still is an attribute
range restriction controller rather than an actual hierarchical
resource distributor.  All the controller does is restrict the
per-task configuration, which is the only operative thing.  This is
different from bandwidth, which is an actual resource controller that
distributes cpu cycles hierarchically.

There can be some benefits to having cgroup restrictions on capacity
but they won't add anything functionally fundamental.  So, I still
don't see why making cgroup the primary interface would be a good idea
here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-21 11:01           ` Patrick Bellasi
@ 2017-03-24 23:52             ` Rafael J. Wysocki
  0 siblings, 0 replies; 66+ messages in thread
From: Rafael J. Wysocki @ 2017-03-24 23:52 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Rafael J. Wysocki, Joel Fernandes, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux PM, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Rafael J . Wysocki, Paul Turner, Jonathan Corbet,
	Vincent Guittot, John Stultz, Todd Kjos, Tim Murray,
	Andres Oportus, Juri Lelli, Morten Rasmussen, Dietmar Eggemann

On Tue, Mar 21, 2017 at 12:01 PM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 20-Mar 23:51, Rafael J. Wysocki wrote:

[cut]

>> So if you want to say "please don't sacrifice performance for these
>> top-apps" to it, chances are it will not understand what you are
>> asking it for. :-)
>
> Actually, this series provides the foundation bits of a more
> complete solution, already in use on Pixel phones.
>
> While this proposal focuses just on "OPP biasing", some additional
> bits (not yet posted to keep things simple) exploit the Energy Model
> information to provide support for "task placement biasing".
>
> Those bits address also the concept of:
>
>    how much energy I want to sacrifice to get a certain speedup?

Well, OK, but this reads somewhat like "you can't appreciate that
fully, because you don't know the whole picture". :-)

Which very well may be the case and which is why I'm asking all of
these questions about the motivation etc.: I want to know the whole
picture, because I need context to make up my mind about this
particular part of it in a reasonable way.

[cut]

>> What you are saying generally indicates that you see under-provisioned
>> tasks and that's rather not because the kernel tries to sacrifice
>> performance for energy.  Maybe the CPU utilization is under-estimated
>> by schedutil or the scheduler doesn't give enough time to these
>> particular tasks for some reason.  In any case, having a way to set a
>> limit from user space may allow you to work around these issues quite
>> bluntly and is not a solution.  And even if the underlying problems
>> are solved, the user space interface will stay there and will have to
>> be maintained going forward.
>
> I don't agree on that point, mainly because I don't see that as a
> workaround. In your view it seems that everything can be solved
> entirely in kernel space.

Now, I haven't said that and it doesn't really reflect my view.

What I actually had in mind was that the particular problems mentioned
by Joel might very well be consequences of what the kernel did even
though it shouldn't be doing that.  If so, then fixing the kernel may
eliminate the problems in question and there may be nothing left on
the table to address with the minimum capacity limit.

> In my view instead what we are after is a
> properly defined interface where kernel-space and user-space can
> potentially close a control loop where:
> a) user-space, which has much more a-priori information about task
>    requirements, can feed some constraints to kernel-space.
> b) kernel-space, which has optimized and efficient mechanisms, enforces
>    these constraints on a per-task basis.

I can agree in principle that *some* kind of interface between the
kernel and user space would be good to have in this area, but I'm not
quite sure about how that interface should look like.

It seems that what needs to be passed is information on what user
space regards as a reasonable energy-for-performance tradeoff,
per-task or overall.

I'm not convinced about the suitability of min/max capacity for this
purpose in general.

> After all, this is not a new concept in OS design; we already have
> different interfaces which allow tuning scheduler behavior on a
> per-task basis. What we are missing right now is a similar _per-task
> interface_ to bias OPP selection and a slightly improved/alternative
> way to bias task placement _without_ making scheduling decisions in
> user-space.
>
> Here is a graphical representation of these concepts:
>
>       +-------------+    +-------------+  +-------------+
>       | App1 Tasks  ++   | App2 Tasks  ++ | App3 Tasks  ++
>       |             ||   |             || |             ||
>       +--------------|   +--------------| +--------------|
>        +-------------+    +-------------+  +-------------+
>                 |               |              |
>   +----------------------------------------------------------+
>   |                                                          |
>   |      +--------------------------------------------+      |
>   |      |  +-------------------------------------+   |      |
>   |      |  |      Run-Time Optimized Services    |   |      |
>   |      |  |        (e.g. execution model)       |   |      |
>   |      |  +-------------------------------------+   |      |
>   |      |                                            |      |
>   |      |     Informed Run-Time Resource Manager     |      |
>   |      |   (Android, ChromeOS, Kubernetes, etc...)  |      |
>   |      +------------------------------------------^-+      |
>   |        |                                        |        |
>   |        |Constraints                             |        |
>   |        |(OPP and Task Placement biasing)        |        |
>   |        |                                        |        |
>   |        |                             Monitoring |        |
>   |      +-v------------------------------------------+      |
>   |      |               Linux Kernel                 |      |
>   |      |         (Scheduler, schedutil, ...)        |      |
>   |      +--------------------------------------------+      |
>   |                                                          |
>   | Closed control and optimization loop                     |
>   +----------------------------------------------------------+
>
> What is important to notice is that there is a middleware, in between
> the kernel and the applications. This is a special kind of user-space
> where it is still safe for the kernel to delegate some "decisions".

So having spent a good part of the last 10 years on writing kernel
code that, among other things, talks to these middlewares (like
autosleep and the support for wakelocks for an obvious example), I'm
quite aware of all that and also quite familiar with the diagram
above.

And while I don't want to start a discussion about whether or not
these middlewares are really as smart as the claims go, let me share a
personal opinion here.  In my experience, they usually tend to be
quite well-informed about the applications shipped along with them,
but not so much about stuff installed by users later, which sometimes
ruins the party like a motorcycle gang dropping in without invitation.

>> Also when you set a minimum frequency limit from user space, you may
>> easily over-provision the task and that would defeat the purpose of
>> what the kernel tries to achieve.
>
> No, if an "informed user-space" wants to over-provision a task it's
> because it has already decided that it makes sense to limit the
> kernel's energy optimizations for that specific class of tasks.
> It is not necessarily the kernel's business to know why; it is just
> required to do its best within the provided constraints.

My point is that if user space sets the limit to over-provision a
task, then having the kernel do the whole work to prevent that from
happening is rather pointless.

[cut]

>
>> >>> b) Capping the OPP selection for certain non-critical tasks, which is
>> >>>    a major concern, especially for RT tasks in a mobile context, but
>> >>>    it also applies to FAIR tasks representing background activities.
>> >>
>> >> Well, is the information on how much CPU capacity assign to those
>> >> tasks really there in user space?  What's the source of it if so?
>> >
>> > I believe this is just a matter of tuning and modeling for what is
>> > needed. For ex. to prevent thermal throttling as I mentioned, and
>> > also to ensure background activities aren't running at the highest
>> > frequency and consuming excessive energy (racing to idle at a higher
>> > frequency costs more energy than running slower to idle, since we
>> > run at higher voltages at higher frequencies and the slope of the
>> > power/perf curve is steeper - p = c * V^2 * F). So the V component
>> > being higher just drains more power quadratically, which is of no
>> > use to background tasks - in fact in some tests, we're just as happy
>> > with setting them at much lower frequencies than what load-tracking
>> > thinks is needed.
>>
>> As I said, I actually can see a need to go lower than what performance
>> scaling thinks, because the way it tries to estimate the sufficient
>> capacity is by checking how much utilization is there for the
>> currently provided capacity and adjusting if necessary.  OTOH, there
>> are applications aggressive enough to be able to utilize *any*
>> capacity provided to them.
>
> Here you are not considering the control role exercised by the
> middleware layer.

Indeed.  I was describing what happened without it. :-)

[cut]

>
> Interesting discussion, thanks! ;-)

Yup, thanks!

Take care,
Rafael

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-23 18:39               ` Tejun Heo
  2017-03-24  6:37                 ` Joel Fernandes (Google)
@ 2017-03-30 21:13                 ` Paul Turner
  1 sibling, 0 replies; 66+ messages in thread
From: Paul Turner @ 2017-03-30 21:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Patrick Bellasi, Joel Fernandes (Google),
	Linux Kernel Mailing List, linux-pm, Ingo Molnar, Peter Zijlstra

There is one important, fundamental difference here:
 {cfs,rt}_{period,runtime}_us is a property that applies to a group of
threads, it can be sub-divided.
We can consume 100ms of quota either by having one thread run for
100ms, or 2 threads running for 50ms.

This is not true for capacity.  It's a tag that affects the individual
threads it's applied to.
I'm also not sure if it's a hard constraint.  For example, suppose we
set a max that is smaller than a "big" cpu on an asymmetric system.
In the case that the faster CPU is relatively busy, but still
opportunistically available, we would still want to schedule it there.

This definitely seems to make more sense as a per-thread interface in
its current form.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-20 18:08     ` Patrick Bellasi
  2017-03-23  0:28       ` Joel Fernandes (Google)
@ 2017-03-30 21:15       ` Paul Turner
  2017-04-01 16:25         ` Patrick Bellasi
  1 sibling, 1 reply; 66+ messages in thread
From: Paul Turner @ 2017-03-30 21:15 UTC (permalink / raw)
  To: Patrick Bellasi; +Cc: Tejun Heo, LKML, linux-pm, Ingo Molnar, Peter Zijlstra

On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +0000, Patrick Bellasi wrote:
>> > This patch extends the CPU controller by adding a couple of new
>> > attributes, capacity_min and capacity_max, which can be used to enforce
>> > bandwidth boosting and capping. More specifically:
>> >
>> > - capacity_min: defines the minimum capacity which should be granted
>> >                 (by schedutil) when a task in this group is running,
>> >                 i.e. the task will run at least at that capacity
>> >
>> > - capacity_max: defines the maximum capacity which can be granted
>> >                 (by schedutil) when a task in this group is running,
>> >                 i.e. the task can run up to that capacity
>>
>> cpu.capacity.min and cpu.capacity.max are the more conventional names.
>
> Ok, should be an easy renaming.
>
>> I'm not sure about the name capacity as it doesn't encode what it does
>> and is difficult to tell apart from cpu bandwidth limits.  I think
>> it'd be better to represent what it controls more explicitly.
>
> In the scheduler jargon, capacity represents the amount of computation
> that a CPU can provide and it's usually defined to be 1024 for the
> biggest CPU (on non SMP systems) running at the highest OPP (i.e.
> maximum frequency).
>
> It's true that it kind of overlaps with the concept of "bandwidth".
> However, the main difference here is that "bandwidth" is not frequency
> (and architecture) scaled.
> Thus, for example, assuming we have only one CPU with these two OPPs:
>
>    OPP | Frequency | Capacity
>      1 |    500MHz |      512
>      2 |      1GHz |     1024

I think exposing capacity in this manner is extremely challenging.
It's not normalized in any way between architectures, which places a
lot of the ABI in the API.

Have you considered any schemes for normalizing this in a reasonable fashion?
>
> a task running 60% of the time on that CPU when configured to run at
> 500MHz is, from the bandwidth standpoint, using 60% bandwidth but,
> from the capacity standpoint, using only 30% of the available
> capacity.
>
> IOW, bandwidth is purely time based while capacity factors in both
> frequency and architectural differences.
> Thus, while a "bandwidth" constraint limits the amount of time a task
> can use a CPU, independently of the "actual computation" performed,
> the new "capacity" constraints let us enforce how much "actual
> computation" a task can perform per "unit of time".
>
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>
> which can be used to define tunable root constraints when CGroups are
> not available, and becomes RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.
>
>> > b) allow to create subgroups of tasks which are not violating the
>> >    capacity constraints defined by the parent group.
>> >    Thus, tasks on a subgroup can only be more boosted and/or more
>>
>> For both limits and protections, the parent caps the maximum the
>> children can get.  At least that's what memcg does for memory.low.
>> Doing that makes sense for memcg because for memory the parent can
>> still do protections regardless of what its children are doing and it
>> makes delegation safe by default.
>
> Just to be clearer, the current proposal enforces:
>
> - capacity_max_child <= capacity_max_parent
>
>   Since, if a task is constrained to get only up to a certain amount
>   of capacity, then its children cannot use more than that... they
>   can only be further constrained.
>
> - capacity_min_child >= capacity_min_parent
>
>   Since, if a task has been boosted to run at least that fast, then
>   its children cannot be constrained to go slower without potentially
>   impacting the parent's performance.
>
>> I understand why you would want a property like capacity to be the
>> other direction as that way you get more specific as you walk down the
>> tree for both limits and protections;
>
> Right, the protection schema is defined in such a way as to never
> affect the parent's constraints.
>
>> however, I think we need to
>> think a bit more about it and ensure that the resulting interface
>> isn't confusing.
>
> Sure.
>
>> Would it work for capacity to behave the other
>> direction - ie. a parent's min restricting the highest min that its
>> descendants can get?  It's completely fine if that's weird.
>
> I had a thought about that possibility and it was not convincing me
> from the use-cases standpoint, at least for the ones I've considered.
>
> Reason is that capacity_min is used to implement a concept of
> "boosting" where, let's say, we want to "run a task faster than a
> minimum frequency". Assume that this constraint has been defined
> because we know that this task, and likely all its descendant
> threads, needs at least that capacity level to perform according to
> expectations.
>
> In that case the "refining down the hierarchy" can require boosting
> some threads further, but likely not less.
>
> Does this make sense?
>
> To me this seems to match quite well at least Android/ChromeOS
> specific use-cases. I'm not sure if there can be other different
> use-cases in the domain of, for example, managed containers.
>
>
>> Thanks.
>>
>> --
>> tejun
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
  2017-03-30 21:15       ` Paul Turner
@ 2017-04-01 16:25         ` Patrick Bellasi
  0 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-01 16:25 UTC (permalink / raw)
  To: Paul Turner; +Cc: Tejun Heo, LKML, linux-pm, Ingo Molnar, Peter Zijlstra

Hi Paul,

On 30-Mar 14:15, Paul Turner wrote:
> On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > On 20-Mar 13:15, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Tue, Feb 28, 2017 at 02:38:38PM +0000, Patrick Bellasi wrote:
> >> > This patch extends the CPU controller by adding a couple of new
> >> > attributes, capacity_min and capacity_max, which can be used to enforce
> >> > bandwidth boosting and capping. More specifically:
> >> >
> >> > - capacity_min: defines the minimum capacity which should be granted
> >> >                 (by schedutil) when a task in this group is running,
> >> >                 i.e. the task will run at least at that capacity
> >> >
> >> > - capacity_max: defines the maximum capacity which can be granted
> >> >                 (by schedutil) when a task in this group is running,
> >> >                 i.e. the task can run up to that capacity
> >>
> >> cpu.capacity.min and cpu.capacity.max are the more conventional names.
> >
> > Ok, should be an easy renaming.
> >
> >> I'm not sure about the name capacity as it doesn't encode what it does
> >> and is difficult to tell apart from cpu bandwidth limits.  I think
> >> it'd be better to represent what it controls more explicitly.
> >
> > In the scheduler jargon, capacity represents the amount of computation
> > that a CPU can provide and it's usually defined to be 1024 for the
> > biggest CPU (on non SMP systems) running at the highest OPP (i.e.
> > maximum frequency).
> >
> > It's true that it kind of overlaps with the concept of "bandwidth".
> > However, the main difference here is that "bandwidth" is not frequency
> > (and architecture) scaled.
> > Thus, for example, assuming we have only one CPU with these two OPPs:
> >
> >    OPP | Frequency | Capacity
> >      1 |    500MHz |      512
> >      2 |      1GHz |     1024
> 
> I think exposing capacity in this manner is extremely challenging.
> It's not normalized in any way between architectures, which places a
> lot of the ABI in the API.

Capacities of CPUs are already exposed, at least for ARM platforms,
using a platform-independent definition which is documented here:

   http://lxr.free-electrons.com/source/Documentation/devicetree/bindings/arm/cpus.txt#L245

As the notes in the documentation highlight, it's not a perfect
metric, but it still allows distinguishing between the computational
capabilities of different (micro)architectures and/or OPPs.

Within the scheduler we use SCHED_CAPACITY_SCALE:
   http://lxr.free-electrons.com/ident?i=SCHED_CAPACITY_SCALE
for everything related to actual CPU computational capabilities.

That's why in the current implementation we expose the same metric to
define capacity constraints.

We also considered the idea of exposing a more generic percentage
value in [0..100]; do you think that could be better?
Consider that, in the end, we would still have to scale 100% to 1024.

> Have you considered any schemes for normalizing this in a reasonable fashion?

For each specific target platform, capacities are already normalized
to 1024, which is the capacity of the most capable CPU running at the
highest OPP. Thus, 1024 always represents 100% of the computational
capabilities of the system's most performant CPU.

Perhaps I cannot completely get what you mean when you say that it
should be "normalized between architectures".
Can you explain better, maybe with an example?

[...]

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-03-20 17:22   ` Patrick Bellasi
@ 2017-04-10  7:36     ` Peter Zijlstra
  2017-04-11 17:58       ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-10  7:36 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Morten Rasmussen, Dietmar Eggemann

On Mon, Mar 20, 2017 at 05:22:33PM +0000, Patrick Bellasi wrote:

> a) Bias OPP selection.
>    Thus granting that certain critical tasks always run at least at a
>    specified frequency.
> 
> b) Bias TASKS placement, which requires an additional extension not
>    yet posted to keep things simple.
>    This allows heterogeneous systems, where different CPUs have
>    different capacities, to schedule important tasks in more capable
>    CPUs.

So I dislike the capacity min/max knobs because:

 1) capacity is more an implementation detail than a primary metric;
    illustrated per your above points in that it affects both, while in
    fact it actually modifies another metric, namely util_avg.

 2) they look like an absolute clamp for a best effort class; while it
    very much is not. This also leads to very confusing nesting
    properties re cgroup.

 3) they have absolutely unacceptable overhead in implementation. Two
    more RB tree operations per enqueue/dequeue is just not going to
    happen.

 4) they have muddled semantics, because while its presented as a task
    property, it very much is not. The max(min) and min(max) of all
    runnable tasks is applied to the aggregate util signal. Also see 2.


So 'nice' is a fairly well understood knob; it controls relative time
received between competing tasks (and we really should replace the
cpu controller's 'shares' file with a 'nice' file, see below).


I would much rather introduce something like nice-opp, which only
affects the OPP selection in a relative fashion. This immediately also
has a natural and obvious extension to cgroup hierarchies (just like
regular nice).


And could be folded as a factor in util_avg (just as we do with weight
in load_avg), although this will mess up everything else we use util for
:/. Or, alternatively, decompose the aggregate value upon usage and only
apply the factor for the current running task or something.... Blergh..
difficult..




---
 kernel/sched/core.c  | 29 +++++++++++++++++++++--------
 kernel/sched/fair.c  |  2 +-
 kernel/sched/sched.h |  1 +
 3 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b4dd55b0c7..20ed17369cb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6963,18 +6963,31 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
-				struct cftype *cftype, u64 shareval)
+static int cpu_nice_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cftype, s64 val)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	struct task_group *tg = css_tg(css);
+	unsigned long weight;
+
+	int ret;
+
+	if (val < MIN_NICE || val > MAX_NICE)
+		return -EINVAL;
+
+	weight = sched_prio_to_weight[val - MIN_NICE];
+
+	ret = sched_group_set_shares(tg, scale_load(weight));
+
+	if (!ret)
+		tg->nice = val;
+
+	return ret;
 }
 
-static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
+static s64 cpu_nice_read_s64(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (s64)tg->nice;
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -7252,9 +7265,9 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
-		.name = "shares",
-		.read_u64 = cpu_shares_read_u64,
-		.write_u64 = cpu_shares_write_u64,
+		.name = "nice",
+		.read_s64 = cpu_nice_read_s64,
+		.write_s64 = cpu_nice_write_s64,
 	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76f67b3e34d6..8088043f46d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9471,7 +9471,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, MIN_SHARES, scale_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 57caf3606114..27f1ffe763bc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -283,6 +283,7 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+	int nice;
 
 #ifdef	CONFIG_SMP
 	/*

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-10  7:36     ` Peter Zijlstra
@ 2017-04-11 17:58       ` Patrick Bellasi
  2017-04-12 12:10         ` Peter Zijlstra
                           ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-11 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

Hi Peter,
sorry for this pretty long response, but I think that if you have
time to patiently go through all the following comments and questions,
it will be a big step forward in defining what we really want.

On 10-Apr 09:36, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 05:22:33PM +0000, Patrick Bellasi wrote:
> 
> > a) Bias OPP selection.
> >    Thus granting that certain critical tasks always run at least at a
> >    specified frequency.
> > 
> > b) Bias TASKS placement, which requires an additional extension not
> >    yet posted to keep things simple.
> >    This allows heterogeneous systems, where different CPUs have
> >    different capacities, to schedule important tasks in more capable
> >    CPUs.
> 
> So I dislike the capacity min/max knobs because:

From your points below I see two main topics: "fundamental
concepts" and "implementation details".

Let's start with the "fundamental concepts", since I think we are not
yet aligned on the overall view and goals of this proposal.


.:: Fundamental Concepts

>  1) capacity is more an implementation detail than a primary metric;

Capacity has been defined in a "platform independent" way: it's a
metric normalized to 1024 for the most capable CPU running at the
maximum OPP.

To me it seems it can be considered a primary metric, maybe not in
its current form, i.e. by exposing a raw number in the [0..1024] range.

We should also consider that on the CPUFreq side we already expose
knobs like scaling_{min,max}_freq which are much more platform
dependent than capacity.

>     illustrated per your above points in that it affects both, while in
>     fact it actually modifies another metric, namely util_avg.

I don't see it modifying util_avg in any direct way.

Capacity_{min,max} are tracked independently from util_avg and used
"on demand" to "clamp" the original util_avg signal.

There are two main concepts I would like your opinion on:

1) we would like support to bias OPP selection and task placement

2) since util_avg directly affects both OPP selection and task
   placement, we can consider somehow "biasing" the "usage of"
   util_avg to achieve these goals

IMHO, if our goal is to bias OPP selection and task placement, we
should consider working "on top of" util_avg, thus getting a better
separation of concerns:
- util_avg is (already) used to define how utilization is "produced"
- capacity_{min,max} is used to _bias_ how utilization is "consumed"
  (i.e. by schedutil and the FAIR scheduler).


>  2) they look like an absolute clamp for a best effort class; while it
>     very much is not. This also leads to very confusing nesting
>     properties re cgroup.

If by absolute you mean something mandatory, I agree: we are still in
the best-effort domain. But perhaps it's just a naming issue.

We want to request a possible minimum capacity, for example to
prefer a big CPU over a LITTLE one.
However, being "not mandatory" does not mean to me that it has to be
necessarily "more abstract", like the one you are proposing below.

Since we are defining a "tuning interface", the _possible_ effect of
a certain value should be easy to understand and to map onto an
expected behavior.
To me the proposed capacity clamping satisfies this goal, while in a
previous proposal, where we advanced the idea of a more generic
"boost" value, there were complaints (e.g. from PJT) that the
possible end result was not really clear.

Sorry, I don't get which "confusing nesting properties" you are
referring to.

>  4) they have muddled semantics, because while its presented as a task
>     property, it very much is not.

Actually we have always presented it as a task group property, while
other people suggested it should be a per-task API.

Still, IMO it could also make sense as a per-task API, for example
considering a specific RT task which we know in our system is
perfectly fine to always run below a certain OPP.

Do you think it should be a per-task API?
Should we focus (at least initially) on providing a per task-group API?

>     The max(min) and min(max) of all
>     runnable tasks is applied to the aggregate util signal. Also see 2.

Regarding the max/min aggregations, they are an intended feature.
The idea is to implement a sort of (best-effort, of course)
inheritance mechanism.

For example, if we have these two RUNNABLE tasks on a CPU:
 Task A, capacity_min=200
 Task B, capacity_min=400
then we want to run the CPU at least at 40% capacity (400/1024) even
when A is running while B is enqueued.

Don't you think this makes sense?
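The aggregation just described can be sketched as follows; the helper
names and structure are hypothetical (this is not the posted patch
set's code), and the max-of-max rule for capacity_max follows the
examples elsewhere in the thread, where a task allowed to run fast
keeps the CPU uncapped while it is RUNNABLE:

```c
#define SCHED_CAPACITY_SCALE 1024	/* kernel capacity unit */

struct capacity_clamp { unsigned int min, max; };

/*
 * Effective CPU clamp across n RUNNABLE tasks (n > 0 assumed):
 * take the max of the min clamps and the max of the max clamps.
 */
static struct capacity_clamp
aggregate_clamps(const struct capacity_clamp *tasks, int n)
{
	struct capacity_clamp cpu = { 0, 0 };
	int i;

	for (i = 0; i < n; i++) {
		if (tasks[i].min > cpu.min)
			cpu.min = tasks[i].min;
		if (tasks[i].max > cpu.max)
			cpu.max = tasks[i].max;
	}
	return cpu;
}

/* clamp the CPU utilization before it is consumed, e.g. by schedutil */
static unsigned int clamp_util(unsigned int util, struct capacity_clamp c)
{
	if (util < c.min)
		return c.min;
	if (util > c.max)
		return c.max;
	return util;
}
```

With TaskA (capacity_min=200) and TaskB (capacity_min=400) both
RUNNABLE, the aggregated min is 400, so even a low utilization is
clamped up to ~40% capacity.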


.:: Implementation details

>  3) they have absolutely unacceptable overhead in implementation. Two
>     more RB tree operations per enqueue/dequeue is just not going to
>     happen.

This last point is about "implementation details"; I'm pretty
confident that if we find an agreement on the previous points then
this last one will be simple to solve.

Just to be clear, the rb-trees are per CPU and used to track just the
RUNNABLE tasks on each CPU and, as I described in the previous
example, for the OPP biasing to work I think we really need an
aggregation mechanism.

Ideally, every time a task is enqueued/dequeued on a CPU we want to
know what the currently required capacity clamping is. This requires
maintaining an ordered list of values, and rb-trees are the most
effective solution.

Perhaps, if we accept a certain level of approximation, we can
reduce the set of tracked values to a finite set (maybe
corresponding to the EM capacity values) and use just a per-CPU
ref-counting array.
Still, the min/max aggregation will have a worst-case O(N) search
complexity, where N is the number of finite values we want to use.
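A minimal sketch of that approximation, with hypothetical names (one
refcount bucket per tracked capacity value):

```c
/*
 * Illustrative per-CPU ref-counting array for a finite set of clamp
 * values (e.g. one slot per EM capacity state). Enqueue/dequeue are
 * O(1); finding the currently active max(min) is an O(N) scan over
 * the N buckets.
 */
#define NR_CLAMP_BUCKETS 5

struct clamp_bucket {
	unsigned int value;	/* capacity_min this bucket represents */
	unsigned int tasks;	/* RUNNABLE tasks requesting it */
};

static void clamp_enqueue(struct clamp_bucket *b, int idx)
{
	b[idx].tasks++;
}

static void clamp_dequeue(struct clamp_bucket *b, int idx)
{
	if (b[idx].tasks)
		b[idx].tasks--;
}

/* worst-case O(NR_CLAMP_BUCKETS) search for the active maximum */
static unsigned int clamp_min_current(const struct clamp_bucket *b)
{
	unsigned int max = 0;
	int i;

	for (i = 0; i < NR_CLAMP_BUCKETS; i++)
		if (b[i].tasks && b[i].value > max)
			max = b[i].value;
	return max;
}
```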

Do you think such a solution can work better?

Just for completeness, I've reported an overhead analysis for the
current implementation at the end; have a look only if the answer to
the previous question was negative ;-)

> So 'nice' is a fairly well understood knob; it controls relative time
> received between competing tasks (and we really should replace the cpu
> controllers 'shares' file with a 'nice' file, see below).

Agreed on that point, it's an interesting cleanup/consolidation;
however, I cannot see it fitting our goals...

> I would much rather introduce something like nice-opp, which only
> affects the OPP selection in a relative fashion. This immediately also
> has a natural and obvious extension to cgroup hierarchies (just like
> regular nice).

... affecting OPPs in a relative fashion sounds like a nice abstract
interface, but I'm not convinced it can do the job:

a) We should define what "relative" means.
   While time partitioning between competing tasks makes perfect
   sense, I struggle to see how we can boost a task relative to
   other co-scheduled tasks.
   Does it even make any sense?

b) Nice values have this 10% time delta per step, which will be odd
   to map onto OPP selection.
   What does it mean to boost a task 10% less than another?

c) Nice values have a well-defined discrete range [-20..19].
   Does it make sense to have such a range for OPP biasing?
   Doesn't using a 10% relative delta per step make this range too
   wide?
   For example, at nice-opp 5 we could probably already be at the
   minimum OPP.

d) How do we use nice-opp to bias task placement?
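As a rough illustration of points b) and c), assume each nice-opp
step scales capacity by the same ~1.25 factor used between
consecutive nice weights (a purely hypothetical mapping, not
anything proposed in this thread):

```c
/*
 * Hypothetical nice-opp -> capacity mapping: one ~1.25x step down
 * per positive nice-opp level, starting from full capacity (1024).
 */
static unsigned int capacity_for_nice_opp(int nice_opp)
{
	unsigned int cap = 1024;

	while (nice_opp-- > 0)
		cap = cap * 4 / 5;	/* 1/1.25 = 4/5 per step */
	return cap;
}
```

Under this mapping capacity is already down to roughly a third of
the scale at nice-opp 5, long before the bottom of a [-20..19] range
is reached, which is the concern raised in point c).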


> And could be folded as a factor in util_avg (just as we do with weight
> in load_avg), although this will mess up everything else we use util for
> :/.

Not if we use get() methods which are called from the code paths
where we are interested in getting the biased value.

> Or, alternatively, decompose the aggregate value upon usage and only
> apply the factor for the current running task or something.... Blergh..
> difficult..

Maybe just because we are trying to over-abstract a relatively
simple concept?

Still, I don't completely get the downside of using a simple,
well-defined and generic concept like "capacity"...


> ---
>  kernel/sched/core.c  | 29 +++++++++++++++++++++--------
>  kernel/sched/fair.c  |  2 +-
>  kernel/sched/sched.h |  1 +
>  3 files changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 27b4dd55b0c7..20ed17369cb1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6963,18 +6963,31 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
>  }
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
> -				struct cftype *cftype, u64 shareval)
> +static int cpu_nice_write_s64(struct cgroup_subsys_state *css,
> +				struct cftype *cftype, s64 val)
>  {
> -	return sched_group_set_shares(css_tg(css), scale_load(shareval));
> +	struct task_group *tg = css_tg(css);
> +	unsigned long weight;
> +
> +	int ret;
> +
> +	if (val < MIN_NICE || val > MAX_NICE)
> +		return -EINVAL;
> +
> +	weight = sched_prio_to_weight[val - MIN_NICE];
> +
> +	ret = sched_group_set_shares(tg, scale_load(weight));
> +
> +	if (!ret)
> +		tg->nice = val;
>  }
> 
> -static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
> +static u64 cpu_shares_read_s64(struct cgroup_subsys_state *css,
                 ^^^^^^^
you mean:     cpu_nice_read_s64
>  			       struct cftype *cft)
>  {
>  	struct task_group *tg = css_tg(css);
> 
> -	return (u64) scale_load_down(tg->shares);
> +	return (s64)tg->nice;
>  }
> 
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -7252,9 +7265,9 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  static struct cftype cpu_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	{
> -		.name = "shares",
> -		.read_u64 = cpu_shares_read_u64,
> -		.write_u64 = cpu_shares_write_u64,
> +		.name = "nice",
> +		.read_s64 = cpu_nice_read_s64,
> +		.write_s64 = cpu_nice_write_s64,

However, isn't this going to break a user-space API?

>  	},
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76f67b3e34d6..8088043f46d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9471,7 +9471,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>  	if (!tg->se[0])
>  		return -EINVAL;
> 
> -	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
> +	shares = clamp(shares, MIN_SHARES, scale_load(MAX_SHARES));
                               ^^^^^^^^^^

Why do you remove the scaling here?

> 
>  	mutex_lock(&shares_mutex);
>  	if (tg->shares == shares)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 57caf3606114..27f1ffe763bc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -283,6 +283,7 @@ struct task_group {
>  	/* runqueue "owned" by this group on each cpu */
>  	struct cfs_rq **cfs_rq;
>  	unsigned long shares;
> +	int nice;
> 
>  #ifdef	CONFIG_SMP
>  	/*


.:: Overhead analysis for the current solution

Here are the stats on completion times I've got for 30 runs of:
   perf bench sched messaging --pipe --thread --group 1 --loop 5000
with tasks pinned in the two big CPUs of a Juno2 board and using the
performance governor:

        count   mean    std    min    25%    50%    75%   max
with:    30.0  4.666  0.164  4.228  4.585  4.649  4.781  4.95
without: 30.0  4.830  0.178  4.444  4.751  4.813  4.951  5.14

The average slowdown is ~3.5%, which we can easily agree is not
negligible.

However, a couple of considerations are still worth noting:

1) this is certainly not the target use-case; the proposed API is
   not targeting server and/or compute-intensive workloads. In those
   cases this support should not be enabled at kernel compilation
   time. We are mainly targeting low-utilization and
   latency-sensitive workloads.

2) the current implementation can perhaps be optimized by disabling
   it at run-time under certain working conditions, e.g. using
   something similar to what we do for EAS once we cross the
   tipping point and the system is marked as over-utilized.

3) there is a run-time overhead, of course, but if it brings
   energy/performance benefits for certain low-utilization workloads
   it's still worth paying.



-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-11 17:58       ` Patrick Bellasi
@ 2017-04-12 12:10         ` Peter Zijlstra
  2017-04-12 13:55           ` Patrick Bellasi
  2017-04-12 12:15         ` Peter Zijlstra
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 12:10 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

Let me reply in parts as I read this.. easy things first :-)

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> On 10-Apr 09:36, Peter Zijlstra wrote:

> >  4) they have muddled semantics, because while its presented as a task
> >     property, it very much is not.
> 
> Actually we always presented it as a task group property, while other
> people suggested it should be a per-task API.
> 
> Still, IMO it could make sense also as a per-task API, for example
> considering a specific RT task which we know in our system is
> perfectly fine to always run it below a certain OPP.
> 
> Do you think it should be a per-task API?
> Should we focus (at least initially) on providing a per task-group API?

Even for the cgroup interface, I think they should set a per-task
property, not a group property.

> >  3) they have absolutely unacceptable overhead in implementation. Two
> >     more RB tree operations per enqueue/dequeue is just not going to
> >     happen.
> 
> This last point is about "implementation details", I'm pretty
> confident that if we find an agreement on the previous point than
> this last will be simple to solve.
> 
> Just to be clear, the rb-trees are per CPU and used to track just the
> RUNNABLE tasks on each CPUs and, as I described in the previous
> example, for the OPP biasing to work I think we really need an
> aggregation mechanism.

I know it's runnable, which is exactly what the regular RB tree in fair
already tracks. That gets us 3 RB tree operations per scheduling
operation, which is entirely ridiculous.

And while I disagree with the need to track this at all, see below, it
can be done _much_ cheaper using a double augmented RB-tree, where you
keep the min/max as heaps inside the regular RB-tree.

> Ideally, every time a task is enqueue/dequeued from a CPU we want to
> know what is the currently required capacity clamping. This requires
> to maintain an ordered list of values and rb-trees are the most effective
> solution.
> 
> Perhaps, if we accept a certain level of approximation, we can
> potentially reduce the set of tracked values to a finite set (maybe
> corresponding to the EM capacity values) and use just a per-cpu
> ref-counting array.
> Still the min/max aggregation will have a worst case O(N) search
> complexity, where N is the number of finite values we want to use.

So the bigger point is that if the min/max is a per-task property (even
if set through a cgroup interface), the min(max) / max(min) thing is
wrong.

If the min/max were to apply to each individual task's util, you'd end
up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].

Where a possible approximation is scaling the aggregate util into that
domain.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-11 17:58       ` Patrick Bellasi
  2017-04-12 12:10         ` Peter Zijlstra
@ 2017-04-12 12:15         ` Peter Zijlstra
  2017-04-12 13:34           ` Patrick Bellasi
  2017-04-12 12:22         ` Peter Zijlstra
  2017-04-12 12:48         ` Peter Zijlstra
  3 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 12:15 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> We should consider also that at the CPUFreq side we already expose
> knobs like scaling_{min,max}_freq which are much more platform
> dependant than capacity.

So I've always objected to these knobs.

That said; they are a pre-existing user interface so changing them isn't
really feasible much.

But at the very least we should integrate them properly. Which for
schedutil would mean they have to directly modify the relevant CPU
capacity values. Is this currently done? (I think not.)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-11 17:58       ` Patrick Bellasi
  2017-04-12 12:10         ` Peter Zijlstra
  2017-04-12 12:15         ` Peter Zijlstra
@ 2017-04-12 12:22         ` Peter Zijlstra
  2017-04-12 13:24           ` Patrick Bellasi
  2017-04-12 12:48         ` Peter Zijlstra
  3 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 12:22 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> Sorry, I don't get instead what are the "confusing nesting properties"
> you are referring to?

If a parent group sets min=.2 and max=.8, what are the constraints on
its child groups for setting their resp min and max?

I can't immediately give rules that would make sense.

For instance, allowing a child to lower min would violate the parent
constraint, while allowing a child to increase min would grant the child
more resources than the parent.

Neither seem like a good thing.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-11 17:58       ` Patrick Bellasi
                           ` (2 preceding siblings ...)
  2017-04-12 12:22         ` Peter Zijlstra
@ 2017-04-12 12:48         ` Peter Zijlstra
  2017-04-12 13:27           ` Patrick Bellasi
  3 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 12:48 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> >     illustrated per your above points in that it affects both, while in
> >     fact it actually modifies another metric, namely util_avg.
> 
> I don't see it modifying in any direct way util_avg.

The point is that clamps called 'capacity' are applied to util. So while
you don't modify util directly, you do modify the util signal (for one
consumer).

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 12:22         ` Peter Zijlstra
@ 2017-04-12 13:24           ` Patrick Bellasi
  0 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-12 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On 12-Apr 14:22, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > Sorry, I don't get instead what are the "confusing nesting properties"
> > you are referring to?
> 
> If a parent group sets min=.2 and max=.8, what are the constraints on
> its child groups for setting their resp min and max?

Currently the logic I'm proposing enforces this:

a) capacity_max can only be reduced,
   because we accept that a child can be further constrained;
   for example:
   - a resource manager allocates a max capacity to an application
   - the application itself knows that some of its children are
     background tasks and can be further constrained

b) capacity_min can only be increased,
   because we want to inhibit children from affecting overall
   performance; for example:
   - a resource manager allocates a minimum capacity to an application
   - the application itself cannot slow down some of its children
     without risking affecting other (unknown) external entities
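The two rules above can be expressed as a simple validity check on a
child's (min,max) pair against its parent's; names are illustrative,
not the patch set's code:

```c
struct capacity_clamp { unsigned int min, max; };

/*
 * A child group may only tighten its parent's clamps:
 * capacity_max can only be reduced (rule a), capacity_min can only
 * be increased (rule b), and the pair must stay well formed.
 */
static int clamp_nesting_ok(const struct capacity_clamp *parent,
			    const struct capacity_clamp *child)
{
	if (child->max > parent->max)
		return 0;	/* rule a violated */
	if (child->min < parent->min)
		return 0;	/* rule b violated */
	return child->min <= child->max;
}
```

Note that under these rules a child's min may rise all the way up to
the parent's max.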

> I can't immediately gives rules that would make sense.

The second rule is more tricky, but I see it matching better an
overall decomposition scheme where a single resource manager
allocates a capacity_min to two different entities (A and B) which
are independent but (only it knows) are also cooperating.

Let's think about the Android run-time, which allocates resources to
a system service (entity A) which it knows has to interact with a
certain app (entity B).

The cooperation dependency can be resolved only by the resource
manager, by assigning capacity_min at entity-level CGroups.
Thus, an entity's subgroups should not be allowed to further reduce
this constraint without risking impacting an (unknown to them)
external entity.

> For instance, allowing a child to lower min would violate the parent
> constraint,

Quite likely we don't want this.

> while allowing a child to increase min would grant the child
> more resources than the parent.

But still within the capacity_max enforced by the parent.

We should always consider the pair (min,max); once a parent has
defined this range, to me it seems ok that children can freely play
within that range.

Why should a child group not be allowed to set:

   capacity_min_child = capacity_max_parent

?


> Neither seem like a good thing.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 12:48         ` Peter Zijlstra
@ 2017-04-12 13:27           ` Patrick Bellasi
  2017-04-12 14:34             ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-12 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On 12-Apr 14:48, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > >     illustrated per your above points in that it affects both, while in
> > >     fact it actually modifies another metric, namely util_avg.
> > 
> > I don't see it modifying in any direct way util_avg.
> 
> The point is that clamps called 'capacity' are applied to util. So while
> you don't modify util directly, you do modify the util signal (for one
> consumer).

Right, but this consumer (i.e. schedutil) is already translating
util_avg into a next_freq (which ultimately is a capacity).

Thus, I don't see a big misfit in "filtering" this translation with a
capacity clamp in that code path.
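Concretely, schedutil's translation (cf. get_next_freq(): next_freq =
1.25 * max_freq * util / max) could be "filtered" as sketched below;
the clamp step is the proposal's addition, and the function shape is
a simplified illustration rather than the real kernel code:

```c
#define SCHED_CAPACITY_SCALE 1024

static unsigned int next_freq(unsigned int util, unsigned int max_cap,
			      unsigned int max_freq,
			      unsigned int cap_min, unsigned int cap_max)
{
	/* proposed addition: clamp utilization before the translation */
	if (util < cap_min)
		util = cap_min;
	if (util > cap_max)
		util = cap_max;

	/* schedutil-style: 1.25 * max_freq * util / max_cap */
	return (max_freq + (max_freq >> 2)) * util / max_cap;
}
```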

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 12:15         ` Peter Zijlstra
@ 2017-04-12 13:34           ` Patrick Bellasi
  2017-04-12 14:41             ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-12 13:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On 12-Apr 14:15, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > We should consider also that at the CPUFreq side we already expose
> > knobs like scaling_{min,max}_freq which are much more platform
> > dependant than capacity.
> 
> So I've always objected to these knobs.
> 
> That said; they are a pre-existing user interface so changing them isn't
> really feasible much.
> 
> But at the very least we should integrate them properly. Which for
> schedutil would mean they have to directly modify the relevant CPU
> capacity values. Is this currently done? (I think not.)

AFAICS they are clamping the policy decisions, thus the frequency
domain... which can be more than one CPU on ARM platforms.

When you say you would like them to "directly modify the relevant CPU
capacity values", I really see this as exactly what we do with
capacity_{min,max}.

These two capacity clamping values are per task, and thus (by
definition) per CPU. Moreover, they have the interesting property of
being "aggregated" in such a way that, in this configuration:

  TaskA: capacity_max:  20%
  TaskB: capacity_max: 100%

when both tasks are RUNNABLE on the same CPU, that CPU is not capped
until TaskB leaves the CPU.
Do you think such a kind of feature can be useful?

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 12:10         ` Peter Zijlstra
@ 2017-04-12 13:55           ` Patrick Bellasi
  2017-04-12 15:37             ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-12 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On 12-Apr 14:10, Peter Zijlstra wrote:
> Let me reply in parts as I read this.. easy things first :-)
> 
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > On 10-Apr 09:36, Peter Zijlstra wrote:
> 
> > >  4) they have muddled semantics, because while its presented as a task
> > >     property, it very much is not.
> > 
> > Actually we always presented it as a task group property, while other
> > people suggested it should be a per-task API.
> > 
> > Still, IMO it could make sense also as a per-task API, for example
> > considering a specific RT task which we know in our system is
> > perfectly fine to always run it below a certain OPP.
> > 
> > Do you think it should be a per-task API?
> > Should we focus (at least initially) on providing a per task-group API?
> 
> Even for the cgroup interface, I think they should set a per-task
> property, not a group property.

Ok, right now, using CGroups as the primary (and only) interface,
these values are tracked as attributes of the CPU controller. Tasks
get them by reading these attributes once they are bound to a CGroup.

Are you proposing to move these attributes into the task_struct?

In that case we should also define a primary interface to set them;
any preferred proposal? sched_setattr(), prctl()?

> > >  3) they have absolutely unacceptable overhead in implementation. Two
> > >     more RB tree operations per enqueue/dequeue is just not going to
> > >     happen.
> > 
> > This last point is about "implementation details", I'm pretty
> > confident that if we find an agreement on the previous point than
> > this last will be simple to solve.
> > 
> > Just to be clear, the rb-trees are per CPU and used to track just the
> > RUNNABLE tasks on each CPUs and, as I described in the previous
> > example, for the OPP biasing to work I think we really need an
> > aggregation mechanism.
> 
> I know its runnable, which is exactly what the regular RB tree in fair
> already tracks. That gets us 3 RB tree operations per scheduling
> operation, which is entirely ridiculous.
> 
> And while I disagree with the need to track this at all, see below, it
> can be done _much_ cheaper using a double augmented RB-tree, where you
> keep the min/max as heaps inside the regular RB-tree.

By regular rb-tree do you mean the cfs_rq->tasks_timeline?

Because in that case this would apply only to the FAIR class, while
the rb-trees we are using here work across classes.
Supporting both FAIR and RT is, I think, a feature worth having.

> > Ideally, every time a task is enqueue/dequeued from a CPU we want to
> > know what is the currently required capacity clamping. This requires
> > to maintain an ordered list of values and rb-trees are the most effective
> > solution.
> > 
> > Perhaps, if we accept a certain level of approximation, we can
> > potentially reduce the set of tracked values to a finite set (maybe
> > corresponding to the EM capacity values) and use just a per-cpu
> > ref-counting array.
> > Still the min/max aggregation will have a worst case O(N) search
> > complexity, where N is the number of finite values we want to use.
> 
> So the bigger point is that if the min/max is a per-task property (even
> if set through a cgroup interface), the min(max) / max(min) thing is
> wrong.

Perhaps I'm not following you here but, being per-task does not mean
that we need to aggregate these constraints by summing them (see
below)...

> If the min/max were to apply to each individual task's util, you'd end
> up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].

... as you do here.

Let's use the usual simple example, where these per-task constraints
are configured:
- TaskA: capacity_min: 20%, capacity_max: 80%
- TaskB: capacity_min: 40%, capacity_max: 60%

This means that, at the CPU level, we want to enforce the following
clamping depending on the tasks' status:

   RUNNABLE tasks    capacity_min    capacity_max
A) TaskA                      20%             80%
B) TaskA,TaskB                40%             80%
C) TaskB                      40%             60%

In case B, TaskA gets a bigger boost while it is co-scheduled with
TaskB.

Notice that this CPU-level aggregation is used just for OPP selection
on that CPU, while for TaskA we still use capacity_min=20% when we
are looking for a CPU.
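To make the disagreement concrete, here is a sketch contrasting the
proposed per-CPU aggregation with a per-task sum in the spirit of the
quoted Dom(\Sum util) formula, using 1024-based capacity units
(20% ~ 205, 40% ~ 410, 60% ~ 614, 80% ~ 819); purely illustrative,
with hypothetical names:

```c
#define SCHED_CAPACITY_SCALE 1024

struct clamp { unsigned int min, max; };

/* proposed: max of mins, max of maxes (case B of the table above) */
static struct clamp agg_proposed(struct clamp a, struct clamp b)
{
	struct clamp r;

	r.min = a.min > b.min ? a.min : b.min;
	r.max = a.max > b.max ? a.max : b.max;
	return r;
}

/* alternative: sum the per-task clamps, saturating at full scale */
static struct clamp agg_sum(struct clamp a, struct clamp b)
{
	struct clamp r;

	r.min = a.min + b.min;
	r.max = a.max + b.max;
	if (r.min > SCHED_CAPACITY_SCALE)
		r.min = SCHED_CAPACITY_SCALE;
	if (r.max > SCHED_CAPACITY_SCALE)
		r.max = SCHED_CAPACITY_SCALE;
	return r;
}
```

The sum quickly saturates at full scale with just two tasks, which is
why I think the max/min style aggregation fits better here.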

> Where a possible approximation is scaling the aggregate util into that
> domain.
> 

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 13:27           ` Patrick Bellasi
@ 2017-04-12 14:34             ` Peter Zijlstra
  2017-04-12 14:43               ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 14:34 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:48, Peter Zijlstra wrote:
> > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > >     illustrated per your above points in that it affects both, while in
> > > >     fact it actually modifies another metric, namely util_avg.
> > > 
> > > I don't see it modifying in any direct way util_avg.
> > 
> > The point is that clamps called 'capacity' are applied to util. So while
> > you don't modify util directly, you do modify the util signal (for one
> > consumer).
> 
> Right, but this consumer (i.e. schedutil) it's already translating
> the util_avg into a next_freq (which ultimately it's a capacity).
> 
> Thus, I don't see a big misfit in that code path to "filter" this
> translation with a capacity clamp.

Still strikes me as odd though.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 13:34           ` Patrick Bellasi
@ 2017-04-12 14:41             ` Peter Zijlstra
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 14:41 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On Wed, Apr 12, 2017 at 02:34:48PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:15, Peter Zijlstra wrote:
> > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > We should consider also that at the CPUFreq side we already expose
> > > knobs like scaling_{min,max}_freq which are much more platform
> > > dependant than capacity.
> > 
> > So I've always objected to these knobs.
> > 
> > That said; they are a pre-existing user interface so changing them isn't
> > really feasible much.
> > 
> > But at the very least we should integrate them properly. Which for
> > schedutil would mean they have to directly modify the relevant CPU
> > capacity values. Is this currently done? (I think not.)
> 
> AFAICS they are clamping the policy decisions, thus the frequency
> domain... which can be more than one CPU on ARM platforms.

Right, knew that. Must've missed the 's' when typing ;-)

> When you say you would like them to "directly modify the relevant CPU
> capacity values" I really see this as exactly what we do with
> capacity_{min,max}.

What I mean is that when we set max as lower than max-freq, it should
reduce the CPU(s)' capacity. That's something entirely different from what
you're attempting (and not something we do currently afaik).
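In code, the idea would look something like this (illustrative only, not what mainline does today; the names are made up for the sketch):

```c
#include <assert.h>

/*
 * Illustrative sketch: if userspace caps the policy below the hardware
 * maximum frequency, the CPU's usable capacity should shrink in the
 * same proportion.
 */
unsigned long capped_capacity(unsigned long capacity_orig,
			      unsigned long policy_max_freq,
			      unsigned long cpuinfo_max_freq)
{
	return capacity_orig * policy_max_freq / cpuinfo_max_freq;
}
```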

Also; there's as yet unexplored interaction between these knobs and the
RT bandwidth limits.

But the main point is that these knobs are system wide and do not affect
tasks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 14:34             ` Peter Zijlstra
@ 2017-04-12 14:43               ` Patrick Bellasi
  2017-04-12 16:14                 ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-12 14:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On 12-Apr 16:34, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > >     illustrated per your above points in that it affects both, while in
> > > > >     fact it actually modifies another metric, namely util_avg.
> > > > 
> > > > I don't see it modifying in any direct way util_avg.
> > > 
> > > The point is that clamps called 'capacity' are applied to util. So while
> > > you don't modify util directly, you do modify the util signal (for one
> > > consumer).
> > 
> > Right, but this consumer (i.e. schedutil) it's already translating
> > the util_avg into a next_freq (which ultimately it's a capacity).
> > 
> > Thus, I don't see a big misfit in that code path to "filter" this
> > translation with a capacity clamp.
> 
> Still strikes me as odd though.

Can you elaborate a bit more on the why?

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 13:55           ` Patrick Bellasi
@ 2017-04-12 15:37             ` Peter Zijlstra
  2017-04-13 11:33               ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 15:37 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On Wed, Apr 12, 2017 at 02:55:38PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:10, Peter Zijlstra wrote:

> > Even for the cgroup interface, I think they should set a per-task
> > property, not a group property.
> 
> Ok, right now using CGroups as the primary (and unique) interface, these
> values are tracked as attributes of the CPU controller. Tasks get
> them by reading these attributes once they are bound to a CGroup.
> 
> Are you proposing to move these attributes within the task_struct?

/me goes look at your patches again, because I thought you already set
per task_struct values.

@@ -1531,6 +1531,9 @@ struct task_struct {
        struct sched_rt_entity rt;
 #ifdef CONFIG_CGROUP_SCHED
        struct task_group *sched_task_group;
+#ifdef CONFIG_CAPACITY_CLAMPING
+       struct rb_node cap_clamp_node[2];
+#endif

Yeah, see...

> In that case we should also defined a primary interface to set them,
> any preferred proposal? sched_setattr(), prctl?

We could, which I think is the important point.

> By regular rb-tree do you mean the cfs_rq->tasks_timeline?

Yep.

> Because in that case this would apply only to the FAIR class, while
> the rb-tree we are using here are across classes.
> Supporting both FAIR and RT is, I think, a feature worth having.

*groan* I don't want to even start thinking what this feature means in
the context of RT, head hurts enough.

> > So the bigger point is that if the min/max is a per-task property (even
> > if set through a cgroup interface), the min(max) / max(min) thing is
> > wrong.
> 
> Perhaps I'm not following you here but, being per-task does not mean
> that we need to aggregate these constraints by summing them (look
> below)...
>
> > If the min/max were to apply to each individual task's util, you'd end
> > up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].
> 
> ... as you do here.
> 
> Let's use the usual simple example, where these per-tasks constraints
> are configured:
>
> - TaskA: capacity_min: 20% capacity_max: 80%
> - TaskB: capacity_min: 40% capacity_max: 60%
> 
> This means that, at CPU level, we want to enforce the following
> clamping depending on the tasks status:
> 
>  RUNNABLE tasks    capacity_min    capacity_max
> A) TaskA                      20%             80%
> B) TaskA,TaskB                40%             80%
> C) TaskB                      40%             60%
>  
> In case C, TaskA gets a bigger boost while is co-scheduled with TaskB.

(bit unfortunate you gave your cases and tasks the same enumeration)

But this I quite strongly feel is wrong. If you've given your tasks a
minimum OPP, you've in fact given them a minimum bandwidth, for at a
given frequency you can say how long they'll run, right?

So if you want to maintain that case B should be 60%. Once one of the
tasks completes it will drop again. That is, the increased value
represents the additional runnable 'load' over the min from the
currently running task. Combined they will still complete in reduced
time.

> Notice that this CPU-level aggregation is used just for OPP selection
> on that CPU, while for TaskA we still use capacity_min=20% when we are
> looking for a CPU.

And you don't find that inconsistent?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 14:43               ` Patrick Bellasi
@ 2017-04-12 16:14                 ` Peter Zijlstra
  2017-04-13 10:34                   ` Patrick Bellasi
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2017-04-12 16:14 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On Wed, Apr 12, 2017 at 03:43:10PM +0100, Patrick Bellasi wrote:
> On 12-Apr 16:34, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > > >     illustrated per your above points in that it affects both, while in
> > > > > >     fact it actually modifies another metric, namely util_avg.
> > > > > 
> > > > > I don't see it modifying in any direct way util_avg.
> > > > 
> > > > The point is that clamps called 'capacity' are applied to util. So while
> > > > you don't modify util directly, you do modify the util signal (for one
> > > > consumer).
> > > 
> > > Right, but this consumer (i.e. schedutil) it's already translating
> > > the util_avg into a next_freq (which ultimately it's a capacity).
> > > 
> > > Thus, I don't see a big misfit in that code path to "filter" this
> > > translation with a capacity clamp.
> > 
> > Still strikes me as odd though.
> 
> Can you better elaborate on they why?

Because capacity is, as you pointed out earlier, a relative measure of
inter CPU performance (which isn't otherwise exposed to userspace
afaik).

While the utilization thing is a per task running signal.

There is no direct relation between the two.

The two main uses for the util signal are:

  OPP selection: the aggregate util of all runnable tasks for a
    particular CPU is used to select an OPP for said CPU [*], against
    whatever max-freq that CPU has. Capacity doesn't really come into play
    here.

  Task placement: capacity comes into play in so far that we want to
    make sure our task fits.


And I'm not at all sure we want to have both uses of our utilization
controlled by the one knob. They're quite distinct.


[*] yeah, I know clock domains with multiple CPUs in etc.. lets keep
this simple ;-)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 16:14                 ` Peter Zijlstra
@ 2017-04-13 10:34                   ` Patrick Bellasi
  0 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-13 10:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On 12-Apr 18:14, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 03:43:10PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 16:34, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > > > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > > > >     illustrated per your above points in that it affects both, while in
> > > > > > >     fact it actually modifies another metric, namely util_avg.
> > > > > > 
> > > > > > I don't see it modifying in any direct way util_avg.
> > > > > 
> > > > > The point is that clamps called 'capacity' are applied to util. So while
> > > > > you don't modify util directly, you do modify the util signal (for one
> > > > > consumer).
> > > > 
> > > > Right, but this consumer (i.e. schedutil) it's already translating
> > > > the util_avg into a next_freq (which ultimately it's a capacity).

                                                               ^^^^^^^^
                                                                [REF1]

> > > > 
> > > > Thus, I don't see a big misfit in that code path to "filter" this
> > > > translation with a capacity clamp.
> > > 
> > > Still strikes me as odd though.
> > 
> > Can you better elaborate on they why?
> 
> Because capacity is, as you pointed out earlier, a relative measure of
> inter CPU performance (which isn't otherwise exposed to userspace
> afaik).

Perhaps, since I'm biased by EAS concepts which are still not
mainline, I was not clear in specifying what I meant by "capacity" in
[REF1].

My fault, sorry; perhaps it's worth starting by reviewing some
concepts to see if we can establish a common language.


.:: Mainline

If we look at mainline, "capacity" is actually a concept used to
represent the computational bandwidth available in a CPU, when running
at the highest OPP (let's consider SMP systems to keep it simple).

But things are already a bit more complicated. Specifically, looking
at update_cpu_capacity(), we distinguish between:

- cpu_rq(cpu)->cpu_capacity_orig
  which is the bandwidth available at the max OPP.

- cpu_rq(cpu)->cpu_capacity
  which discounts from the previous metrics the "average" bandwidth used
  by RT tasks, but not (yet) DEADLINE tasks afaics.

Thus, "capacity" is already a polymorphic concept:
  we use cpu_capacity_orig to cap the cpu utilization of CFS tasks
  in cpu_util()
but
  this cpu utilization is a signal which converges to "current capacity"
  in  ___update_load_avg()

The "current capacity" (capacity_curr, but just in some comments) is actually
the computational bandwidth available at a certain OPP.

Thus, we already have in mainline a concepts of capacity which refers to the
bandwidth available in a certain OPP. The "current capacity" is what we
ultimately use to scale PELT depending on the current OPP.
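In code, the scaling works roughly like this (a simplified sketch of the frequency-invariance idea, not the actual ___update_load_avg() implementation; the helper name is made up):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Sketch: each chunk of running time accrued by PELT is scaled by the
 * "current capacity" of the OPP the CPU is running at, so util_avg
 * converges toward capacity_curr instead of 100%.
 */
unsigned long scale_running_delta(unsigned long delta,
				  unsigned long capacity_curr)
{
	return delta * capacity_curr / SCHED_CAPACITY_SCALE;
}
```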


.:: EAS

Looking at EAS, and specifically the energy model, we describe each
OPP using a:

  struct capacity_state {
          unsigned long cap;      /* compute capacity */
          unsigned long power;    /* power consumption at this compute capacity */
  };

Where again we find a usage of the "current capacity", i.e. the
computational bandwidth available at each OPP.


.:: Current Capacity

In [REF1] I was referring to the concept of "current capacity", which is what
schedutil is after. There we need to translate cfs.avg.util_avg into an OPP,
which ultimately is a suitable level of "current capacity" to satisfy the
CPU bandwidth requested by CFS tasks.

> While the utilization thing is a per task running signal.

Which still is converging to the "current capacity", at least before
Vincent's patches.

> There is no direct relation between the two.

Given the previous definitions, can we say that there is a relation between task
utilization and "current capacity"?

     Sum(task_utilization) = cpu_utilization
            <= "current capacity" (cpufreq_schedutil::get_next_freq())    [1]
            <= cpu_capacity_orig

> The two main uses for the util signal are:
> 
>   OPP selection: the aggregate util of all runnable tasks for a
>     particular CPU is used to select an OPP for said CPU [*], against
>     whatever max-freq that CPU has. Capacity doesn't really come into play
>     here.

The OPP selected has to provide a suitable amount of "current capacity" to
accommodate the required utilization.

>   Task placement: capacity comes into play in so far that we want to
>     make sure our task fits.

These two usages are not completely independent, at least when EAS is
in use. In EAS we can evaluate/compare scenarios like:

  "should I increase the capacity of CPUx or wakeup CPUy"

Thus, we use capacity indexes to estimate energy deltas by
moving a task and, by consequence, changing a CPU's OPP.

Which means: expected "capacity" variations affect OPP selection.

> And I'm not at all sure we want to have both uses of our utilization
> controlled by the one knob. They're quite distinct.

The proposed knobs, for example capacity_min, are used to clamp the
scheduler/schedutil view on what is the required "current capacity" by
modifying the previous relation [1] to be:

                 Sum(task_utilization) = cpu_utilization
            clamp(cpu_utilization, capacity_min, capacity_max)
                        <= "current capacity"
                        <= cpu_capacity_orig

In [1] we already have a transformation from the cpu_utilization
domain to the "current capacity" domain. Here we are just adding a
clamping filter around that transformation.
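As a sketch (hypothetical helper names, loosely mirroring schedutil's get_next_freq() formula freq = 1.25 * max_freq * util / max), the clamp is just a filter applied to the utilization before the translation:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

static unsigned long clamp_util(unsigned long util,
				unsigned long cap_min, unsigned long cap_max)
{
	return util < cap_min ? cap_min : (util > cap_max ? cap_max : util);
}

/* Sketch of relation [1] with the clamping filter applied. */
unsigned long next_freq(unsigned long util, unsigned long max_freq,
			unsigned long cap_min, unsigned long cap_max)
{
	util = clamp_util(util, cap_min, cap_max);
	/* get_next_freq(): 1.25 * max_freq * util / max */
	return (max_freq + (max_freq >> 2)) * util / SCHED_CAPACITY_SCALE;
}
```

For example, a small utilization clamped up by capacity_min still yields the frequency corresponding to the boosted value.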

I hope this is useful to find some common ground; perhaps the naming
capacity_{min,max} is unfortunate and we can find a better one.
However, we should first agree on the utility of the proposed
clamping concept... ;-)

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC v3 0/5] Add capacity capping support to the CPU controller
  2017-04-12 15:37             ` Peter Zijlstra
@ 2017-04-13 11:33               ` Patrick Bellasi
  0 siblings, 0 replies; 66+ messages in thread
From: Patrick Bellasi @ 2017-04-13 11:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, linux-pm, Ingo Molnar,
	Rafael J . Wysocki, Paul Turner, Vincent Guittot, John Stultz,
	Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
	Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann

On 12-Apr 17:37, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:55:38PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 14:10, Peter Zijlstra wrote:
> 
> > > Even for the cgroup interface, I think they should set a per-task
> > > property, not a group property.
> > 
> > Ok, right now using CGroups ans primary (and unique) interface, these
> > values are tracked as attributes of the CPU controller. Tasks gets
> > them by reading these attributes once they are binded to a CGroup.
> > 
> > Are you proposing to move these attributes within the task_struct?
> 
> /me goes look at your patches again, because I thought you already set
> per task_struct values.
> 
> @@ -1531,6 +1531,9 @@ struct task_struct {
>         struct sched_rt_entity rt;
>  #ifdef CONFIG_CGROUP_SCHED
>         struct task_group *sched_task_group;
> +#ifdef CONFIG_CAPACITY_CLAMPING
> +       struct rb_node cap_clamp_node[2];
> +#endif
> 
> Yeah, see...

Well, these are not the actual attributes.

These rb_nodes are used to sort the tasks based on their constraints, but the
actual attribute values are stored in:

@@ -273,6 +273,14 @@ struct task_group {
 #endif
 #endif
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+#define CAP_CLAMP_MIN 0
+#define CAP_CLAMP_MAX 1
+
+       /* Min and Max capacity constraints for tasks in this group */
+       unsigned int cap_clamp[2];
+#endif
+

This is done to avoid replicating the information in each task's structure.
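To make the layout concrete, here is a toy model of the above (just the data layout, none of the rb-tree machinery; names prefixed "toy_" to stress it is a sketch):

```c
#include <assert.h>

#define CAP_CLAMP_MIN 0
#define CAP_CLAMP_MAX 1

/* Clamp values live in the group; tasks only reference the group, so
 * the attributes are shared rather than replicated per task. */
struct toy_task_group {
	unsigned int cap_clamp[2];
};

struct toy_task {
	struct toy_task_group *sched_task_group;
};

unsigned int toy_task_cap(const struct toy_task *p, int which)
{
	return p->sched_task_group->cap_clamp[which];
}
```

Updating the group's attribute is immediately visible to every task bound to it.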

> > In that case we should also defined a primary interface to set them,
> > any preferred proposal? sched_setattr(), prctl?
> 
> We could, which I think is the important point.
> 
> > By regular rb-tree do you mean the cfs_rq->tasks_timeline?
> 
> Yep.
> 
> > Because in that case this would apply only to the FAIR class, while
> > the rb-tree we are using here are across classes.
> > Supporting both FAIR and RT is, I think, a feature worth having.
> 
> *groan* I don't want to even start thinking what this feature means in
> the context of RT, head hurts enough.

:-)

Still, mobile people *groan* when we go to max OPP every time an RT task runs.
Here you can find some energy numbers I've got recently on Pixel phones:
   https://lkml.org/lkml/2017/3/17/214
7%-54% (useless) more energy is a big deal.

Of course, there can be many different solutions to this problem, but
capacity_max allows us to easily clamp the frequency used when certain RT
tasks run, while still keeping them within the expected latency bounds.

> > > So the bigger point is that if the min/max is a per-task property (even
> > > if set through a cgroup interface), the min(max) / max(min) thing is
> > > wrong.
> > 
> > Perhaps I'm not following you here but, being per-task does not mean
> > that we need to aggregate these constraints by summing them (look
> > below)...
> >
> > > If the min/max were to apply to each individual task's util, you'd end
> > > up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].
> > 
> > ... as you do here.
> > 
> > Let's use the usual simple example, where these per-tasks constraints
> > are configured:
> >
> > - TaskA: capacity_min: 20% capacity_max: 80%
> > - TaskB: capacity_min: 40% capacity_max: 60%
> > 
> > This means that, at CPU level, we want to enforce the following
> > clamping depending on the tasks status:
> > 
> >  RUNNABLE tasks    capacity_min    capacity_max
> > A) TaskA                      20%             80%
> > B) TaskA,TaskB                40%             80%
> > C) TaskB                      40%             60%
> >  
> > In case C, TaskA gets a bigger boost while is co-scheduled with TaskB.
> 
> (bit unfortunate you gave your cases and tasks the same enumeration)
> 
> But this I quite strongly feel is wrong. If you've given your tasks a
> minimum OPP, you've in fact given them a minimum bandwidth, for at a
> given frequency you can say how long they'll run, right?

Not really, we are still in the domain of a best-effort solution, and
I think we should stick with that.

The overall idea is not about allocating bandwidth at all, but about
expressing preferences, and there are two main reasons:

1) in principle we don't know how long a CFS task will run.
   we just know that the faster it completes, the better

   Think about a task which is relatively small but functional to
   trigger further processing on an external device (e.g. a GPU).
   In this case the task is part of a critical path, and the sooner it
   finishes the better it is.
   It can be the case that allocating bandwidth for such a task is not
   easy, e.g. because the amount of processing the task does can
   change significantly between activations.

   In this case you have two options:
   a) measure/estimate the WCET and go for over-budgeting
      likely using DEADLINE
   b) find the minimum capacity which allows your task to complete
      reasonably fast most of the time

   The second is of course a best-effort approach; still, I find it
   useful to have, and it can easily be adapted at run-time to
   express a sort of power-vs-performance trade-off.

2) if you really want a guaranteed bandwidth, you quite likely also want
   a guaranteed deadline... and you should go for DEADLINE.

> So if you want to maintain that case B should be 60%. Once one of the
> tasks completes it will drop again. That is, the increased value
> represents the additional runnable 'load' over the min from the
> currently running task. Combined they will still complete in reduced
> time.

We already experimented with this approach in the past; actually, the
first version of SchedTune was based on the idea of aggregating by
adding the boosted utilizations.

It's true that in that way we are more likely to speed up task
completion also in the co-scheduling case, but the downside is that we
enter the domain of "guaranteed bandwidth allocation", which is
likely overkill for the scope of a best-effort solution.

Moreover, since bandwidth is a limited resource, we also found that such
an approach does not fit well for systems with a certain number of
tasks running concurrently, while a simple threshold-based boosting,
where max() is used as the aggregation function, still seems useful.
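The two aggregation policies discussed in this thread can be compared with a small sketch (hypothetical helpers, capacity expressed in percent):

```c
#include <assert.h>

/* max() aggregation: the threshold-based boosting proposed here. */
unsigned int agg_max(const unsigned int *cap_min, int n)
{
	unsigned int m = 0;
	for (int i = 0; i < n; i++)
		if (cap_min[i] > m)
			m = cap_min[i];
	return m;
}

/* sum() aggregation: closer to bandwidth allocation, capped at 100%. */
unsigned int agg_sum(const unsigned int *cap_min, int n)
{
	unsigned int s = 0;
	for (int i = 0; i < n; i++)
		s += cap_min[i];
	return s > 100 ? 100 : s;
}
```

With the usual example (TaskA capacity_min=20%, TaskB capacity_min=40%) co-scheduled, max() yields 40% while sum() yields 60%, which is exactly where the two positions in this thread diverge.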


> > Notice that this CPU-level aggregation is used just for OPP selection
> > on that CPU, while for TaskA we still use capacity_min=20% when we are
> > looking for a CPU.
> 
> And you don't find that inconsistent?

In the previous example, TaskB seems to prefer a CPU which has between
40% and 60% capacity.
Let's assume these numbers come from a use-case where:
 a) your system provides 60% capacity in a LITTLE CPU
 b) you are after "sustained performance" for TaskB, which on that
    platform can be easily achieved by running at 40% of a LITTLE CPU

Don't you think that this can be valuable information for the
scheduler to just (possibly) prefer a LITTLE CPU?

With a max() aggregation we can place both TaskA and TaskB on the
LITTLE CPU and try to run them at least at 40% capacity.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2017-04-13 11:33 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-28 14:38 [RFC v3 0/5] Add capacity capping support to the CPU controller Patrick Bellasi
2017-02-28 14:38 ` [RFC v3 1/5] sched/core: add capacity constraints to " Patrick Bellasi
2017-03-13 10:46   ` Joel Fernandes (Google)
2017-03-15 11:20     ` Patrick Bellasi
2017-03-15 13:20       ` Joel Fernandes
2017-03-15 16:10         ` Paul E. McKenney
2017-03-15 16:44           ` Patrick Bellasi
2017-03-15 17:24             ` Paul E. McKenney
2017-03-15 17:57               ` Patrick Bellasi
2017-03-20 17:15   ` Tejun Heo
2017-03-20 17:36     ` Tejun Heo
2017-03-20 18:08     ` Patrick Bellasi
2017-03-23  0:28       ` Joel Fernandes (Google)
2017-03-23 10:32         ` Patrick Bellasi
2017-03-23 16:01           ` Tejun Heo
2017-03-23 18:15             ` Patrick Bellasi
2017-03-23 18:39               ` Tejun Heo
2017-03-24  6:37                 ` Joel Fernandes (Google)
2017-03-24 15:00                   ` Tejun Heo
2017-03-30 21:13                 ` Paul Turner
2017-03-24  7:02           ` Joel Fernandes (Google)
2017-03-30 21:15       ` Paul Turner
2017-04-01 16:25         ` Patrick Bellasi
2017-02-28 14:38 ` [RFC v3 2/5] sched/core: track CPU's capacity_{min,max} Patrick Bellasi
2017-02-28 14:38 ` [RFC v3 3/5] sched/core: sync capacity_{min,max} between slow and fast paths Patrick Bellasi
2017-02-28 14:38 ` [RFC v3 4/5] sched/{core,cpufreq_schedutil}: add capacity clamping for FAIR tasks Patrick Bellasi
2017-02-28 14:38 ` [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks Patrick Bellasi
2017-03-13 10:08   ` Joel Fernandes (Google)
2017-03-15 11:40     ` Patrick Bellasi
2017-03-15 12:59       ` Joel Fernandes
2017-03-15 14:44         ` Juri Lelli
2017-03-15 16:13           ` Joel Fernandes
2017-03-15 16:24             ` Juri Lelli
2017-03-15 23:40               ` Joel Fernandes
2017-03-16 11:16                 ` Juri Lelli
2017-03-16 12:27                   ` Patrick Bellasi
2017-03-16 12:44                     ` Juri Lelli
2017-03-16 16:58                       ` Joel Fernandes
2017-03-16 17:17                         ` Juri Lelli
2017-03-15 11:41 ` [RFC v3 0/5] Add capacity capping support to the CPU controller Rafael J. Wysocki
2017-03-15 12:59   ` Patrick Bellasi
2017-03-16  1:04     ` Rafael J. Wysocki
2017-03-16  3:15       ` Joel Fernandes
2017-03-20 22:51         ` Rafael J. Wysocki
2017-03-21 11:01           ` Patrick Bellasi
2017-03-24 23:52             ` Rafael J. Wysocki
2017-03-16 12:23       ` Patrick Bellasi
2017-03-20 14:51 ` Tejun Heo
2017-03-20 17:22   ` Patrick Bellasi
2017-04-10  7:36     ` Peter Zijlstra
2017-04-11 17:58       ` Patrick Bellasi
2017-04-12 12:10         ` Peter Zijlstra
2017-04-12 13:55           ` Patrick Bellasi
2017-04-12 15:37             ` Peter Zijlstra
2017-04-13 11:33               ` Patrick Bellasi
2017-04-12 12:15         ` Peter Zijlstra
2017-04-12 13:34           ` Patrick Bellasi
2017-04-12 14:41             ` Peter Zijlstra
2017-04-12 12:22         ` Peter Zijlstra
2017-04-12 13:24           ` Patrick Bellasi
2017-04-12 12:48         ` Peter Zijlstra
2017-04-12 13:27           ` Patrick Bellasi
2017-04-12 14:34             ` Peter Zijlstra
2017-04-12 14:43               ` Patrick Bellasi
2017-04-12 16:14                 ` Peter Zijlstra
2017-04-13 10:34                   ` Patrick Bellasi
