[PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting
@ 2018-02-01 16:51 Mathieu Poirier
  2018-02-01 16:51 ` [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked() Mathieu Poirier
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

This is the follow-up patchset to [1] that attempt to fix a problem
reported by Steve Rostedt [2] where DL bandwidth accounting is not
recomputed after CPUset and CPU hotplug operations.  When CPU hotplug and
some CUPset manipulation take place root domains are destroyed and new ones
created, loosing at the same time DL accounting information pertaining to
utilisation.  Please see [1] for a full description of the approach.

In this revision a shortcoming identified by Luca is addressed, with most
of the solution kept unchanged.

A notable addition is patch 7/7 - it addresses a problem seen when hot
plugging out a CPU where a DL task is running (see changelog for full
details).  The issue is unrelated to this patchset and will manifest
itself on a mainline kernel.

I will start working on that problem once done with this set but lumping
it in here to raise awareness and provide a stop-gap measure while a
better solution is designed.  This set is also available here [3] with the
instrumentation for patch 7/7 in this commit [4].

This set applies cleanly on top of v4.15.

Best regards,
Mathieu

------
Change for V2:
. Addressing a problem found by Luca Abeni where the mask of a DL task
  isn't modified when cpuset are collapsed.

[1]. https://groups.google.com/forum/#!topic/linux.kernel/uakbvOQE6rc
[2]. https://lkml.org/lkml/2016/2/3/966
[3]. https://git.linaro.org/people/mathieu.poirier/linux.git/log/?h=v4.15-bandwidth-accounting-v2
[4]. af68563a6c21 ("sched/debug: Add 'rq_debug' proc entry")

Mathieu Poirier (7):
  sched/topology: Adding function partition_sched_domains_locked()
  cpuset: Rebuild root domain deadline accounting information
  sched/deadline: Keep new DL task within root domain's boundary
  cgroup: Constrain 'sched_load_balance' flag when DL tasks are present
  cgroup: Constrain the addition of CPUs to a new CPUset
  sched/core: Don't change the affinity of DL tasks
  sched/deadline: Prevent CPU hotplug operation if DL task on CPU

 include/linux/cpuset.h         |   6 ++
 include/linux/sched.h          |   3 +
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |   9 ++
 kernel/cgroup/cpuset.c         | 232 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c            |  32 +++++-
 kernel/sched/deadline.c        |  36 +++++++
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  31 +++++-
 9 files changed, 346 insertions(+), 14 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked()
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
@ 2018-02-01 16:51 ` Mathieu Poirier
  2018-02-02 10:19   ` Juri Lelli
  2018-02-01 16:51 ` [PATCH V2 2/7] cpuset: Rebuild root domain deadline accounting information Mathieu Poirier
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

Introducing function partition_sched_domains_locked() by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

This patch doesn't change the functionality provided by the
original code.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched/topology.h |  9 +++++++++
 kernel/sched/topology.c        | 17 +++++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2e728d..118141c3216a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -162,6 +162,9 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -207,6 +210,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 struct sched_domain_attr;
 
 static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
+static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
 {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 41bf2a531ee5..892e1f9c78f0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1842,15 +1842,15 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -1915,7 +1915,16 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V2 2/7] cpuset: Rebuild root domain deadline accounting information
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
  2018-02-01 16:51 ` [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked() Mathieu Poirier
@ 2018-02-01 16:51 ` Mathieu Poirier
  2018-02-02 12:52   ` Juri Lelli
  2018-02-01 16:51 ` [PATCH V2 3/7] sched/deadline: Keep new DL task within root domain's boundary Mathieu Poirier
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch address the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched.h          |  3 ++
 include/linux/sched/deadline.h |  8 ++++++
 kernel/cgroup/cpuset.c         | 63 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 31 +++++++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 14 +++++++++-
 6 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2588263a989..18d00914ae6c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -239,6 +239,9 @@ struct vtime {
 	u64			gtime;
 };
 
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index a5bc8728ead7..050961659b1d 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -29,4 +29,12 @@ static inline bool dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
+
 #endif /* _LINUX_SCHED_DEADLINE_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fbf40cbd37c9..fc5c709f99cf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -44,6 +44,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -812,6 +813,66 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+/*
+ * Called with cpuset_mutex held (rebuild_sched_domains())
+ * Called with hotplug lock held (rebuild_sched_domains_locked())
+ * Called with sched_domains_mutex held (partition_and_rebuild_domains())
+ */
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -844,7 +905,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2473736c7616..1a5c42e7b076 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2208,6 +2208,37 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+	dl_b->total_bw += p->dl.dl_bw;
+
+	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a212de..d369b748c5a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -659,9 +659,6 @@ struct root_domain {
 	unsigned long max_cpu_capacity;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 892e1f9c78f0..371468593f4a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -3,6 +3,7 @@
  * Scheduler topology setup/handling methods
  */
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/mutex.h>
 #include <linux/sched/isolation.h>
 #include <linux/proc_fs.h>
@@ -1875,8 +1876,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j])
-			    && dattrs_equal(dattr_cur, i, dattr_new, j))
+			    && dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V2 3/7] sched/deadline: Keep new DL task within root domain's boundary
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
  2018-02-01 16:51 ` [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked() Mathieu Poirier
  2018-02-01 16:51 ` [PATCH V2 2/7] cpuset: Rebuild root domain deadline accounting information Mathieu Poirier
@ 2018-02-01 16:51 ` Mathieu Poirier
  2018-02-02 14:35   ` Juri Lelli
  2018-02-01 16:51 ` [PATCH V2 4/7] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present Mathieu Poirier
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

When considering to move a task to the DL policy we need to make sure
the CPUs it is allowed to run on matches the CPUs of the root domains of
the runqueue it is currently assigned to.  Otherwise the task will be
allowed to roam on CPUs outside of this root domain, something that will
skew system deadline statistics and potentially lead to over selling DL
bandwidth.

For example say we have a 4 core system split in 2 cpuset: set1 has CPU 0
and 1 while set2 has CPU 2 and 3.  This results in 3 cpuset - the default
set that has all 4 CPUs along with set1 and set2 as just depicted.  We also
have task A that hasn't been assigned to any CPUset and as such, is part of
the default CPUset.

At the time we want to move task A to a DL policy it has been assigned to
CPU1.  Since CPU1 is part of set1 the root domain will have 2 CPUs in it
and the bandwidth constraint checked against the current DL bandwidth
allotment of those 2 CPUs.

If task A is promoted to a DL policy it's 'cpus_allowed' mask is still
equal to the CPUs in the default CPUset, making it possible for the
scheduler to move it to CPU2 and CPU3, which could also be running DL tasks
of their own.

This patch makes sure that a task's cpus_allowed mask matches the CPUs
in the root domain associated to the runqueue it has been assigned to.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 23 +++++++++++++++++++++++
 kernel/sched/core.c    | 22 ++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 1b8e41597ef5..61a405ffc3b1 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -57,6 +57,7 @@ extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern bool cpuset_cpus_match_task(struct task_struct *tsk);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -186,6 +187,11 @@ static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
 {
 }
 
+bool cpuset_cpus_match_task(struct task_struct *tsk)
+{
+	return true;
+}
+
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fc5c709f99cf..6942c4652f31 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2517,6 +2517,29 @@ void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 	 */
 }
 
+/**
+ * cpuset_cpus_match_task - return whether a task's cpus_allowed mask matches
+ * that of the cpuset it is assigned to.
+ * @tsk: pointer to the task_struct from which tsk->cpus_allowd is obtained.
+ *
+ * Description: Returns 'true' if the cpus_allowed mask of a task is the same
+ * as the cpus_allowed of the cpuset the task belongs to.  This is useful in
+ * situation where both cpuset and DL tasks are used.
+ */
+bool cpuset_cpus_match_task(struct task_struct *tsk)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&callback_lock, flags);
+	rcu_read_lock();
+	ret = cpumask_equal((task_cs(tsk))->cpus_allowed, &tsk->cpus_allowed);
+	rcu_read_unlock();
+	spin_unlock_irqrestore(&callback_lock, flags);
+
+	return ret;
+}
+
 void __init cpuset_init_current_mems_allowed(void)
 {
 	nodes_setall(current->mems_allowed);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32aabfda..1a64aad1b9dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4188,6 +4188,28 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
 
 	/*
+	 * If setscheduling to SCHED_DEADLINE we need to make sure the task
+	 * is constrained to run within the root domain it is associated with,
+	 * something that isn't guaranteed when using cpusets.
+	 *
+	 * Speaking of cpusets, we also need to assert that a task's
+	 * cpus_allowed mask equals its cpuset's cpus_allowed mask. Otherwise
+	 * a DL task could be assigned to a cpuset that has more CPUs than the
+	 * root domain it is associated with, a situation that yields no
+	 * benefits and greatly complicate the management of DL task when
+	 * cpusets are present.
+	 */
+	if (dl_policy(policy)) {
+		struct root_domain *rd = cpu_rq(task_cpu(p))->rd;
+
+		if (!cpumask_equal(&p->cpus_allowed, rd->span) ||
+		    !cpuset_cpus_match_task(p)) {
+			task_rq_unlock(rq, p, &rf);
+			return -EBUSY;
+		}
+	}
+
+	/*
 	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
 	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
 	 * is available.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V2 4/7] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (2 preceding siblings ...)
  2018-02-01 16:51 ` [PATCH V2 3/7] sched/deadline: Keep new DL task within root domain's boundary Mathieu Poirier
@ 2018-02-01 16:51 ` Mathieu Poirier
  2018-02-01 16:51 ` [PATCH V2 5/7] cgroup: Constrain the addition of CPUs to a new CPUset Mathieu Poirier
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

This patch prevents the 'sched_load_balance' flag from being set to 0
when DL tasks are present in a CPUset.  Otherwise we end up with the
DL tasks using CPUs belonging to different root domains, something that
breaks the mathematical model behind DL bandwidth management.

For example on a 4 core system CPUset "set1" has been created and CPUs
0 and 1 assigned to it.  A DL task has also been spun off.  By default
the DL task can use all the CPUs in the default CPUset.

If we set the base CPUset's cpuset.sched_load_balance to 0, CPU 0 and 1
are added to a newly created root domain while CPU 2 and 3 endup in the
default root domain.  But the DL task is still part of the base CPUset and
as such can use CPUs 0 to 3, spanning at the same time more than one root
domain.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/cgroup/cpuset.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6942c4652f31..daa1b2bc7e11 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -458,6 +458,106 @@ static void free_trial_cpuset(struct cpuset *trial)
 	kfree(trial);
 }
 
+static bool cpuset_has_dl_tasks(struct cpuset *cs)
+{
+	bool dl_tasks = false;
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	/* Go through each task in @cs looking for a DL task */
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while (!dl_tasks && (task = css_task_iter_next(&it))) {
+		if (dl_task(task))
+			dl_tasks = true;
+	}
+
+	css_task_iter_end(&it);
+
+	return dl_tasks;
+}
+
+/*
+ * Assumes RCU read lock and cpuset_mutex are held.
+ */
+static int
+validate_change_load_balance(struct cpuset *cur, struct cpuset *trial)
+{
+	bool populated = false, dl_tasks = false;
+	int ret = -EBUSY;
+	struct cgroup_subsys_state *pos_css;
+	struct cpuset *cs;
+
+	 /* Bail out if nothing has changed. */
+	if (is_sched_load_balance(cur) ==
+	    is_sched_load_balance(trial)) {
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * First deal with the generic case that applies when
+	 * cpuset.sched_load_balance gets flipped on a cpuset,
+	 * regardless of the value.
+	 */
+	cpuset_for_each_descendant_pre(cs, pos_css, cur) {
+		if (cpuset_has_dl_tasks(cs))
+			dl_tasks = true;
+
+		/* Skip the top cpuset since it obviously exists */
+		if (cs == cur)
+			continue;
+
+		/* Children without CPUs are not important */
+		if (cpumask_empty(cs->cpus_allowed)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		/* CPUs have been assigned to this cpuset. */
+		populated = true;
+
+		/*
+		 * Go no further if both conditions are true so that we
+		 * don't end up in a situation where a DL task is
+		 * spanning more than one root domain or only assigned
+		 * to a subset of the CPUs in a root domain.
+		 */
+		if (populated && dl_tasks)
+			goto out;
+	}
+
+	/*
+	 * Things get very complicated when dealing with children cpuset,
+	 * resulting in hard to maintain code and low confidence that
+	 * all cases are handled properly.  As such prevent the
+	 * cpuset.sched_load_balance from being modified on children cpuset
+	 * where DL tasks have been assigned (or any of its children).
+	 */
+	if (dl_tasks && parent_cs(cur))
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+/*
+ * Assumes RCU read lock and cpuset_mutex are held.
+ */
+static int
+validate_dl_change(struct cpuset *cur, struct cpuset *trial)
+{
+	int ret = 0;
+
+	/* Check if the sched_load_balance flag has been changed */
+	ret = validate_change_load_balance(cur, trial);
+	if (ret)
+		return ret;
+
+	return ret;
+}
+
 /*
  * validate_change() - Used to validate that any proposed cpuset change
  *		       follows the structural rules for cpusets.
@@ -492,6 +592,10 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 		if (!is_cpuset_subset(c, trial))
 			goto out;
 
+	/* Make sure changes are compatible with deadline scheduling class */
+	if (validate_dl_change(cur, trial))
+		goto out;
+
 	/* Remaining checks don't apply to root cpuset */
 	ret = 0;
 	if (cur == &top_cpuset)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V2 5/7] cgroup: Constrain the addition of CPUs to a new CPUset
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (3 preceding siblings ...)
  2018-02-01 16:51 ` [PATCH V2 4/7] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present Mathieu Poirier
@ 2018-02-01 16:51 ` Mathieu Poirier
  2018-02-01 16:51 ` [PATCH V2 6/7] sched/core: Don't change the affinity of DL tasks Mathieu Poirier
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

Care must be taken when CPUs are added to a new CPUset.  If an ancestor
of that set has its sched_load_balance flag switch off then the CPUs in
the new CPUset will be added to a new root domain.  If the ancestor also
had DL tasks those will end up covering more than one root domain,
breaking at the same time the DL integrity model.

This patch prevents adding CPUs to a new CPUset if one of its ancestor
had its sched_load_balance flag off and had DL tasks assigned to it.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/cgroup/cpuset.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index daa1b2bc7e11..6ae4ad995e9e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -481,6 +481,43 @@ static bool cpuset_has_dl_tasks(struct cpuset *cs)
  * Assumes RCU read lock and cpuset_mutex are held.
  */
 static int
+validate_change_cpus(struct cpuset *cur, struct cpuset *trial)
+{
+	int ret = 0;
+
+	/*
+	 * CPUs are being added to a CPUset.  If any parent of @trial has its
+	 * sched_load_balance flag switched off this operation will create a
+	 * new root domain spanning trial->cpus_allowed. At the same time
+	 * if any parent of @trial has a DL task, that task will end up
+	 * spanning more than one root domain and break the deadline integrity
+	 * model.
+	 */
+	if (cpumask_weight(cur->cpus_allowed) <
+	    cpumask_weight(trial->cpus_allowed)) {
+		struct cpuset *parent;
+
+		parent = parent_cs(trial);
+		/* Go up until we reach the top_cpuset */
+		while (parent) {
+			if (cpuset_has_dl_tasks(parent) &&
+			    !is_sched_load_balance(parent)) {
+				ret = -EBUSY;
+				goto out;
+			}
+
+			parent = parent_cs(parent);
+		}
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * Assumes RCU read lock and cpuset_mutex are held.
+ */
+static int
 validate_change_load_balance(struct cpuset *cur, struct cpuset *trial)
 {
 	bool populated = false, dl_tasks = false;
@@ -553,8 +590,13 @@ validate_dl_change(struct cpuset *cur, struct cpuset *trial)
 	/* Check if the sched_load_balance flag has been changed */
 	ret = validate_change_load_balance(cur, trial);
 	if (ret)
-		return ret;
+		goto out;
 
+	ret = validate_change_cpus(cur, trial);
+	if (ret)
+		goto out;
+
+out:
 	return ret;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V2 6/7] sched/core: Don't change the affinity of DL tasks
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (4 preceding siblings ...)
  2018-02-01 16:51 ` [PATCH V2 5/7] cgroup: Constrain the addition of CPUs to a new CPUset Mathieu Poirier
@ 2018-02-01 16:51 ` Mathieu Poirier
  2018-02-01 16:51 ` [PATCH V2 7/7] sched/deadline: Prevent CPU hotplug operation if DL task on CPU Mathieu Poirier
  2018-02-02 13:17 ` [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Luca Abeni
  7 siblings, 0 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

Now that we mandate that on creation, the ->cpus_allowed mask of a future
DL task has to be equal to the rd->span of the root domain it will be
associated with, changing the affinity of a DL task doesn't make sense
anymore.

Indeed, if we set the task to a smaller affinity set then we may be running
out of CPU cycle.  If we try to increase the size of the affinity set the
new CPUs are necessarily part of another root domain where DL utilisation
for the task won't be accounted for.

As such simply refuse to change the affinity set of a DL task.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/sched/core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a64aad1b9dc..2c9e54aac3cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4694,15 +4694,15 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	cpumask_and(new_mask, in_mask, cpus_allowed);
 
 	/*
-	 * Since bandwidth control happens on root_domain basis,
-	 * if admission test is enabled, we only admit -deadline
-	 * tasks allowed to run on all the CPUs in the task's
-	 * root_domain.
+	 * Since bandwidth control happens on root_domain basis, if admission
+	 * test is enabled we don't allow the task' CPUs to change.  If
+	 * @new_mask is smaller than we risk running out of cycles.  If it is
+	 * bigger than we may be using DL bandwidth allocated to other CPUs.
 	 */
 #ifdef CONFIG_SMP
 	if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
 		rcu_read_lock();
-		if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
+		if (!cpumask_equal(task_rq(p)->rd->span, new_mask)) {
 			retval = -EBUSY;
 			rcu_read_unlock();
 			goto out_free_new_mask;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V2 7/7] sched/deadline: Prevent CPU hotplug operation if DL task on CPU
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (5 preceding siblings ...)
  2018-02-01 16:51 ` [PATCH V2 6/7] sched/core: Don't change the affinity of DL tasks Mathieu Poirier
@ 2018-02-01 16:51 ` Mathieu Poirier
  2018-02-02 13:17 ` [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Luca Abeni
  7 siblings, 0 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-01 16:51 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, linux-kernel

When a DL task is assigned a CPU the "utilisation" (this_bw) and the
"active utilisation" (running_bw) fields of rq->dl are incremented
accordingly.  If the CPU is hotplugged out the DL task is transferred
to another CPU but the task's contribution to this_bw and running_bw
isn't substracted from the outgoing CPU's rq nor added to the newly
appointed CPU.

In this example (where a kernel has been instrumented to output the
relevant information) we have a 4 CPU system with one 6:10 DL task that
has been assigned to CPU 2:

	root@dragon:/home/linaro/# cat /proc/rq_debug

	dl_rq[0]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 3984588
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[1]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span:			: 0-3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 3984588	<-- RD capacity for 4 CPUs
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[2]:
	  .online			: yes
	  .dl_nr_running		: 1		<-- One task running
	  .running_bw			: 629145	<-- Normal behavior
	  .this_bw			: 629145	<-- Normal behavior
	  .rd->span			: 0-3
	  .dl_nr_migratory		: 1
	  .rd->dl_bw->bw		: 3984588
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[3]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 3984588
	  .rd->dl_bw->total_bw		: 629145

At this point we hotplug out CPU2 and list the status again:

root@dragon:/home/linaro/# echo  0 > /sys/devices/system/cpu/cpu2/online
root@dragon:/home/linaro/# cat /proc/rq_debug

	dl_rq[0]:
	  .online			: yes
	  .dl_nr_running		: 1		<-- DL task was moved here
	  .running_bw			: 0		<-- Contribution not added
	  .this_bw			: 0		<-- Contribution not added
	  .rd->span			: 0-1,3
	  .dl_nr_migratory		: 1
	  .rd->dl_bw->bw		: 2988441	<-- RD capacity updated
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[1]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-1,3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 2988441
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[2]:
	  .online			: no		<-- runqueue no longer online
	  .dl_nr_running		: 0		<-- DL task was moved
	  .running_bw			: 629145	<-- Contribution not substracted
	  .this_bw			: 629145	<-- Contribution not substracted
	  .rd->span			: 2
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 996147
	  .rd->dl_bw->total_bw		: 0

	dl_rq[3]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-1,3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 2988441
	  .rd->dl_bw->total_bw: 629145

Upon rebooting the system a splat is also produced:

[  578.184789] ------------[ cut here ]------------
[  578.184813] dl_rq->running_bw > old
[  578.184838] WARNING: CPU: 0 PID: 4076 at /home/mpoirier/work/linaro/deadline/kernel/kernel/sched/deadline.c:98 dequeue_task_dl+0x128/0x168
[  578.191693] Modules linked in:
[  578.191705] CPU: 0 PID: 4076 Comm: burn Not tainted 4.15.0-00009-gf597fc1e5764-dirty #259
[  578.191708] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[  578.191713] pstate: 60000085 (nZCv daIf -PAN -UAO)
[  578.191718] pc : dequeue_task_dl+0x128/0x168
[  578.191722] lr : dequeue_task_dl+0x128/0x168
[  578.191724] sp : ffff8000383ebbf0
[  578.191727] x29: ffff8000383ebbf0 x28: ffff800038288000
[  578.191733] x27: 0000000000000009 x26: ffff800038890000
[  578.191739] x25: ffff800038994e60 x24: ffff800038994e00
[  578.191744] x23: 0000000000000000 x22: 0000000000000000
[  578.191749] x21: 000000000000000e x20: ffff800038288000
[  578.191755] x19: ffff80003d950aa8 x18: 0000000000000010
[  578.191761] x17: 0000000000000001 x16: 0000000000002710
[  578.191766] x15: 0000000000000006 x14: ffff0000892ed37f
[  578.191772] x13: ffff0000092ed38d x12: 0000000000000000
[  578.191778] x11: ffff8000383eb840 x10: 0000000005f5e0ff
[  578.191784] x9 : 0000000000000034 x8 : 625f676e696e6e75
[  578.191794] x7 : 723e2d71725f6c64 x6 : 000000000000016c
[  578.191800] x5 : 0000000000000000 x4 : 0000000000000000
[  578.191806] x3 : ffffffffffffffff x2 : 000080003480f000
[  578.191812] x1 : ffff800038288000 x0 : 0000000000000017
[  578.191818] Call trace:
[  578.191824]  dequeue_task_dl+0x128/0x168
[  578.191830]  sched_move_task+0xa8/0x150
[  578.191837]  sched_autogroup_exit_task+0x20/0x30
[  578.191843]  do_exit+0x2c4/0x9f8
[  578.191847]  do_group_exit+0x3c/0xa0
[  578.191853]  get_signal+0x2a4/0x568
[  578.191860]  do_signal+0x70/0x210
[  578.191866]  do_notify_resume+0xe0/0x138
[  578.191870]  work_pending+0x8/0x10
[  578.191874] ---[ end trace 345388d10dc698fe ]---

As a stop-gap measure before the real solution is available this patch
prevents users from carrying out a CPU hotplug operation if a DL task is
running (or suspended) on said CPU.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/sched/deadline.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1a5c42e7b076..9fa9421bb23f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2672,11 +2672,16 @@ bool dl_cpu_busy(unsigned int cpu)
 	int cpus;
 
 	rcu_read_lock_sched();
+	overflow = !!(cpu_rq(cpu)->dl.this_bw);
+	if (overflow)
+		goto out;
+
 	dl_b = dl_bw_of(cpu);
 	raw_spin_lock_irqsave(&dl_b->lock, flags);
 	cpus = dl_bw_cpus(cpu);
 	overflow = __dl_overflow(dl_b, cpus, 0, 0);
 	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+out:
 	rcu_read_unlock_sched();
 	return overflow;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked()
  2018-02-01 16:51 ` [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked() Mathieu Poirier
@ 2018-02-02 10:19   ` Juri Lelli
  2018-02-05 18:11     ` Mathieu Poirier
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2018-02-02 10:19 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, luca.abeni, linux-kernel

Hi Mathieu,

On 01/02/18 09:51, Mathieu Poirier wrote:
> Introducing function partition_sched_domains_locked() by taking
> the mutex locking code out of the original function.  That way
> the work done by partition_sched_domains_locked() can be reused
> without dropping the mutex lock.
> 
> This patch doesn't change the functionality provided by the
> original code.
> 
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> ---

[...]

> +/*
> + * Call with hotplug lock held

Is this the one that we can actually check if it's locked with

lockdep_assert_cpus_held()

?

> + */
> +void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> +			     struct sched_domain_attr *dattr_new)
> +{
> +	mutex_lock(&sched_domains_mutex);
> +	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
>  	mutex_unlock(&sched_domains_mutex);
>  }

Best,

- Juri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 2/7] cpuset: Rebuild root domain deadline accounting information
  2018-02-01 16:51 ` [PATCH V2 2/7] cpuset: Rebuild root domain deadline accounting information Mathieu Poirier
@ 2018-02-02 12:52   ` Juri Lelli
  2018-02-05 18:59     ` Mathieu Poirier
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2018-02-02 12:52 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, luca.abeni, linux-kernel

Hi Mathieu,

On 01/02/18 09:51, Mathieu Poirier wrote:
> When the topology of root domains is modified by CPUset or CPUhotplug
> operations information about the current deadline bandwidth held in the
> root domain is lost.
> 
> This patch address the issue by recalculating the lost deadline
> bandwidth information by circling through the deadline tasks held in
> CPUsets and adding their current load to the root domain they are
> associated with.
> 
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> ---

[...]

> +static void update_tasks_root_domain(struct cpuset *cs)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +
> +	css_task_iter_start(&cs->css, 0, &it);
> +
> +	while ((task = css_task_iter_next(&it)))
> +		dl_add_task_root_domain(task);
> +
> +	css_task_iter_end(&it);
> +}
> +
> +/*
> + * Called with cpuset_mutex held (rebuild_sched_domains())
> + * Called with hotplug lock held (rebuild_sched_domains_locked())
> + * Called with sched_domains_mutex held (partition_and_rebuild_domains())
> + */
> +static void rebuild_root_domains(void)
> +{
> +	struct cpuset *cs = NULL;
> +	struct cgroup_subsys_state *pos_css;
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * Clear default root domain DL accounting, it will be computed again
> +	 * if a task belongs to it.
> +	 */
> +	dl_clear_root_domain(&def_root_domain);

Can't a __sched_setscheduler sneak in at this point, set a task to
DEADLINE, add its bandwidth to the rd...

> +
> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> +
> +		if (cpumask_empty(cs->effective_cpus)) {
> +			pos_css = css_rightmost_descendant(pos_css);
> +			continue;
> +		}
> +
> +		css_get(&cs->css);
> +
> +		rcu_read_unlock();
> +
> +		update_tasks_root_domain(cs);

... and this adds it again?

> +
> +		rcu_read_lock();
> +		css_put(&cs->css);
> +	}
> +	rcu_read_unlock();
> +}

Best,

- Juri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting
  2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (6 preceding siblings ...)
  2018-02-01 16:51 ` [PATCH V2 7/7] sched/deadline: Prevent CPU hotplug operation if DL task on CPU Mathieu Poirier
@ 2018-02-02 13:17 ` Luca Abeni
  2018-02-05 20:48   ` Mathieu Poirier
  7 siblings, 1 reply; 17+ messages in thread
From: Luca Abeni @ 2018-02-02 13:17 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, juri.lelli, linux-kernel

Hi Mathieu,

On Thu,  1 Feb 2018 09:51:02 -0700
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

> This is the follow-up patchset to [1] that attempt to fix a problem
> reported by Steve Rostedt [2] where DL bandwidth accounting is not
> recomputed after CPUset and CPU hotplug operations.  When CPU hotplug and
> some CUPset manipulation take place root domains are destroyed and new ones
> created, loosing at the same time DL accounting information pertaining to
> utilisation.  Please see [1] for a full description of the approach.

I do not know the cgroup / cpuset code too much, so I have no useful
comments on your patches... But I think this patchset is a nice
improvemnt respect to the current situation.

[...]
> A notable addition is patch 7/7 - it addresses a problem seen when hot
> plugging out a CPU where a DL task is running (see changelog for full
> details).  The issue is unrelated to this patchset and will manifest
> itself on a mainline kernel.

I think I introduced this bug with my reclaiming patches, so I am
interested.
When a cpu is hot-plugged out, which code in the kernel is responsible
for migrating the tasks that are executing on such CPU? I was sure I
was handling all the relevant codepaths, but this bug clearly shows
that I was wrong.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 3/7] sched/deadline: Keep new DL task within root domain's boundary
  2018-02-01 16:51 ` [PATCH V2 3/7] sched/deadline: Keep new DL task within root domain's boundary Mathieu Poirier
@ 2018-02-02 14:35   ` Juri Lelli
  2018-02-05 18:58     ` Mathieu Poirier
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2018-02-02 14:35 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, luca.abeni, linux-kernel

Hi Mathieu,

On 01/02/18 09:51, Mathieu Poirier wrote:
> When considering to move a task to the DL policy we need to make sure
> the CPUs it is allowed to run on matches the CPUs of the root domains of
> the runqueue it is currently assigned to.  Otherwise the task will be
> allowed to roam on CPUs outside of this root domain, something that will
> skew system deadline statistics and potentially lead to over selling DL
> bandwidth.
> 
> For example say we have a 4 core system split in 2 cpuset: set1 has CPU 0
> and 1 while set2 has CPU 2 and 3.  This results in 3 cpuset - the default
> set that has all 4 CPUs along with set1 and set2 as just depicted.  We also
> have task A that hasn't been assigned to any CPUset and as such, is part of
> the default CPUset.
> 
> At the time we want to move task A to a DL policy it has been assigned to
> CPU1.  Since CPU1 is part of set1 the root domain will have 2 CPUs in it
> and the bandwidth constraint checked against the current DL bandwidth
> allotment of those 2 CPUs.

Wait.. I'm confused. :)

Do you disabled cpuset.sched_load_balance in the root (default) cpuset?
If yes, we would end up with 2 root domains and if task A happens to be
on root domain (0-1) checking its admission against 2 CPUs looks like
the right thing to do to me. If no, then there is a single root domain
(the root/deafult one) with 4 CPUs, and it indeed seems that we've
probably got a problem: it is possible for a DEADLINE task running on
root/default cpuset to be put in (for example) 0-1 cpuset, and so
restrict its affinity. Is it this that this patch cures?

Anyway, see more comments below..

[...]

>  	/*
> +	 * If setscheduling to SCHED_DEADLINE we need to make sure the task
> +	 * is constrained to run within the root domain it is associated with,
> +	 * something that isn't guaranteed when using cpusets.
> +	 *
> +	 * Speaking of cpusets, we also need to assert that a task's
> +	 * cpus_allowed mask equals its cpuset's cpus_allowed mask. Otherwise
> +	 * a DL task could be assigned to a cpuset that has more CPUs than the
> +	 * root domain it is associated with, a situation that yields no
> +	 * benefits and greatly complicate the management of DL task when
> +	 * cpusets are present.
> +	 */
> +	if (dl_policy(policy)) {
> +		struct root_domain *rd = cpu_rq(task_cpu(p))->rd;

I fear root_domain doesn't exist on UP.

Maybe this logic can be put above changing the check we already do
against the span?

https://elixir.free-electrons.com/linux/latest/source/kernel/sched/core.c#L4174

Best,

- Juri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked()
  2018-02-02 10:19   ` Juri Lelli
@ 2018-02-05 18:11     ` Mathieu Poirier
  2018-02-06  7:42       ` Juri Lelli
  0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-05 18:11 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, linux-kernel

On 2 February 2018 at 03:19, Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Mathieu,
>
> On 01/02/18 09:51, Mathieu Poirier wrote:
>> Introducing function partition_sched_domains_locked() by taking
>> the mutex locking code out of the original function.  That way
>> the work done by partition_sched_domains_locked() can be reused
>> without dropping the mutex lock.
>>
>> This patch doesn't change the functionality provided by the
>> original code.
>>
>> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
>> ---
>
> [...]
>
>> +/*
>> + * Call with hotplug lock held
>
> Is this the one that we can actually check if it's locked with
>
> lockdep_assert_cpus_held()
>
> ?

Hi Juri,

You are correct - we could call lockdep_assert_cpus_held() but in my
opinion it would be in a separate patch and probably outside the scope
of this work.  The sole purpose of this patch is to get the
locking/unlocking operations of mutex sched_domains_mutex out of
function partition_sched_domains_locked().

Mathieu

>
>> + */
>> +void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>> +                          struct sched_domain_attr *dattr_new)
>> +{
>> +     mutex_lock(&sched_domains_mutex);
>> +     partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
>>       mutex_unlock(&sched_domains_mutex);
>>  }
>
> Best,
>
> - Juri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 3/7] sched/deadline: Keep new DL task within root domain's boundary
  2018-02-02 14:35   ` Juri Lelli
@ 2018-02-05 18:58     ` Mathieu Poirier
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-05 18:58 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, linux-kernel

On 2 February 2018 at 07:35, Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Mathieu,
>
> On 01/02/18 09:51, Mathieu Poirier wrote:
>> When considering to move a task to the DL policy we need to make sure
>> the CPUs it is allowed to run on matches the CPUs of the root domains of
>> the runqueue it is currently assigned to.  Otherwise the task will be
>> allowed to roam on CPUs outside of this root domain, something that will
>> skew system deadline statistics and potentially lead to over selling DL
>> bandwidth.
>>
>> For example say we have a 4 core system split in 2 cpuset: set1 has CPU 0
>> and 1 while set2 has CPU 2 and 3.  This results in 3 cpuset - the default
>> set that has all 4 CPUs along with set1 and set2 as just depicted.  We also
>> have task A that hasn't been assigned to any CPUset and as such, is part of
>> the default CPUset.
>>
>> At the time we want to move task A to a DL policy it has been assigned to
>> CPU1.  Since CPU1 is part of set1 the root domain will have 2 CPUs in it
>> and the bandwidth constraint checked against the current DL bandwidth
>> allotment of those 2 CPUs.
>
> Wait.. I'm confused. :)

Rightly so - it is confusing.

>
> Do you disabled cpuset.sched_load_balance in the root (default) cpuset?

Correct.  I was trying to be as clear as possible but also avoid
writing too much - I'll make that fact explicit in the next revision.

> If yes, we would end up with 2 root domains and if task A happens to be
> on root domain (0-1) checking its admission against 2 CPUs looks like
> the right thing to do to me.

So the task is running on CPU1 and as such admission control will be
done against root domain (0-1).  The problem here is that task A isn't
part of set1 (hence root domain (0-1)), it is part of the default
cpuset and that set also includes root domain (2-3) - and that is a
problem.


> If no, then there is a single root domain
> (the root/deafult one) with 4 CPUs, and it indeed seems that we've
> probably got a problem: it is possible for a DEADLINE task running on
> root/default cpuset to be put in (for example) 0-1 cpuset, and so
> restrict its affinity. Is it this that this patch cures?

That is exactly what this patch does.  It will prevent a task from
being promoted to DL if it is part of a cpuset (any cpuset) that has
its cpuset.sched_load_balance flag disabled and also has populated
children cpusets.  That way we prevent tasks from spanning multiple
root domains.

>
> Anyway, see more comments below..
>
> [...]
>
>>       /*
>> +      * If setscheduling to SCHED_DEADLINE we need to make sure the task
>> +      * is constrained to run within the root domain it is associated with,
>> +      * something that isn't guaranteed when using cpusets.
>> +      *
>> +      * Speaking of cpusets, we also need to assert that a task's
>> +      * cpus_allowed mask equals its cpuset's cpus_allowed mask. Otherwise
>> +      * a DL task could be assigned to a cpuset that has more CPUs than the
>> +      * root domain it is associated with, a situation that yields no
>> +      * benefits and greatly complicate the management of DL task when
>> +      * cpusets are present.
>> +      */
>> +     if (dl_policy(policy)) {
>> +             struct root_domain *rd = cpu_rq(task_cpu(p))->rd;
>
> I fear root_domain doesn't exist on UP.
>
> Maybe this logic can be put above changing the check we already do
> against the span?

Yes, indeed.  I'll fix that.

>
> https://elixir.free-electrons.com/linux/latest/source/kernel/sched/core.c#L4174
>
> Best,
>
> - Juri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 2/7] cpuset: Rebuild root domain deadline accounting information
  2018-02-02 12:52   ` Juri Lelli
@ 2018-02-05 18:59     ` Mathieu Poirier
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-05 18:59 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, linux-kernel

On 2 February 2018 at 05:52, Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Mathieu,
>
> On 01/02/18 09:51, Mathieu Poirier wrote:
>> When the topology of root domains is modified by CPUset or CPUhotplug
>> operations information about the current deadline bandwidth held in the
>> root domain is lost.
>>
>> This patch address the issue by recalculating the lost deadline
>> bandwidth information by circling through the deadline tasks held in
>> CPUsets and adding their current load to the root domain they are
>> associated with.
>>
>> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
>> ---
>
> [...]
>
>> +static void update_tasks_root_domain(struct cpuset *cs)
>> +{
>> +     struct css_task_iter it;
>> +     struct task_struct *task;
>> +
>> +     css_task_iter_start(&cs->css, 0, &it);
>> +
>> +     while ((task = css_task_iter_next(&it)))
>> +             dl_add_task_root_domain(task);
>> +
>> +     css_task_iter_end(&it);
>> +}
>> +
>> +/*
>> + * Called with cpuset_mutex held (rebuild_sched_domains())
>> + * Called with hotplug lock held (rebuild_sched_domains_locked())
>> + * Called with sched_domains_mutex held (partition_and_rebuild_domains())
>> + */
>> +static void rebuild_root_domains(void)
>> +{
>> +     struct cpuset *cs = NULL;
>> +     struct cgroup_subsys_state *pos_css;
>> +
>> +     rcu_read_lock();
>> +
>> +     /*
>> +      * Clear default root domain DL accounting, it will be computed again
>> +      * if a task belongs to it.
>> +      */
>> +     dl_clear_root_domain(&def_root_domain);
>
> Can't a __sched_setscheduler sneak in at this point, set a task to
> DEADLINE, add its bandwidth to the rd...

Let me think about that one for a bit.

Thanks for the review,
Mathieu

>
>> +
>> +     cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +
>> +             if (cpumask_empty(cs->effective_cpus)) {
>> +                     pos_css = css_rightmost_descendant(pos_css);
>> +                     continue;
>> +             }
>> +
>> +             css_get(&cs->css);
>> +
>> +             rcu_read_unlock();
>> +
>> +             update_tasks_root_domain(cs);
>
> ... and this adds it again?
>
>> +
>> +             rcu_read_lock();
>> +             css_put(&cs->css);
>> +     }
>> +     rcu_read_unlock();
>> +}
>
> Best,
>
> - Juri

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting
  2018-02-02 13:17 ` [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Luca Abeni
@ 2018-02-05 20:48   ` Mathieu Poirier
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Poirier @ 2018-02-05 20:48 UTC (permalink / raw)
  To: Luca Abeni
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, juri.lelli, linux-kernel

On Fri, Feb 02, 2018 at 02:17:50PM +0100, Luca Abeni wrote:
> Hi Mathieu,
> 
> On Thu,  1 Feb 2018 09:51:02 -0700
> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
> 
> > This is the follow-up patchset to [1] that attempt to fix a problem
> > reported by Steve Rostedt [2] where DL bandwidth accounting is not
> > recomputed after CPUset and CPU hotplug operations.  When CPU hotplug and
> > some CUPset manipulation take place root domains are destroyed and new ones
> > created, loosing at the same time DL accounting information pertaining to
> > utilisation.  Please see [1] for a full description of the approach.
> 
> I do not know the cgroup / cpuset code too much, so I have no useful
> comments on your patches... But I think this patchset is a nice
> improvemnt respect to the current situation.
> 
> [...]
> > A notable addition is patch 7/7 - it addresses a problem seen when hot
> > plugging out a CPU where a DL task is running (see changelog for full
> > details).  The issue is unrelated to this patchset and will manifest
> > itself on a mainline kernel.
> 
> I think I introduced this bug with my reclaiming patches, so I am
> interested.
> When a cpu is hot-plugged out, which code in the kernel is responsible
> for migrating the tasks that are executing on such CPU?

sched_cpu_deactivate()
    cpuset_cpu_inactive()
        cpuset_update_active_cpus()
            cpuset_hotplug_workfn()
                hotplug_update_tasks_legacy()
                    hotplug_update_tasks()
                        set_cpus_allowed_ptr()
                            __set_cpus_allowed_ptr()


> I was sure I
> was handling all the relevant codepaths, but this bug clearly shows
> that I was wrong.

I remember reviewing your patchset and I too thought you had tackled all the
cases.  In function __set_cpus_allowed_ptr() you'll notice two cases are
handled, i.e the task is running or suspended.  I suspect the former to be the
culprit but haven't investigated fully.

Regards,
Mathieu
                                


> 
> 
> 			Thanks,
> 				Luca

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked()
  2018-02-05 18:11     ` Mathieu Poirier
@ 2018-02-06  7:42       ` Juri Lelli
  0 siblings, 0 replies; 17+ messages in thread
From: Juri Lelli @ 2018-02-06  7:42 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, linux-kernel

On 05/02/18 11:11, Mathieu Poirier wrote:
> On 2 February 2018 at 03:19, Juri Lelli <juri.lelli@redhat.com> wrote:
> > Hi Mathieu,
> >
> > On 01/02/18 09:51, Mathieu Poirier wrote:
> >> Introducing function partition_sched_domains_locked() by taking
> >> the mutex locking code out of the original function.  That way
> >> the work done by partition_sched_domains_locked() can be reused
> >> without dropping the mutex lock.
> >>
> >> This patch doesn't change the functionality provided by the
> >> original code.
> >>
> >> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> >> ---
> >
> > [...]
> >
> >> +/*
> >> + * Call with hotplug lock held
> >
> > Is this the one that we can actually check if it's locked with
> >
> > lockdep_assert_cpus_held()
> >
> > ?
> 
> Hi Juri,
> 
> You are correct - we could call lockdep_assert_cpus_held() but in my
> opinion it would be in a separate patch and probably outside the scope
> of this work.  The sole purpose of this patch is to get the
> locking/unlocking operations of mutex sched_domains_mutex out of
> function partition_sched_domains_locked().

Fair enough. I just thought though that, since you are adding the comment
above, we could as well add an explicit check for hotplug lock.

Best,

- Juri

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-02-06  7:43 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-01 16:51 [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
2018-02-01 16:51 ` [PATCH V2 1/7] sched/topology: Adding function partition_sched_domains_locked() Mathieu Poirier
2018-02-02 10:19   ` Juri Lelli
2018-02-05 18:11     ` Mathieu Poirier
2018-02-06  7:42       ` Juri Lelli
2018-02-01 16:51 ` [PATCH V2 2/7] cpuset: Rebuild root domain deadline accounting information Mathieu Poirier
2018-02-02 12:52   ` Juri Lelli
2018-02-05 18:59     ` Mathieu Poirier
2018-02-01 16:51 ` [PATCH V2 3/7] sched/deadline: Keep new DL task within root domain's boundary Mathieu Poirier
2018-02-02 14:35   ` Juri Lelli
2018-02-05 18:58     ` Mathieu Poirier
2018-02-01 16:51 ` [PATCH V2 4/7] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present Mathieu Poirier
2018-02-01 16:51 ` [PATCH V2 5/7] cgroup: Constrain the addition of CPUs to a new CPUset Mathieu Poirier
2018-02-01 16:51 ` [PATCH V2 6/7] sched/core: Don't change the affinity of DL tasks Mathieu Poirier
2018-02-01 16:51 ` [PATCH V2 7/7] sched/deadline: Prevent CPU hotplug operation if DL task on CPU Mathieu Poirier
2018-02-02 13:17 ` [PATCH V2 0/7] sched/deadline: fix cpusets bandwidth accounting Luca Abeni
2018-02-05 20:48   ` Mathieu Poirier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).