* [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting
@ 2018-02-13 20:32 Mathieu Poirier
  2018-02-13 20:32   ` Mathieu Poirier
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

This is the third installment of a patchset that attempts to fix a problem
reported by Steve Rostedt [1] where DL bandwidth accounting is not
recomputed after CPUset and CPU hotplug operations take place.  When CPU
hotplug and certain CPUset manipulations take place, root domains are
destroyed and new ones created, losing in the process the DL accounting
information pertaining to utilisation.  Please see [2] for a full
description of the approach.
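
For reference, the problem can be triggered with a sequence along the lines
of the one below.  This is only a sketch: the cgroup v1 mount point, task
name and 6ms/10ms deadline parameters are illustrative, and chrt is the
util-linux tool.

	# start a deadline task (6ms runtime every 10ms period)
	chrt -d --sched-runtime 6000000 --sched-deadline 10000000 \
		--sched-period 10000000 0 ./dl-app &

	# any cpuset operation that regenerates the sched domains (here,
	# toggling load balancing at the root) destroys the root domains
	# and creates new ones
	echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
	echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

	# the new root domains start with no record of the DL bandwidth
	# already allocated to dl-app, so later admission tests can
	# oversell the CPUs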

The most notable change in this revision is the resolution of
synchronisation issues between function __sched_setscheduler() and the
CPUset subsystem, as pointed out by Juri Lelli.

As with the second revision, this set is available here [3] with the
instrumentation for patch 10/10 in this commit [4].

This set applies cleanly on top of v4.16-rc1.

Best regards,
Mathieu

------
Changes for V3:
. Addressed potential race conditions between the CPUset subsystem and
  function __sched_setscheduler().
. Added a lockdep assert to function partition_sched_domains().

Changes for V2:
. Addressed a problem found by Luca Abeni where the mask of a DL task
  isn't modified when cpusets are collapsed.

[1]. https://lkml.org/lkml/2016/2/3/966
[2]. https://groups.google.com/forum/#!topic/linux.kernel/uakbvOQE6rc
[3]. https://git.linaro.org/people/mathieu.poirier/linux.git/log/?h=v4.16-rc1-bandwidth-accounting-v3
[4]. 4a95e8ab0881 sched/debug: Add 'rq_debug' proc entry


Mathieu Poirier (10):
  sched/topology: Add check to backup comment about hotplug lock
  sched/topology: Adding function partition_sched_domains_locked()
  sched/core: Streamlining calls to task_rq_unlock()
  sched/core: Prevent race condition between cpuset and
    __sched_setscheduler()
  cpuset: Rebuild root domain deadline accounting information
  sched/deadline: Keep new DL task within root domain's boundary
  cgroup: Constrain 'sched_load_balance' flag when DL tasks are present
  cgroup: Constrain the addition of CPUs to a new CPUset
  sched/core: Don't change the affinity of DL tasks
  sched/deadline: Prevent CPU hotplug operation if DL task on CPU

 include/linux/cpuset.h         |  12 ++
 include/linux/sched.h          |   5 +
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |  10 ++
 kernel/cgroup/cpuset.c         | 246 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c            |  63 ++++++++---
 kernel/sched/deadline.c        |  36 ++++++
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  33 +++++-
 9 files changed, 392 insertions(+), 24 deletions(-)

-- 
2.7.4

* [PATCH V3 01/10] sched/topology: Add check to backup comment about hotplug lock
@ 2018-02-13 20:32   ` Mathieu Poirier
  0 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

The comment above function partition_sched_domains() clearly states that
the cpu_hotplug_lock should be held, but nothing enforces it.  Adding an
explicit check backs that comment and makes it impossible for anyone to
miss the requirement.

Suggested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/sched/topology.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8010c2974d30..7c744c7425ec 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1863,6 +1863,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	int i, j, n;
 	int new_topology;
 
+	lockdep_assert_cpus_held();
 	mutex_lock(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
-- 
2.7.4

* [PATCH V3 02/10] sched/topology: Adding function partition_sched_domains_locked()
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
  2018-02-13 20:32   ` Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 03/10] sched/core: Streamlining calls to task_rq_unlock() Mathieu Poirier
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

Introduce function partition_sched_domains_locked() by moving the mutex
locking code out of the original function.  That way the work done by
partition_sched_domains_locked() can be reused without dropping the mutex.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 18 ++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..57997caf61b6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -162,6 +162,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -207,6 +211,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 struct sched_domain_attr;
 
 static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
+static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
 {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 7c744c7425ec..eee721da40fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1855,16 +1855,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	int i, j, n;
 	int new_topology;
 
 	lockdep_assert_cpus_held();
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -1929,7 +1929,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	lockdep_assert_cpus_held();
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
 
-- 
2.7.4

* [PATCH V3 03/10] sched/core: Streamlining calls to task_rq_unlock()
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
  2018-02-13 20:32   ` Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 02/10] sched/topology: Adding function partition_sched_domains_locked() Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Mathieu Poirier
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler().  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location needs to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf724c1952ea..f727c3d0064c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4189,8 +4189,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4206,8 +4206,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4220,8 +4220,8 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4236,8 +4236,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4256,8 +4256,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4313,6 +4313,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,
-- 
2.7.4

* [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (2 preceding siblings ...)
  2018-02-13 20:32 ` [PATCH V3 03/10] sched/core: Streamlining calls to task_rq_unlock() Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  2018-02-14 10:36   ` Juri Lelli
  2018-02-13 20:32   ` Mathieu Poirier
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

No synchronisation mechanism exists between the cpuset subsystem and calls
to function __sched_setscheduler().  As such it is possible that new root
domains are created on the cpuset side while a deadline acceptance test
is carried out in __sched_setscheduler(), leading to a potential oversell
of CPU bandwidth.

By making the cpuset_mutex available to the core scheduler it is possible
to prevent situations such as the one described above from happening.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 16 ++++++++++++++++
 kernel/sched/core.c    |  9 +++++++++
 3 files changed, 31 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..4bbb3f5a3020 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_lock(void);
+extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,10 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_lock(void) { }
+
+static inline void cpuset_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d18c72e83de4..41f8391640e6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2410,6 +2410,22 @@ void __init cpuset_init_smp(void)
 }
 
 /**
+ * cpuset_lock - Grab the cpuset_mutex from another subsystem
+ */
+void cpuset_lock(void)
+{
+	mutex_lock(&cpuset_mutex);
+}
+
+/**
+ * cpuset_unlock - Release the cpuset_mutex from another subsystem
+ */
+void cpuset_unlock(void)
+{
+	mutex_unlock(&cpuset_mutex);
+}
+
+/**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
  * @pmask: pointer to struct cpumask variable to receive cpus_allowed set.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f727c3d0064c..0d8badcf1f0f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4176,6 +4176,13 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
 
 	/*
+	 * Make sure we don't race with the cpuset subsystem where root
+	 * domains can be rebuilt or modified while operations like DL
+	 * admission checks are carried out.
+	 */
+	cpuset_lock();
+
+	/*
 	 * Make sure no PI-waiters arrive (or leave) while we are
 	 * changing the priority of the task:
 	 *
@@ -4247,6 +4254,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
 		task_rq_unlock(rq, p, &rf);
+		cpuset_unlock();
 		goto recheck;
 	}
 
@@ -4316,6 +4324,7 @@ static int __sched_setscheduler(struct task_struct *p,
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
+	cpuset_unlock();
 	return retval;
 }
 
-- 
2.7.4

* [PATCH V3 05/10] cpuset: Rebuild root domain deadline accounting information
@ 2018-02-13 20:32   ` Mathieu Poirier
  0 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

When the topology of root domains is modified by CPUset or CPU hotplug
operations, information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recomputing the lost deadline
bandwidth information: it iterates over the deadline tasks held in
CPUsets and adds their current load to the root domain they are
associated with.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched.h          |  5 ++++
 include/linux/sched/deadline.h |  8 ++++++
 kernel/cgroup/cpuset.c         | 63 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 31 +++++++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 14 +++++++++-
 6 files changed, 119 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a902e..b9bd7da8dc85 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -239,6 +239,11 @@ struct vtime {
 	u64			gtime;
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index a5bc8728ead7..050961659b1d 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -29,4 +29,12 @@ static inline bool dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
+
 #endif /* _LINUX_SCHED_DEADLINE_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 41f8391640e6..d8108030b754 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -44,6 +44,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -812,6 +813,66 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+/*
+ * Called with cpuset_mutex held (rebuild_sched_domains())
+ * Called with hotplug lock held (rebuild_sched_domains_locked())
+ * Called with sched_domains_mutex held (partition_and_rebuild_domains())
+ */
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -844,7 +905,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9bb0e0c412ec..8eb508cf1990 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2269,6 +2269,37 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+	dl_b->total_bw += p->dl.dl_bw;
+
+	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb5fc458547f..8aba192d4d17 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -685,9 +685,6 @@ struct root_domain {
 	unsigned long max_cpu_capacity;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index eee721da40fb..3210ded386ba 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -3,6 +3,7 @@
  * Scheduler topology setup/handling methods
  */
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/mutex.h>
 #include <linux/sched/isolation.h>
 #include <linux/proc_fs.h>
@@ -1889,8 +1890,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j])
-			    && dattrs_equal(dattr_cur, i, dattr_new, j))
+			    && dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.7.4

* [PATCH V3 06/10] sched/deadline: Keep new DL task within root domain's boundary
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (4 preceding siblings ...)
  2018-02-13 20:32   ` Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 07/10] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present Mathieu Poirier
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

When considering moving a task to the DL policy we need to make sure
the CPUs it is allowed to run on match the CPUs of the root domain of
the runqueue it is currently assigned to.  Otherwise the task will be
allowed to roam on CPUs outside of this root domain, something that will
skew system deadline statistics and potentially lead to overselling DL
bandwidth.

For example, consider a system where the cpuset.sched_load_balance flag of
the root cpuset has been set to 0 and the 4-core system split in 2 cpusets:
set1 has CPUs 0 and 1 while set2 has CPUs 2 and 3.  This results in 3
cpusets, i.e., the default set that has all 4 CPUs along with set1 and set2
as just depicted.  We also have task A that hasn't been assigned to any
CPUset and, as such, is part of the default (root) CPUset.

At the time we want to move task A to a DL policy it has been assigned to
CPU1.  Since CPU1 is part of set1, the root domain will have 2 CPUs in it
and the bandwidth constraint is checked against the current DL bandwidth
allotment of those 2 CPUs.

If task A is promoted to a DL policy its 'cpus_allowed' mask is still
equal to the CPUs in the default CPUset, making it possible for the
scheduler to move it to CPU2 and CPU3, which could also be running DL tasks
of their own.
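
The scenario maps to a sequence roughly like the one below, using the
cgroup v1 cpuset interface.  The mount point, CPU split and deadline
parameters are illustrative; the same promotion can also be done
programmatically with sched_setattr():

	cd /sys/fs/cgroup/cpuset                # illustrative mount point
	echo 0 > cpuset.sched_load_balance      # stop balancing at the root

	mkdir set1 set2
	echo 0-1 > set1/cpuset.cpus
	echo 0   > set1/cpuset.mems
	echo 2-3 > set2/cpuset.cpus
	echo 0   > set2/cpuset.mems

	# task A stays in the root cpuset but currently runs on CPU1, so the
	# DL admission test only looks at set1's two-CPU root domain
	chrt -d --sched-runtime 6000000 --sched-deadline 10000000 \
		--sched-period 10000000 -p 0 <pid of task A>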

This patch makes sure that a task's cpus_allowed mask matches the CPUs
of the root domain associated with the runqueue it has been assigned to.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 23 +++++++++++++++++++++++
 kernel/sched/core.c    | 20 ++++++++++++++++++++
 3 files changed, 49 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 4bbb3f5a3020..f6a9051de907 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -57,6 +57,7 @@ extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
+extern bool cpuset_cpus_match_task(struct task_struct *tsk);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -182,6 +183,11 @@ static inline void cpuset_lock(void) { }
 
 static inline void cpuset_unlock(void) { }
 
+static inline bool cpuset_cpus_match_task(struct task_struct *tsk)
+{
+	return true;
+}
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d8108030b754..45a5035ae601 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2487,6 +2487,29 @@ void cpuset_unlock(void)
 }
 
 /**
+ * cpuset_cpus_match_task - return whether a task's cpus_allowed mask matches
+ * that of the cpuset it is assigned to.
+ * @tsk: pointer to the task_struct from which tsk->cpus_allowed is obtained.
+ *
+ * Description: Returns 'true' if the cpus_allowed mask of a task is the same
+ * as the cpus_allowed of the cpuset the task belongs to.  This is useful in
+ * situations where both cpusets and DL tasks are used.
+ */
+bool cpuset_cpus_match_task(struct task_struct *tsk)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&callback_lock, flags);
+	rcu_read_lock();
+	ret = cpumask_equal((task_cs(tsk))->cpus_allowed, &tsk->cpus_allowed);
+	rcu_read_unlock();
+	spin_unlock_irqrestore(&callback_lock, flags);
+
+	return ret;
+}
+
+/**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
  * @pmask: pointer to struct cpumask variable to receive cpus_allowed set.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d8badcf1f0f..b930857f4d14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4237,6 +4237,26 @@ static int __sched_setscheduler(struct task_struct *p,
 			cpumask_t *span = rq->rd->span;
 
 			/*
+			 * If setscheduling to SCHED_DEADLINE we need to make
+			 * sure the task is constrained to run within the root
+			 * domain it is associated with, something that isn't
+			 * guaranteed when using cpusets.
+			 *
+			 * Speaking of cpusets, we also need to assert that a
+			 * task's cpus_allowed mask equals its cpuset's
+			 * cpus_allowed mask. Otherwise a DL task could be
+			 * assigned to a cpuset that has more CPUs than the root
+			 * domain it is associated with, a situation that yields
+			 * no benefits and greatly complicate the management of
+			 * DL task when cpusets are present.
+			 */
+			if (!cpumask_equal(&p->cpus_allowed, span) ||
+			    !cpuset_cpus_match_task(p)) {
+				retval = -EPERM;
+				goto unlock;
+			}
+
+			/*
 			 * Don't allow tasks with an affinity mask smaller than
 			 * the entire root_domain to become SCHED_DEADLINE. We
 			 * will also fail if there's no bandwidth available.
-- 
2.7.4

* [PATCH V3 07/10] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (5 preceding siblings ...)
  2018-02-13 20:32 ` [PATCH V3 06/10] sched/deadline: Keep new DL task within root domain's boundary Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 08/10] cgroup: Constrain the addition of CPUs to a new CPUset Mathieu Poirier
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

This patch prevents the 'sched_load_balance' flag from being set to 0
when DL tasks are present in a CPUset.  Otherwise we end up with the
DL tasks using CPUs belonging to different root domains, something that
breaks the mathematical model behind DL bandwidth management.

For example, on a 4-core system CPUset "set1" has been created and CPUs
0 and 1 assigned to it.  A DL task has also been spun off.  By default
the DL task can use all the CPUs in the default CPUset.

If we set the base CPUset's cpuset.sched_load_balance to 0, CPUs 0 and 1
are added to a newly created root domain while CPUs 2 and 3 end up in the
default root domain.  But the DL task is still part of the base CPUset and
as such can use CPUs 0 to 3, spanning at the same time more than one root
domain.
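
In terms of the cgroup v1 cpuset interface the sequence looks roughly like
this (mount point and deadline parameters are illustrative):

	cd /sys/fs/cgroup/cpuset                # illustrative mount point
	mkdir set1
	echo 0-1 > set1/cpuset.cpus
	echo 0   > set1/cpuset.mems

	# DL task spun off in the root cpuset, free to run on CPUs 0-3
	chrt -d --sched-runtime 6000000 --sched-deadline 10000000 \
		--sched-period 10000000 0 ./dl-app &

	# without this patch the write succeeds and the DL task now spans two
	# root domains; with it the write is refused (-EBUSY)
	echo 0 > cpuset.sched_load_balance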

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/cgroup/cpuset.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 45a5035ae601..4f5e8bac5337 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -458,6 +458,106 @@ static void free_trial_cpuset(struct cpuset *trial)
 	kfree(trial);
 }
 
+static bool cpuset_has_dl_tasks(struct cpuset *cs)
+{
+	bool dl_tasks = false;
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	/* Go through each task in @cs looking for a DL task */
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while (!dl_tasks && (task = css_task_iter_next(&it))) {
+		if (dl_task(task))
+			dl_tasks = true;
+	}
+
+	css_task_iter_end(&it);
+
+	return dl_tasks;
+}
+
+/*
+ * Assumes RCU read lock and cpuset_mutex are held.
+ */
+static int
+validate_change_load_balance(struct cpuset *cur, struct cpuset *trial)
+{
+	bool populated = false, dl_tasks = false;
+	int ret = -EBUSY;
+	struct cgroup_subsys_state *pos_css;
+	struct cpuset *cs;
+
+	 /* Bail out if nothing has changed. */
+	if (is_sched_load_balance(cur) ==
+	    is_sched_load_balance(trial)) {
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * First deal with the generic case that applies when
+	 * cpuset.sched_load_balance gets flipped on a cpuset,
+	 * regardless of the value.
+	 */
+	cpuset_for_each_descendant_pre(cs, pos_css, cur) {
+		if (cpuset_has_dl_tasks(cs))
+			dl_tasks = true;
+
+		/* Skip the top cpuset since it obviously exists */
+		if (cs == cur)
+			continue;
+
+		/* Children without CPUs are not important */
+		if (cpumask_empty(cs->cpus_allowed)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		/* CPUs have been assigned to this cpuset. */
+		populated = true;
+
+		/*
+		 * Go no further if both conditions are true so that we
+		 * don't end up in a situation where a DL task is
+		 * spanning more than one root domain or only assigned
+		 * to a subset of the CPUs in a root domain.
+		 */
+		if (populated && dl_tasks)
+			goto out;
+	}
+
+	/*
+	 * Things get very complicated when dealing with children cpuset,
+	 * resulting in hard to maintain code and low confidence that
+	 * all cases are handled properly.  As such prevent the
+	 * cpuset.sched_load_balance from being modified on children cpuset
+	 * where DL tasks have been assigned (or any of its children).
+	 */
+	if (dl_tasks && parent_cs(cur))
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+/*
+ * Assumes RCU read lock and cpuset_mutex are held.
+ */
+static int
+validate_dl_change(struct cpuset *cur, struct cpuset *trial)
+{
+	int ret = 0;
+
+	/* Check if the sched_load_balance flag has been changed */
+	ret = validate_change_load_balance(cur, trial);
+	if (ret)
+		return ret;
+
+	return ret;
+}
+
 /*
  * validate_change() - Used to validate that any proposed cpuset change
  *		       follows the structural rules for cpusets.
@@ -492,6 +592,10 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 		if (!is_cpuset_subset(c, trial))
 			goto out;
 
+	/* Make sure changes are compatible with deadline scheduling class */
+	if (validate_dl_change(cur, trial))
+		goto out;
+
 	/* Remaining checks don't apply to root cpuset */
 	ret = 0;
 	if (cur == &top_cpuset)
-- 
2.7.4

* [PATCH V3 08/10] cgroup: Constrain the addition of CPUs to a new CPUset
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (6 preceding siblings ...)
  2018-02-13 20:32 ` [PATCH V3 07/10] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 09/10] sched/core: Don't change the affinity of DL tasks Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 10/10] sched/deadline: Prevent CPU hotplug operation if DL task on CPU Mathieu Poirier
  9 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

Care must be taken when CPUs are added to a new CPUset.  If an ancestor
of that set has its sched_load_balance flag switched off then the CPUs in
the new CPUset will be added to a new root domain.  If the ancestor also
has DL tasks, those will end up covering more than one root domain,
breaking at the same time the DL integrity model.

This patch prevents adding CPUs to a new CPUset if one of its ancestors
has its sched_load_balance flag off and has DL tasks assigned to it.
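
Concretely, the sequence that is now refused looks something like this
(mount point and deadline parameters are illustrative):

	cd /sys/fs/cgroup/cpuset                # illustrative mount point

	# DL task in the root cpuset, load balancing already switched off
	chrt -d --sched-runtime 6000000 --sched-deadline 10000000 \
		--sched-period 10000000 0 ./dl-app &
	echo 0 > cpuset.sched_load_balance

	mkdir set1
	# with this patch the write below is refused (-EBUSY): it would attach
	# the new cpuset's CPUs to a new root domain while an ancestor with
	# load balancing off still holds a DL task
	echo 0-1 > set1/cpuset.cpus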

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/cgroup/cpuset.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4f5e8bac5337..4c1b71608c23 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -481,6 +481,43 @@ static bool cpuset_has_dl_tasks(struct cpuset *cs)
  * Assumes RCU read lock and cpuset_mutex are held.
  */
 static int
+validate_change_cpus(struct cpuset *cur, struct cpuset *trial)
+{
+	int ret = 0;
+
+	/*
+	 * CPUs are being added to a CPUset.  If any parent of @trial has its
+	 * sched_load_balance flag switched off this operation will create a
+	 * new root domain spanning trial->cpus_allowed. At the same time
+	 * if any parent of @trial has a DL task, that task will end up
+	 * spanning more than one root domain and break the deadline integrity
+	 * model.
+	 */
+	if (cpumask_weight(cur->cpus_allowed) <
+	    cpumask_weight(trial->cpus_allowed)) {
+		struct cpuset *parent;
+
+		parent = parent_cs(trial);
+		/* Go up until we reach the top_cpuset */
+		while (parent) {
+			if (cpuset_has_dl_tasks(parent) &&
+			    !is_sched_load_balance(parent)) {
+				ret = -EBUSY;
+				goto out;
+			}
+
+			parent = parent_cs(parent);
+		}
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * Assumes RCU read lock and cpuset_mutex are held.
+ */
+static int
 validate_change_load_balance(struct cpuset *cur, struct cpuset *trial)
 {
 	bool populated = false, dl_tasks = false;
@@ -553,8 +590,11 @@ validate_dl_change(struct cpuset *cur, struct cpuset *trial)
 	/* Check if the sched_load_balance flag has been changed */
 	ret = validate_change_load_balance(cur, trial);
 	if (ret)
-		return ret;
+		goto out;
 
+	ret = validate_change_cpus(cur, trial);
+
+out:
 	return ret;
 }
 
-- 
2.7.4

* [PATCH V3 09/10] sched/core: Don't change the affinity of DL tasks
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (7 preceding siblings ...)
  2018-02-13 20:32 ` [PATCH V3 08/10] cgroup: Constrain the addition of CPUs to a new CPUset Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  2018-02-13 20:32 ` [PATCH V3 10/10] sched/deadline: Prevent CPU hotplug operation if DL task on CPU Mathieu Poirier
  9 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

Now that we mandate that on creation, the ->cpus_allowed mask of a future
DL task has to be equal to the rd->span of the root domain it will be
associated with, changing the affinity of a DL task doesn't make sense
anymore.

Indeed, if we restrict the task to a smaller affinity set then we may run
out of CPU cycles.  If we try to increase the size of the affinity set, the
new CPUs are necessarily part of another root domain where DL utilisation
for the task won't be accounted for.

As such simply refuse to change the affinity set of a DL task.
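
From userspace this means that any affinity request which isn't the exact
root domain span is now refused for a DL task, for example (pid
illustrative):

	# try to pin an existing SCHED_DEADLINE task to CPU0 only
	taskset -pc 0 <pid of DL task>    # sched_setaffinity() returns -EBUSY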

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/sched/core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b930857f4d14..7a265478d873 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4773,15 +4773,15 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	cpumask_and(new_mask, in_mask, cpus_allowed);
 
 	/*
-	 * Since bandwidth control happens on root_domain basis,
-	 * if admission test is enabled, we only admit -deadline
-	 * tasks allowed to run on all the CPUs in the task's
-	 * root_domain.
+	 * Since bandwidth control happens on root_domain basis, if admission
+	 * test is enabled we don't allow the task's CPUs to change.  If
+	 * @new_mask is smaller, we risk running out of cycles.  If it is
+	 * bigger, we may be using DL bandwidth allocated to other CPUs.
 	 */
 #ifdef CONFIG_SMP
 	if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
 		rcu_read_lock();
-		if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
+		if (!cpumask_equal(task_rq(p)->rd->span, new_mask)) {
 			retval = -EBUSY;
 			rcu_read_unlock();
 			goto out_free_new_mask;
-- 
2.7.4

* [PATCH V3 10/10] sched/deadline: Prevent CPU hotplug operation if DL task on CPU
  2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
                   ` (8 preceding siblings ...)
  2018-02-13 20:32 ` [PATCH V3 09/10] sched/core: Don't change the affinity of DL tasks Mathieu Poirier
@ 2018-02-13 20:32 ` Mathieu Poirier
  9 siblings, 0 replies; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-13 20:32 UTC (permalink / raw)
  To: peterz
  Cc: lizefan, mingo, rostedt, claudio, bristot, tommaso.cucinotta,
	juri.lelli, luca.abeni, cgroups, linux-kernel

When a DL task is assigned a CPU, the "utilisation" (this_bw) and the
"active utilisation" (running_bw) fields of rq->dl are incremented
accordingly.  If the CPU is hotplugged out, the DL task is transferred
to another CPU, but the task's contribution to this_bw and running_bw
isn't subtracted from the outgoing CPU's rq nor added to that of the newly
appointed CPU.

In this example (where a kernel has been instrumented to output the
relevant information) we have a 4 CPU system with one 6:10 DL task that
has been assigned to CPU 2:

	root@dragon:/home/linaro/# cat /proc/rq_debug

	dl_rq[0]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 3984588
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[1]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span:			: 0-3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 3984588	<-- RD capacity for 4 CPUs
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[2]:
	  .online			: yes
	  .dl_nr_running		: 1		<-- One task running
	  .running_bw			: 629145	<-- Normal behavior
	  .this_bw			: 629145	<-- Normal behavior
	  .rd->span			: 0-3
	  .dl_nr_migratory		: 1
	  .rd->dl_bw->bw		: 3984588
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[3]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 3984588
	  .rd->dl_bw->total_bw		: 629145

At this point we hotplug out CPU2 and list the status again:

	root@dragon:/home/linaro/# echo  0 > /sys/devices/system/cpu/cpu2/online
	root@dragon:/home/linaro/# cat /proc/rq_debug

	dl_rq[0]:
	  .online			: yes
	  .dl_nr_running		: 1		<-- DL task was moved here
	  .running_bw			: 0		<-- Contribution not added
	  .this_bw			: 0		<-- Contribution not added
	  .rd->span			: 0-1,3
	  .dl_nr_migratory		: 1
	  .rd->dl_bw->bw		: 2988441	<-- RD capacity updated
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[1]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-1,3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 2988441
	  .rd->dl_bw->total_bw		: 629145

	dl_rq[2]:
	  .online			: no		<-- runqueue no longer online
	  .dl_nr_running		: 0		<-- DL task was moved
	  .running_bw			: 629145	<-- Contribution not substracted
	  .this_bw			: 629145	<-- Contribution not substracted
	  .rd->span			: 2
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 996147
	  .rd->dl_bw->total_bw		: 0

	dl_rq[3]:
	  .online			: yes
	  .dl_nr_running		: 0
	  .running_bw			: 0
	  .this_bw			: 0
	  .rd->span			: 0-1,3
	  .dl_nr_migratory		: 0
	  .rd->dl_bw->bw		: 2988441
	  .rd->dl_bw->total_bw		: 629145

Upon rebooting the system a splat is also produced:

[  578.184789] ------------[ cut here ]------------
[  578.184813] dl_rq->running_bw > old
[  578.184838] WARNING: CPU: 0 PID: 4076 at /home/mpoirier/work/linaro/deadline/kernel/kernel/sched/deadline.c:98 dequeue_task_dl+0x128/0x168
[  578.191693] Modules linked in:
[  578.191705] CPU: 0 PID: 4076 Comm: burn Not tainted 4.15.0-00009-gf597fc1e5764-dirty #259
[  578.191708] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[  578.191713] pstate: 60000085 (nZCv daIf -PAN -UAO)
[  578.191718] pc : dequeue_task_dl+0x128/0x168
[  578.191722] lr : dequeue_task_dl+0x128/0x168
[  578.191724] sp : ffff8000383ebbf0
[  578.191727] x29: ffff8000383ebbf0 x28: ffff800038288000
[  578.191733] x27: 0000000000000009 x26: ffff800038890000
[  578.191739] x25: ffff800038994e60 x24: ffff800038994e00
[  578.191744] x23: 0000000000000000 x22: 0000000000000000
[  578.191749] x21: 000000000000000e x20: ffff800038288000
[  578.191755] x19: ffff80003d950aa8 x18: 0000000000000010
[  578.191761] x17: 0000000000000001 x16: 0000000000002710
[  578.191766] x15: 0000000000000006 x14: ffff0000892ed37f
[  578.191772] x13: ffff0000092ed38d x12: 0000000000000000
[  578.191778] x11: ffff8000383eb840 x10: 0000000005f5e0ff
[  578.191784] x9 : 0000000000000034 x8 : 625f676e696e6e75
[  578.191794] x7 : 723e2d71725f6c64 x6 : 000000000000016c
[  578.191800] x5 : 0000000000000000 x4 : 0000000000000000
[  578.191806] x3 : ffffffffffffffff x2 : 000080003480f000
[  578.191812] x1 : ffff800038288000 x0 : 0000000000000017
[  578.191818] Call trace:
[  578.191824]  dequeue_task_dl+0x128/0x168
[  578.191830]  sched_move_task+0xa8/0x150
[  578.191837]  sched_autogroup_exit_task+0x20/0x30
[  578.191843]  do_exit+0x2c4/0x9f8
[  578.191847]  do_group_exit+0x3c/0xa0
[  578.191853]  get_signal+0x2a4/0x568
[  578.191860]  do_signal+0x70/0x210
[  578.191866]  do_notify_resume+0xe0/0x138
[  578.191870]  work_pending+0x8/0x10
[  578.191874] ---[ end trace 345388d10dc698fe ]---

As a stop-gap measure before the real solution is available, this patch
prevents users from carrying out a CPU hotplug operation if a DL task is
running (or suspended) on said CPU.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/sched/deadline.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8eb508cf1990..c46aaa7c3569 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2741,11 +2741,16 @@ bool dl_cpu_busy(unsigned int cpu)
 	int cpus;
 
 	rcu_read_lock_sched();
+	overflow = !!(cpu_rq(cpu)->dl.this_bw);
+	if (overflow)
+		goto out;
+
 	dl_b = dl_bw_of(cpu);
 	raw_spin_lock_irqsave(&dl_b->lock, flags);
 	cpus = dl_bw_cpus(cpu);
 	overflow = __dl_overflow(dl_b, cpus, 0, 0);
 	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+out:
 	rcu_read_unlock_sched();
 	return overflow;
 }
-- 
2.7.4

* Re: [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-02-13 20:32 ` [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Mathieu Poirier
@ 2018-02-14 10:36   ` Juri Lelli
  2018-02-14 10:49       ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2018-02-14 10:36 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, luca.abeni, cgroups, linux-kernel

Hi Mathieu,

On 13/02/18 13:32, Mathieu Poirier wrote:
> No synchronisation mechanism exist between the cpuset subsystem and calls
> to function __sched_setscheduler().  As such it is possible that new root
> domains are created on the cpuset side while a deadline acceptance test
> is carried out in __sched_setscheduler(), leading to a potential oversell
> of CPU bandwidth.
> 
> By making available the cpuset_mutex to the core scheduler it is possible
> to prevent situations such as the one described above from happening.
> 
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> ---

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f727c3d0064c..0d8badcf1f0f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4176,6 +4176,13 @@ static int __sched_setscheduler(struct task_struct *p,
>  	}
>  
>  	/*
> +	 * Make sure we don't race with the cpuset subsystem where root
> +	 * domains can be rebuilt or modified while operations like DL
> +	 * admission checks are carried out.
> +	 */
> +	cpuset_lock();
> +
> +	/*

Mmm, I'm afraid we can't do this. __sched_setscheduler might be called
from interrupt context by normalize_rt_tasks().

Best,

- Juri

* Re: [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
@ 2018-02-14 10:49       ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-02-14 10:49 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, luca.abeni, cgroups, linux-kernel

On 14/02/18 11:36, Juri Lelli wrote:
> Hi Mathieu,
> 
> On 13/02/18 13:32, Mathieu Poirier wrote:
> > No synchronisation mechanism exist between the cpuset subsystem and calls
> > to function __sched_setscheduler().  As such it is possible that new root
> > domains are created on the cpuset side while a deadline acceptance test
> > is carried out in __sched_setscheduler(), leading to a potential oversell
> > of CPU bandwidth.
> > 
> > By making available the cpuset_mutex to the core scheduler it is possible
> > to prevent situations such as the one described above from happening.
> > 
> > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > ---
> 
> [...]
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f727c3d0064c..0d8badcf1f0f 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4176,6 +4176,13 @@ static int __sched_setscheduler(struct task_struct *p,
> >  	}
> >  
> >  	/*
> > +	 * Make sure we don't race with the cpuset subsystem where root
> > +	 * domains can be rebuilt or modified while operations like DL
> > +	 * admission checks are carried out.
> > +	 */
> > +	cpuset_lock();
> > +
> > +	/*
> 
> Mmm, I'm afraid we can't do this. __sched_setscheduler might be called
> from interrupt contex by normalize_rt_tasks().

Maybe conditionally grabbing it if pi is true could do? I guess we don't
care much about domains when sysrq.

^ permalink raw reply	[flat|nested] 22+ messages in thread
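
The "conditionally grab it when pi is true" idea above would amount to control flow along the following lines; this is only an illustration, with a pthread mutex standing in for cpuset_mutex and every name hypothetical.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t cpuset_mutex_standin = PTHREAD_MUTEX_INITIALIZER;

/*
 * Only the normal syscall path (pi == true) would serialise against
 * root domain changes; the sysrq/normalize_rt_tasks() path (pi == false)
 * runs in a context that cannot take a sleeping lock and does not need
 * root-domain-accurate DL accounting.
 */
static int setscheduler_sketch(bool pi)
{
	if (pi)
		pthread_mutex_lock(&cpuset_mutex_standin);

	/* ... scheduling class and DL admission checks would go here ... */

	if (pi)
		pthread_mutex_unlock(&cpuset_mutex_standin);

	return 0;
}

int main(void)
{
	return setscheduler_sketch(true);
}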

* Re: [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-02-14 10:49       ` Juri Lelli
  (?)
@ 2018-02-14 11:27       ` Juri Lelli
  2018-02-14 15:33         ` Mathieu Poirier
  -1 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2018-02-14 11:27 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: peterz, lizefan, mingo, rostedt, claudio, bristot,
	tommaso.cucinotta, luca.abeni, cgroups, linux-kernel

On 14/02/18 11:49, Juri Lelli wrote:
> On 14/02/18 11:36, Juri Lelli wrote:
> > Hi Mathieu,
> > 
> > On 13/02/18 13:32, Mathieu Poirier wrote:
> > > No synchronisation mechanism exist between the cpuset subsystem and calls
> > > to function __sched_setscheduler().  As such it is possible that new root
> > > domains are created on the cpuset side while a deadline acceptance test
> > > is carried out in __sched_setscheduler(), leading to a potential oversell
> > > of CPU bandwidth.
> > > 
> > > By making available the cpuset_mutex to the core scheduler it is possible
> > > to prevent situations such as the one described above from happening.
> > > 
> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > ---
> > 
> > [...]
> > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f727c3d0064c..0d8badcf1f0f 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4176,6 +4176,13 @@ static int __sched_setscheduler(struct task_struct *p,
> > >  	}
> > >  
> > >  	/*
> > > +	 * Make sure we don't race with the cpuset subsystem where root
> > > +	 * domains can be rebuilt or modified while operations like DL
> > > +	 * admission checks are carried out.
> > > +	 */
> > > +	cpuset_lock();
> > > +
> > > +	/*
> > 
> > Mmm, I'm afraid we can't do this. __sched_setscheduler might be called
> > from interrupt contex by normalize_rt_tasks().
> 
> Maybe conditionally grabbing it if pi is true could do? I guess we don't
> care much about domains when sysrq.

Oops.. just got this. :/

--->8---
[    0.020203] ======================================================
[    0.020946] WARNING: possible circular locking dependency detected
[    0.021000] 4.16.0-rc1+ #64 Not tainted
[    0.021000] ------------------------------------------------------
[    0.021000] swapper/0/1 is trying to acquire lock:
[    0.021000]  (cpu_hotplug_lock.rw_sem){.+.+}, at: [<000000007164d41d>] smpboot_register_percpu_thread_cpumask+0x2d/0x100
[    0.021000]
[    0.021000] but task is already holding lock:
[    0.021000]  (cpuset_mutex){+.+.}, at: [<000000008529a52c>] __sched_setscheduler+0xb5/0x8b0
[    0.021000]
[    0.021000] which lock already depends on the new lock.
[    0.021000]
[    0.021000]
[    0.021000] the existing dependency chain (in reverse order) is:
[    0.021000]
[    0.021000] -> #2 (cpuset_mutex){+.+.}:
[    0.021000]        __sched_setscheduler+0xb5/0x8b0
[    0.021000]        _sched_setscheduler+0x6c/0x80
[    0.021000]        __kthread_create_on_node+0x10e/0x170
[    0.021000]        kthread_create_on_node+0x37/0x40
[    0.021000]        kthread_create_on_cpu+0x27/0x90
[    0.021000]        __smpboot_create_thread.part.3+0x64/0xe0
[    0.021000]        smpboot_register_percpu_thread_cpumask+0x91/0x100
[    0.021000]        spawn_ksoftirqd+0x37/0x40
[    0.021000]        do_one_initcall+0x3b/0x160
[    0.021000]        kernel_init_freeable+0x118/0x258
[    0.021000]        kernel_init+0xa/0x100
[    0.021000]        ret_from_fork+0x3a/0x50
[    0.021000]
[    0.021000] -> #1 (smpboot_threads_lock){+.+.}:
[    0.021000]        smpboot_register_percpu_thread_cpumask+0x3b/0x100
[    0.021000]        spawn_ksoftirqd+0x37/0x40
[    0.021000]        do_one_initcall+0x3b/0x160
[    0.021000]        kernel_init_freeable+0x118/0x258
[    0.021000]        kernel_init+0xa/0x100
[    0.021000]        ret_from_fork+0x3a/0x50
[    0.021000]
[    0.021000] -> #0 (cpu_hotplug_lock.rw_sem){.+.+}:
[    0.021000]        cpus_read_lock+0x3e/0x80
[    0.021000]        smpboot_register_percpu_thread_cpumask+0x2d/0x100
[    0.021000]        lockup_detector_init+0x3e/0x74
[    0.021000]        kernel_init_freeable+0x146/0x258
[    0.021000]        kernel_init+0xa/0x100
[    0.021000]        ret_from_fork+0x3a/0x50
[    0.021000]
[    0.021000] other info that might help us debug this:
[    0.021000]
[    0.021000] Chain exists of:
[    0.021000]   cpu_hotplug_lock.rw_sem --> smpboot_threads_lock --> cpuset_mutex
[    0.021000]
[    0.021000]  Possible unsafe locking scenario:
[    0.021000]
[    0.021000]        CPU0                    CPU1
[    0.021000]        ----                    ----
[    0.021000]   lock(cpuset_mutex);
[    0.021000]                                lock(smpboot_threads_lock);
[    0.021000]                                lock(cpuset_mutex);
[    0.021000]   lock(cpu_hotplug_lock.rw_sem);
[    0.021000]
[    0.021000]  *** DEADLOCK ***
[    0.021000]
[    0.021000] 1 lock held by swapper/0/1:
[    0.021000]  #0:  (cpuset_mutex){+.+.}, at: [<000000008529a52c>] __sched_setscheduler+0xb5/0x8b0
[    0.021000]
[    0.021000] stack backtrace:
[    0.021000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc1+ #64
[    0.021000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[    0.021000] Call Trace:
[    0.021000]  dump_stack+0x85/0xc5
[    0.021000]  print_circular_bug.isra.38+0x1ce/0x1db
[    0.021000]  __lock_acquire+0x1278/0x1320
[    0.021000]  ? sched_clock_local+0x12/0x80
[    0.021000]  ? lock_acquire+0x9f/0x1f0
[    0.021000]  lock_acquire+0x9f/0x1f0
[    0.021000]  ? smpboot_register_percpu_thread_cpumask+0x2d/0x100
[    0.021000]  cpus_read_lock+0x3e/0x80
[    0.021000]  ? smpboot_register_percpu_thread_cpumask+0x2d/0x100
[    0.021000]  smpboot_register_percpu_thread_cpumask+0x2d/0x100
[    0.021000]  ? set_debug_rodata+0x11/0x11
[    0.021000]  lockup_detector_init+0x3e/0x74
[    0.021000]  kernel_init_freeable+0x146/0x258
[    0.021000]  ? rest_init+0xd0/0xd0
[    0.021000]  kernel_init+0xa/0x100
[    0.021000]  ret_from_fork+0x3a/0x50

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-02-14 11:27       ` Juri Lelli
@ 2018-02-14 15:33         ` Mathieu Poirier
  2018-02-14 16:31           ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Mathieu Poirier @ 2018-02-14 15:33 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, cgroups, linux-kernel

On 14 February 2018 at 04:27, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 14/02/18 11:49, Juri Lelli wrote:
>> On 14/02/18 11:36, Juri Lelli wrote:
>> > Hi Mathieu,
>> >
>> > On 13/02/18 13:32, Mathieu Poirier wrote:
>> > > No synchronisation mechanism exist between the cpuset subsystem and calls
>> > > to function __sched_setscheduler().  As such it is possible that new root
>> > > domains are created on the cpuset side while a deadline acceptance test
>> > > is carried out in __sched_setscheduler(), leading to a potential oversell
>> > > of CPU bandwidth.
>> > >
>> > > By making available the cpuset_mutex to the core scheduler it is possible
>> > > to prevent situations such as the one described above from happening.
>> > >
>> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
>> > > ---
>> >
>> > [...]
>> >
>> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > > index f727c3d0064c..0d8badcf1f0f 100644
>> > > --- a/kernel/sched/core.c
>> > > +++ b/kernel/sched/core.c
>> > > @@ -4176,6 +4176,13 @@ static int __sched_setscheduler(struct task_struct *p,
>> > >   }
>> > >
>> > >   /*
>> > > +  * Make sure we don't race with the cpuset subsystem where root
>> > > +  * domains can be rebuilt or modified while operations like DL
>> > > +  * admission checks are carried out.
>> > > +  */
>> > > + cpuset_lock();
>> > > +
>> > > + /*
>> >
>> > Mmm, I'm afraid we can't do this. __sched_setscheduler might be called
>> > from interrupt contex by normalize_rt_tasks().
>>
>> Maybe conditionally grabbing it if pi is true could do? I guess we don't
>> care much about domains when sysrq.
>
> Ops.. just got this. :/


Arrghhh... Back to the drawing board.

>
> --->8---
> [    0.020203] ======================================================
> [    0.020946] WARNING: possible circular locking dependency detected
> [    0.021000] 4.16.0-rc1+ #64 Not tainted
> [    0.021000] ------------------------------------------------------
> [    0.021000] swapper/0/1 is trying to acquire lock:
> [    0.021000]  (cpu_hotplug_lock.rw_sem){.+.+}, at: [<000000007164d41d>] smpboot_register_percpu_thread_cpumask+0x2d/0x100
> [    0.021000]
> [    0.021000] but task is already holding lock:
> [    0.021000]  (cpuset_mutex){+.+.}, at: [<000000008529a52c>] __sched_setscheduler+0xb5/0x8b0
> [    0.021000]
> [    0.021000] which lock already depends on the new lock.
> [    0.021000]
> [    0.021000]
> [    0.021000] the existing dependency chain (in reverse order) is:
> [    0.021000]
> [    0.021000] -> #2 (cpuset_mutex){+.+.}:
> [    0.021000]        __sched_setscheduler+0xb5/0x8b0
> [    0.021000]        _sched_setscheduler+0x6c/0x80
> [    0.021000]        __kthread_create_on_node+0x10e/0x170
> [    0.021000]        kthread_create_on_node+0x37/0x40
> [    0.021000]        kthread_create_on_cpu+0x27/0x90
> [    0.021000]        __smpboot_create_thread.part.3+0x64/0xe0
> [    0.021000]        smpboot_register_percpu_thread_cpumask+0x91/0x100
> [    0.021000]        spawn_ksoftirqd+0x37/0x40
> [    0.021000]        do_one_initcall+0x3b/0x160
> [    0.021000]        kernel_init_freeable+0x118/0x258
> [    0.021000]        kernel_init+0xa/0x100
> [    0.021000]        ret_from_fork+0x3a/0x50
> [    0.021000]
> [    0.021000] -> #1 (smpboot_threads_lock){+.+.}:
> [    0.021000]        smpboot_register_percpu_thread_cpumask+0x3b/0x100
> [    0.021000]        spawn_ksoftirqd+0x37/0x40
> [    0.021000]        do_one_initcall+0x3b/0x160
> [    0.021000]        kernel_init_freeable+0x118/0x258
> [    0.021000]        kernel_init+0xa/0x100
> [    0.021000]        ret_from_fork+0x3a/0x50
> [    0.021000]
> [    0.021000] -> #0 (cpu_hotplug_lock.rw_sem){.+.+}:
> [    0.021000]        cpus_read_lock+0x3e/0x80
> [    0.021000]        smpboot_register_percpu_thread_cpumask+0x2d/0x100
> [    0.021000]        lockup_detector_init+0x3e/0x74
> [    0.021000]        kernel_init_freeable+0x146/0x258
> [    0.021000]        kernel_init+0xa/0x100
> [    0.021000]        ret_from_fork+0x3a/0x50
> [    0.021000]
> [    0.021000] other info that might help us debug this:
> [    0.021000]
> [    0.021000] Chain exists of:
> [    0.021000]   cpu_hotplug_lock.rw_sem --> smpboot_threads_lock --> cpuset_mutex
> [    0.021000]
> [    0.021000]  Possible unsafe locking scenario:
> [    0.021000]
> [    0.021000]        CPU0                    CPU1
> [    0.021000]        ----                    ----
> [    0.021000]   lock(cpuset_mutex);
> [    0.021000]                                lock(smpboot_threads_lock);
> [    0.021000]                                lock(cpuset_mutex);
> [    0.021000]   lock(cpu_hotplug_lock.rw_sem);
> [    0.021000]
> [    0.021000]  *** DEADLOCK ***
> [    0.021000]
> [    0.021000] 1 lock held by swapper/0/1:
> [    0.021000]  #0:  (cpuset_mutex){+.+.}, at: [<000000008529a52c>] __sched_setscheduler+0xb5/0x8b0
> [    0.021000]
> [    0.021000] stack backtrace:
> [    0.021000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc1+ #64
> [    0.021000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> [    0.021000] Call Trace:
> [    0.021000]  dump_stack+0x85/0xc5
> [    0.021000]  print_circular_bug.isra.38+0x1ce/0x1db
> [    0.021000]  __lock_acquire+0x1278/0x1320
> [    0.021000]  ? sched_clock_local+0x12/0x80
> [    0.021000]  ? lock_acquire+0x9f/0x1f0
> [    0.021000]  lock_acquire+0x9f/0x1f0
> [    0.021000]  ? smpboot_register_percpu_thread_cpumask+0x2d/0x100
> [    0.021000]  cpus_read_lock+0x3e/0x80
> [    0.021000]  ? smpboot_register_percpu_thread_cpumask+0x2d/0x100
> [    0.021000]  smpboot_register_percpu_thread_cpumask+0x2d/0x100
> [    0.021000]  ? set_debug_rodata+0x11/0x11
> [    0.021000]  lockup_detector_init+0x3e/0x74
> [    0.021000]  kernel_init_freeable+0x146/0x258
> [    0.021000]  ? rest_init+0xd0/0xd0
> [    0.021000]  kernel_init+0xa/0x100
> [    0.021000]  ret_from_fork+0x3a/0x50
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-02-14 15:33         ` Mathieu Poirier
@ 2018-02-14 16:31           ` Juri Lelli
  2018-02-15 10:33               ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2018-02-14 16:31 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, cgroups, linux-kernel

On 14/02/18 08:33, Mathieu Poirier wrote:
> On 14 February 2018 at 04:27, Juri Lelli <juri.lelli@redhat.com> wrote:
> > On 14/02/18 11:49, Juri Lelli wrote:
> >> On 14/02/18 11:36, Juri Lelli wrote:
> >> > Hi Mathieu,
> >> >
> >> > On 13/02/18 13:32, Mathieu Poirier wrote:
> >> > > No synchronisation mechanism exist between the cpuset subsystem and calls
> >> > > to function __sched_setscheduler().  As such it is possible that new root
> >> > > domains are created on the cpuset side while a deadline acceptance test
> >> > > is carried out in __sched_setscheduler(), leading to a potential oversell
> >> > > of CPU bandwidth.
> >> > >
> >> > > By making available the cpuset_mutex to the core scheduler it is possible
> >> > > to prevent situations such as the one described above from happening.
> >> > >
> >> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> >> > > ---
> >> >
> >> > [...]
> >> >
> >> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > > index f727c3d0064c..0d8badcf1f0f 100644
> >> > > --- a/kernel/sched/core.c
> >> > > +++ b/kernel/sched/core.c
> >> > > @@ -4176,6 +4176,13 @@ static int __sched_setscheduler(struct task_struct *p,
> >> > >   }
> >> > >
> >> > >   /*
> >> > > +  * Make sure we don't race with the cpuset subsystem where root
> >> > > +  * domains can be rebuilt or modified while operations like DL
> >> > > +  * admission checks are carried out.
> >> > > +  */
> >> > > + cpuset_lock();
> >> > > +
> >> > > + /*
> >> >
> >> > Mmm, I'm afraid we can't do this. __sched_setscheduler might be called
> >> > from interrupt contex by normalize_rt_tasks().
> >>
> >> Maybe conditionally grabbing it if pi is true could do? I guess we don't
> >> care much about domains when sysrq.
> >
> > Ops.. just got this. :/
> 
> 
> Arrghhh... Back to the drawing board.
> 

Eh.. even though the warning simply happens because the unlocking of the
cpuset lock is missing

--->8---
@@ -4312,6 +4312,7 @@ static int __sched_setscheduler(struct task_struct *p,                                                            
        /* Avoid rq from going away on us: */                       
        preempt_disable();        
        task_rq_unlock(rq, p, &rf);                                 
+       cpuset_unlock();          
                                  
        if (pi)                   
                rt_mutex_adjust_pi(p);
--->8---

Still grabbing it is a no-go, as do_sched_setscheduler calls
sched_setscheduler from inside an RCU read-side critical section.

So, back to the drawing board indeed. :/

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread
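
For context, the call site that makes this a problem keeps the target task pinned purely by RCU, so anything reached from sched_setscheduler() on this path must not sleep, which rules out taking cpuset_mutex. Reconstructed (abridged) from the hunk posted further down in this thread:

	/* do_sched_setscheduler(), before the change proposed below: */
	rcu_read_lock();
	retval = -ESRCH;
	p = find_process_by_pid(pid);
	if (p != NULL)
		retval = sched_setscheduler(p, policy, &lparam);
	rcu_read_unlock();

	/* sched_setscheduler() -> __sched_setscheduler() -> cpuset_lock()
	 * would mean mutex_lock() inside the RCU read-side critical
	 * section above, i.e. a potential sleep where sleeping is not
	 * allowed. */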

* Re: [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
@ 2018-02-15 10:33               ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-02-15 10:33 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, cgroups, linux-kernel

On 14/02/18 17:31, Juri Lelli wrote:

[...]

> Still grabbing it is a no-go, as do_sched_setscheduler calls
> sched_setscheduler from inside an RCU read-side critical section.

I was then actually thinking that trylocking might do.. not sure however
if failing with -EBUSY in the contended case is feasible (and about the
general ugliness of the solution :/).

--->8---
 include/linux/cpuset.h | 4 ++--
 kernel/cgroup/cpuset.c | 6 +++---
 kernel/sched/core.c    | 4 +++-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 4bbb3f5a3020..53c3f4e13cb0 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,7 +55,7 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
-extern void cpuset_lock(void);
+extern int cpuset_trylock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
@@ -178,7 +178,7 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
-static inline void cpuset_lock(void) { }
+static inline int cpuset_trylock(void) { return 1; }
 
 static inline void cpuset_unlock(void) { }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 41f8391640e6..995aac5b032d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2410,11 +2410,11 @@ void __init cpuset_init_smp(void)
 }
 
 /**
- * cpuset_lock - Grab the cpuset_mutex from another subsysytem
+ * cpuset_trylock - Try to grab the cpuset_mutex atomically from another subsystem
  */
-void cpuset_lock(void)
+int cpuset_trylock(void)
 {
-	mutex_lock(&cpuset_mutex);
+	return mutex_trylock(&cpuset_mutex);
 }
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d8badcf1f0f..b491b406a835 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4180,7 +4180,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * domains can be rebuilt or modified while operations like DL
 	 * admission checks are carried out.
 	 */
-	cpuset_lock();
+	if (!cpuset_trylock())
+		return -EBUSY;
 
 	/*
 	 * Make sure no PI-waiters arrive (or leave) while we are
@@ -4312,6 +4313,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Avoid rq from going away on us: */
 	preempt_disable();
 	task_rq_unlock(rq, p, &rf);
+	cpuset_unlock();
 
 	if (pi)
 		rt_mutex_adjust_pi(p);

^ permalink raw reply related	[flat|nested] 22+ messages in thread
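
Whether returning -EBUSY here is acceptable depends on userspace being able to treat it as transient. Under the trylock variant above, a caller setting up a SCHED_DEADLINE task would have to retry, roughly as in the sketch below; the struct layout follows the sched_setattr(2) documentation, the parameters are arbitrary, and the retry loop itself is purely hypothetical since this change is not merged.

#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the uapi struct consumed by the sched_setattr() syscall. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;   /* ns */
	uint64_t sched_deadline;  /* ns */
	uint64_t sched_period;    /* ns */
};

#define SCHED_DEADLINE 6

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 10 * 1000 * 1000;	/* 10 ms of runtime ... */
	attr.sched_deadline = 30 * 1000 * 1000;	/* ... every 30 ms */
	attr.sched_period   = 30 * 1000 * 1000;

	/* Needs CAP_SYS_NICE/root.  With the trylock variant, -EBUSY would
	 * indicate cpuset_mutex contention, so retry a few times. */
	for (int i = 0; i < 5; i++) {
		if (syscall(SYS_sched_setattr, 0, &attr, 0) == 0)
			return 0;
		if (errno != EBUSY)
			break;
	}
	perror("sched_setattr");
	return 1;
}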

* Re: [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-02-15 10:33               ` Juri Lelli
  (?)
@ 2018-02-15 11:08               ` Juri Lelli
  -1 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-02-15 11:08 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Peter Zijlstra, Li Zefan, Ingo Molnar, Steven Rostedt,
	Claudio Scordino, Daniel Bristot de Oliveira, Tommaso Cucinotta,
	luca.abeni, cgroups, linux-kernel

On 15/02/18 11:33, Juri Lelli wrote:
> On 14/02/18 17:31, Juri Lelli wrote:
> 
> [...]
> 
> > Still grabbing it is a no-go, as do_sched_setscheduler calls
> > sched_setscheduler from inside an RCU read-side critical section.
> 
> I was then actually thinking that trylocking might do.. not sure however
> if failing with -EBUSY in the contended case is feasible (and about the
> general uglyness of the solution :/).

Or, as suggested by Peter on IRC, the following (which would still
require conditional locking for the sysrq case).

--->8---
 kernel/sched/core.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d8badcf1f0f..4e9405d50cbd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4312,6 +4312,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Avoid rq from going away on us: */
 	preempt_disable();
 	task_rq_unlock(rq, p, &rf);
+	cpuset_unlock();
 
 	if (pi)
 		rt_mutex_adjust_pi(p);
@@ -4409,10 +4410,16 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setscheduler(p, policy, &lparam);
+	if (!p) {
+		rcu_read_unlock();
+		goto exit;
+	}
+	get_task_struct(p);
 	rcu_read_unlock();
+	retval = sched_setscheduler(p, policy, &lparam);
+	put_task_struct(p);
 
+exit:
 	return retval;
 }
 
@@ -4540,10 +4547,16 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setattr(p, &attr);
+	if (!p) {
+		rcu_read_unlock();
+		goto exit;
+	}
+	get_task_struct(p);
 	rcu_read_unlock();
+	retval = sched_setattr(p, &attr);
+	put_task_struct(p);
 
+exit:
 	return retval;
 }
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-02-15 11:08 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-13 20:32 [PATCH V3 00/10] sched/deadline: fix cpusets bandwidth accounting Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 01/10] sched/topology: Add check to backup comment about hotplug lock Mathieu Poirier
2018-02-13 20:32   ` Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 02/10] sched/topology: Adding function partition_sched_domains_locked() Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 03/10] sched/core: Streamlining calls to task_rq_unlock() Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 04/10] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Mathieu Poirier
2018-02-14 10:36   ` Juri Lelli
2018-02-14 10:49     ` Juri Lelli
2018-02-14 10:49       ` Juri Lelli
2018-02-14 11:27       ` Juri Lelli
2018-02-14 15:33         ` Mathieu Poirier
2018-02-14 16:31           ` Juri Lelli
2018-02-15 10:33             ` Juri Lelli
2018-02-15 10:33               ` Juri Lelli
2018-02-15 11:08               ` Juri Lelli
2018-02-13 20:32 ` [PATCH V3 05/10] cpuset: Rebuild root domain deadline accounting information Mathieu Poirier
2018-02-13 20:32   ` Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 06/10] sched/deadline: Keep new DL task within root domain's boundary Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 07/10] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 08/10] cgroup: Constrain the addition of CPUs to a new CPUset Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 09/10] sched/core: Don't change the affinity of DL tasks Mathieu Poirier
2018-02-13 20:32 ` [PATCH V3 10/10] sched/deadline: Prevent CPU hotplug operation if DL task on CPU Mathieu Poirier
