* [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting
@ 2018-06-13 12:17 Juri Lelli
  2018-06-13 12:17 ` [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock Juri Lelli
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Juri Lelli @ 2018-06-13 12:17 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

Hi,

This is v4 of a series of patches, authored by Mathieu (thanks for your
work and for allowing me to try to move this forward), with the intent
of fixing a long-standing issue with SCHED_DEADLINE bandwidth accounting.
As originally reported by Steve [1], when hotplug and/or (certain)
cpuset reconfiguration operations take place, DEADLINE bandwidth
accounting information is lost since root domains are destroyed and
recreated.

Mathieu's approach is based on restoring bandwidth accounting info on
the newly created root domains by iterating through the (DEADLINE) tasks
belonging to the configured cpuset(s).

v3 still had issues (IMHO) because __sched_setscheduler() might race
with the aforementioned restore operation (and it actually looks racy
with cpuset ops in general), but grabbing cpuset_mutex from potentially
atomic contexts is a no-go.

I reworked the v3 solution a bit, ending up with something that seems
to be working [2]. The idea is simply to trylock that mutex and return
-EBUSY to the user if we raced with cpuset ops. It's gross, but I
haven't found anything better (and working) yet. :/
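
For completeness, this is what the new -EBUSY looks like from userspace;
a minimal sketch only (struct sched_attr is defined by hand, as in the
sched_setattr(2) man page example, since glibc has no wrapper, and a
real program would probably want to bound the retry loop):

#include <errno.h>
#include <linux/sched.h>	/* SCHED_DEADLINE */
#include <linux/types.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
};

static int set_deadline(pid_t pid, __u64 runtime_ns, __u64 period_ns)
{
	struct sched_attr attr;
	int ret;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = runtime_ns;
	attr.sched_deadline = period_ns;
	attr.sched_period = period_ns;

	/* EBUSY now means "raced with a cpuset operation, try again" */
	do {
		ret = syscall(SYS_sched_setattr, pid, &attr, 0);
	} while (ret == -1 && errno == EBUSY);

	return ret;
}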

I also don't particularly like 05/05, as it introduces a lot of
DEADLINE-iness into cpuset.c. I decided not to change Mathieu's patch
for the moment and see if better approaches are suggested (a per-class
hook maybe, as sketched below, even though other classes don't suffer
from this problem, so it is still going to be DEADLINE specific).
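
To illustrate the per-class idea (entirely hypothetical, nothing like
this exists in the tree, and only DEADLINE would implement it for now):

	/*
	 * Hypothetical sched_class callback: classes that keep
	 * per-root-domain state get a chance to restore it after
	 * the root domains have been rebuilt.
	 */
	void (*rebuild_root_domain)(struct root_domain *rd);

so that the rebuild path could stay class-agnostic and simply do
something like:

	for_each_class(class)
		if (class->rebuild_root_domain)
			class->rebuild_root_domain(rd);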

I also left out Mathieu's subsequent patches to focus on this crucial
fix. They can easily come later, IMHO.

Set also available at

 https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v4

Thanks,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966
[2] compare -before (that confirms what Steve saw) with -after
    https://git.io/vhKfW

Mathieu Poirier (5):
  sched/topology: Add check to backup comment about hotplug lock
  sched/topology: Adding function partition_sched_domains_locked()
  sched/core: Streamlining calls to task_rq_unlock()
  sched/core: Prevent race condition between cpuset and
    __sched_setscheduler()
  cpuset: Rebuild root domain deadline accounting information

 include/linux/cpuset.h         |  6 ++++
 include/linux/sched.h          |  5 +++
 include/linux/sched/deadline.h |  8 +++++
 include/linux/sched/topology.h | 10 ++++++
 kernel/cgroup/cpuset.c         | 79 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c            | 38 ++++++++++++++------
 kernel/sched/deadline.c        | 31 +++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 32 ++++++++++++++---
 9 files changed, 193 insertions(+), 19 deletions(-)

-- 
2.14.3



* [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-13 12:17 [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
@ 2018-06-13 12:17 ` Juri Lelli
  2018-06-14 13:33   ` Steven Rostedt
  2018-06-13 12:17 ` [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-13 12:17 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

From: Mathieu Poirier <mathieu.poirier@linaro.org>

The comment above function partition_sched_domains() clearly states that
the cpu_hotplug_lock should be held, but doesn't mandate that one abide
by it.

Add an explicit check backing that comment, so as to make it impossible
for anyone to miss the requirement.

Suggested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
[modified changelog]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/sched/topology.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 61a1125c1ae4..96eee22fafe8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1858,6 +1858,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	int i, j, n;
 	int new_topology;
 
+	lockdep_assert_cpus_held();
 	mutex_lock(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
-- 
2.14.3



* [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked()
  2018-06-13 12:17 [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2018-06-13 12:17 ` [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock Juri Lelli
@ 2018-06-13 12:17 ` Juri Lelli
  2018-06-14 13:35   ` Steven Rostedt
  2018-06-13 12:17 ` [PATCH v4 3/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-13 12:17 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Introduce function partition_sched_domains_locked() by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 18 ++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..57997caf61b6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -162,6 +162,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -206,6 +210,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 96eee22fafe8..25a5727d3b48 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1850,16 +1850,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	int i, j, n;
 	int new_topology;
 
 	lockdep_assert_cpus_held();
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -1924,6 +1924,16 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	lockdep_assert_cpus_held();
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
-- 
2.14.3



* [PATCH v4 3/5] sched/core: Streamlining calls to task_rq_unlock()
  2018-06-13 12:17 [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2018-06-13 12:17 ` [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock Juri Lelli
  2018-06-13 12:17 ` [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2018-06-13 12:17 ` Juri Lelli
  2018-06-14 13:42   ` Steven Rostedt
  2018-06-13 12:17 ` [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
  2018-06-13 12:17 ` [PATCH v4 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
  4 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-13 12:17 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler().  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location needs to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d1555185c054..ca788f74259d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4237,8 +4237,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4254,8 +4254,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4268,8 +4268,8 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4284,8 +4284,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4304,8 +4304,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4361,6 +4361,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,
-- 
2.14.3



* [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-06-13 12:17 [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (2 preceding siblings ...)
  2018-06-13 12:17 ` [PATCH v4 3/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2018-06-13 12:17 ` Juri Lelli
  2018-06-14 13:45   ` Steven Rostedt
  2018-06-14 20:11   ` Steven Rostedt
  2018-06-13 12:17 ` [PATCH v4 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
  4 siblings, 2 replies; 23+ messages in thread
From: Juri Lelli @ 2018-06-13 12:17 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

From: Mathieu Poirier <mathieu.poirier@linaro.org>

No synchronisation mechanism exists between the cpuset subsystem and calls
to function __sched_setscheduler().  As such it is possible that new root
domains are created on the cpuset side while a deadline acceptance test
is carried out in __sched_setscheduler(), leading to a potential oversell
of CPU bandwidth.

By making the cpuset_mutex available to the core scheduler it is possible
to prevent situations such as the one described above from happening.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
[fixed missing cpuset_unlock() and changed to use mutex_trylock()]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 16 ++++++++++++++++
 kernel/sched/core.c    | 14 ++++++++++++++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..a1970862ab8e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern int cpuset_lock(void);
+extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,10 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline int cpuset_lock(void) { return 1; }
+
+static inline void cpuset_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e6e81d..d26fd4795aa3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2409,6 +2409,22 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_lock - Grab the cpuset_mutex from another subsystem
+ */
+int cpuset_lock(void)
+{
+	return mutex_trylock(&cpuset_mutex);
+}
+
+/**
+ * cpuset_unlock - Release the cpuset_mutex from another subsystem
+ */
+void cpuset_unlock(void)
+{
+	mutex_unlock(&cpuset_mutex);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca788f74259d..a5b0c6c25b44 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4218,6 +4218,14 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (attr->sched_flags & SCHED_FLAG_SUGOV)
 			return -EINVAL;
 
+		/*
+		 * Make sure we don't race with the cpuset subsystem where root
+		 * domains can be rebuilt or modified while operations like DL
+		 * admission checks are carried out.
+		 */
+		if (!cpuset_lock())
+			return -EBUSY;
+
 		retval = security_task_setscheduler(p);
 		if (retval)
 			return retval;
@@ -4295,6 +4303,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
 		task_rq_unlock(rq, p, &rf);
+		if (user)
+			cpuset_unlock();
 		goto recheck;
 	}
 
@@ -4352,6 +4362,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Avoid rq from going away on us: */
 	preempt_disable();
 	task_rq_unlock(rq, p, &rf);
+	if (user)
+		cpuset_unlock();
 
 	if (pi)
 		rt_mutex_adjust_pi(p);
@@ -4364,6 +4376,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
+	if (user)
+		cpuset_unlock();
 	return retval;
 }
 
-- 
2.14.3



* [PATCH v4 5/5] cpuset: Rebuild root domain deadline accounting information
  2018-06-13 12:17 [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (3 preceding siblings ...)
  2018-06-13 12:17 ` [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2018-06-13 12:17 ` Juri Lelli
  4 siblings, 0 replies; 23+ messages in thread
From: Juri Lelli @ 2018-06-13 12:17 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

When the topology of root domains is modified by cpuset or CPU hotplug
operations, information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information: it iterates through the deadline tasks held in
cpusets and adds their current load to the root domain they are
associated with.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched.h          |  5 ++++
 include/linux/sched/deadline.h |  8 ++++++
 kernel/cgroup/cpuset.c         | 63 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 31 +++++++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 13 ++++++++-
 6 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 28ff3ca9f752..f7fcc903e1a1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -278,6 +278,11 @@ struct vtime {
 	u64			gtime;
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d26fd4795aa3..0ca10418ddf6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -44,6 +44,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -812,6 +813,66 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+/*
+ * Called with cpuset_mutex held (rebuild_sched_domains())
+ * Called with hotplug lock held (rebuild_sched_domains_locked())
+ * Called with sched_domains_mutex held (partition_and_rebuild_sched_domains())
+ */
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -844,7 +905,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1356afd1eeb6..7be2c89b0645 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2275,6 +2275,37 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+	dl_b->total_bw += p->dl.dl_bw;
+
+	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4d9ac7..83bfacbd13d4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -723,9 +723,6 @@ struct root_domain {
 	unsigned long		max_cpu_capacity;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 25a5727d3b48..fb39cee51b14 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1884,8 +1884,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j])
-			    && dattrs_equal(dattr_cur, i, dattr_new, j))
+			    && dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.14.3



* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-13 12:17 ` [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock Juri Lelli
@ 2018-06-14 13:33   ` Steven Rostedt
  2018-06-14 13:42     ` Juri Lelli
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2018-06-14 13:33 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Wed, 13 Jun 2018 14:17:07 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> From: Mathieu Poirier <mathieu.poirier@linaro.org>
> 
> The comment above function partition_sched_domains() clearly states that
> the cpu_hotplug_lock should be held, but doesn't mandate that one abide
> by it.
> 
> Add an explicit check backing that comment, so as to make it impossible
> for anyone to miss the requirement.
> 
> Suggested-by: Juri Lelli <juri.lelli@redhat.com>
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> [modified changelog]
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

> ---
>  kernel/sched/topology.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 61a1125c1ae4..96eee22fafe8 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1858,6 +1858,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  	int i, j, n;
>  	int new_topology;
>  
> +	lockdep_assert_cpus_held();
>  	mutex_lock(&sched_domains_mutex);
>  
>  	/* Always unregister in case we don't destroy any domains: */



* Re: [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked()
  2018-06-13 12:17 ` [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2018-06-14 13:35   ` Steven Rostedt
  2018-06-14 13:47     ` Juri Lelli
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2018-06-14 13:35 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Wed, 13 Jun 2018 14:17:08 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> From: Mathieu Poirier <mathieu.poirier@linaro.org>
> 
> Introduce function partition_sched_domains_locked() by taking
> the mutex locking code out of the original function.  That way
> the work done by partition_sched_domains_locked() can be reused
> without dropping the mutex lock.
> 
> No change of functionality is introduced by this patch.
> 
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> ---
>  include/linux/sched/topology.h | 10 ++++++++++
>  kernel/sched/topology.c        | 18 ++++++++++++++----
>  2 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 26347741ba50..57997caf61b6 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -162,6 +162,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>  	return to_cpumask(sd->span);
>  }
>  
> +extern void partition_sched_domains_locked(int ndoms_new,
> +					   cpumask_var_t doms_new[],
> +					   struct sched_domain_attr *dattr_new);
> +
>  extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  				    struct sched_domain_attr *dattr_new);
>  
> @@ -206,6 +210,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
>  
>  struct sched_domain_attr;
>  
> +static inline void
> +partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
> +			       struct sched_domain_attr *dattr_new)
> +{
> +}
> +
>  static inline void
>  partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  			struct sched_domain_attr *dattr_new)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 96eee22fafe8..25a5727d3b48 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1850,16 +1850,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
>   * ndoms_new == 0 is a special case for destroying existing domains,
>   * and it will not create the default domain.
>   *
> - * Call with hotplug lock held
> + * Call with hotplug lock and sched_domains_mutex held
>   */
> -void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> -			     struct sched_domain_attr *dattr_new)
> +void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
> +				    struct sched_domain_attr *dattr_new)
>  {
>  	int i, j, n;
>  	int new_topology;
>  
>  	lockdep_assert_cpus_held();
> -	mutex_lock(&sched_domains_mutex);
> +	lockdep_assert_held(&sched_domains_mutex);
>  
>  	/* Always unregister in case we don't destroy any domains: */
>  	unregister_sched_domain_sysctl();
> @@ -1924,6 +1924,16 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  	ndoms_cur = ndoms_new;
>  
>  	register_sched_domain_sysctl();
> +}
>  
> +/*
> + * Call with hotplug lock held
> + */
> +void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> +			     struct sched_domain_attr *dattr_new)
> +{
> +	lockdep_assert_cpus_held();

Is the above assert really necessary? The assert will happen in
partition_sched_domains_locked() anyway.

-- Steve

> +	mutex_lock(&sched_domains_mutex);
> +	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
>  	mutex_unlock(&sched_domains_mutex);
>  }



* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 13:33   ` Steven Rostedt
@ 2018-06-14 13:42     ` Juri Lelli
  2018-06-14 13:47       ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-14 13:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 14/06/18 09:33, Steven Rostedt wrote:
> On Wed, 13 Jun 2018 14:17:07 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > From: Mathieu Poirier <mathieu.poirier@linaro.org>
> > 
> > The comment above function partition_sched_domains() clearly states that
> > the cpu_hotplug_lock should be held, but doesn't mandate that one abide
> > by it.
> > 
> > Add an explicit check backing that comment, so as to make it impossible
> > for anyone to miss the requirement.
> > 
> > Suggested-by: Juri Lelli <juri.lelli@redhat.com>
> > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > [modified changelog]
> > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> 
> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> 
> -- Steve
> 
> > ---
> >  kernel/sched/topology.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 61a1125c1ae4..96eee22fafe8 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1858,6 +1858,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> >  	int i, j, n;
> >  	int new_topology;
> >  
> > +	lockdep_assert_cpus_held();
> >  	mutex_lock(&sched_domains_mutex);
> >  
> >  	/* Always unregister in case we don't destroy any domains: */

Thanks for reviewing, Steve. I thought this should be fine (also looking
at the comment before the function), but then got this from 0day

[    2.206129] .................................... done.
[    2.206662] Using IPI No-Shortcut mode
[    2.207084] sched_clock: Marking stable (2200014216, 0)->(2435550722, -235536506)
[    2.208792] registered taskstats version 1
[    2.209251] page_owner is disabled
[    2.290164] WARNING: CPU: 0 PID: 13 at kernel/cpu.c:311 lockdep_assert_cpus_held+0x22/0x26
[    2.291253] Modules linked in:
[    2.291589] CPU: 0 PID: 13 Comm: cpuhp/0 Not tainted 4.17.0-rc6-00217-gebf44b4 #2
[    2.292367] EIP: lockdep_assert_cpus_held+0x22/0x26
[    2.292882] EFLAGS: 00210246 CPU: 0
[    2.293255] EAX: 00000000 EBX: 00000000 ECX: c1b1fd98 EDX: 00200286
[    2.293915] ESI: 00000000 EDI: c1b1fcb4 EBP: dd8a3e60 ESP: dd8a3e60
[    2.294578]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    2.295128] CR0: 80050033 CR2: 00000000 CR3: 01c9d000 CR4: 000006d0
[    2.295770] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    2.296409] DR6: fffe0ff0 DR7: 00000400
[    2.296802] Call Trace:
[    2.297066]  partition_sched_domains+0x1b/0x29f
[    2.297557]  ? dl_cpu_busy+0x1b5/0x1c6
[    2.297955]  sched_cpu_deactivate+0x8b/0xb6
[    2.298403]  ? call_rcu_bh+0x17/0x17
[    2.298786]  ? call_rcu_bh+0x17/0x17
[    2.299170]  ? rcu_read_lock_bh_held+0x55/0x55
[    2.299648]  ? sched_cpu_activate+0xcc/0xcc
[    2.300076]  cpuhp_invoke_callback+0x19d/0x978
[    2.300537]  cpuhp_thread_fun+0x125/0x16b
[    2.300949]  smpboot_thread_fn+0x192/0x1a6
[    2.301377]  kthread+0xfc/0x101
[    2.301703]  ? sort_range+0x1d/0x1d
[    2.302062]  ? kthread_flush_work_fn+0x12/0x12
[    2.302524]  ret_from_fork+0x2e/0x40
[    2.302891] Code: ff 83 c4 10 5b 5e 5f 5d c3 55 89 e5 e8 ca dc fe ff 83 3d 20 bc b4 c1 00 74 13 83 ca ff b8 98 fd b1 c1 e8 e2 4b 04 00 85 c0 75 02 <0f> 0b 5d c3 55 89 e5 56 53 e8 a2 dc fe ff 89 c6 3b 05 8c 40 bd 
[    2.304975] irq event stamp: 308
[    2.305335] hardirqs last  enabled at (307): [<c179d83c>] _raw_spin_unlock_irqrestore+0x4a/0x68
[    2.306195] hardirqs last disabled at (308): [<c179ed38>] common_exception+0x46/0x6e
[    2.306969] softirqs last  enabled at (0): [<c10448e6>] copy_process+0x582/0x16c5
[    2.307720] softirqs last disabled at (0): [<00000000>]   (null)
[    2.308328] ---[ end trace 7197acb79221f1b9 ]---

Guess early boot is "special"? :-/
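
For reference, lockdep_assert_cpus_held() is simply the following
(paraphrasing kernel/cpu.c, so modulo my reading):

void lockdep_assert_cpus_held(void)
{
	percpu_rwsem_assert_held(&cpu_hotplug_lock);
}

and since lockdep tracks held locks per task, I guess the cpuhp/0
kthread, running hotplug callbacks on behalf of whoever initiated the
operation, wouldn't count as holding cpu_hotplug_lock, even if the lock
is actually held at that point.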

Best,

- Juri


* Re: [PATCH v4 3/5] sched/core: Streamlining calls to task_rq_unlock()
  2018-06-13 12:17 ` [PATCH v4 3/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2018-06-14 13:42   ` Steven Rostedt
  0 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2018-06-14 13:42 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Wed, 13 Jun 2018 14:17:09 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> From: Mathieu Poirier <mathieu.poirier@linaro.org>
> 
> Calls to task_rq_unlock() are done several times in function
> __sched_setscheduler().  This is fine when only the rq lock needs to be
> handled but not so much when other locks come into play.
> 
> This patch streamlines the release of the rq lock so that only one
> location needs to be modified when dealing with more than one lock.
> 
> No change of functionality is introduced by this patch.
> 
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve


* Re: [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-06-13 12:17 ` [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2018-06-14 13:45   ` Steven Rostedt
  2018-06-14 13:51     ` Juri Lelli
  2018-06-14 20:11   ` Steven Rostedt
  1 sibling, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2018-06-14 13:45 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Wed, 13 Jun 2018 14:17:10 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b42037e6e81d..d26fd4795aa3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2409,6 +2409,22 @@ void __init cpuset_init_smp(void)
>  	BUG_ON(!cpuset_migrate_mm_wq);
>  }
>  
> +/**
> + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> + */
> +int cpuset_lock(void)

Shouldn't this be called "cpuset_trylock()"? Otherwise one may think
that it will always return with the cpuset_mutex locked.

-- Steve


> +{
> +	return mutex_trylock(&cpuset_mutex);
> +}
> +
> +/**
> + * cpuset_unlock - Release the cpuset_mutex from another subsystem
> + */


* Re: [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked()
  2018-06-14 13:35   ` Steven Rostedt
@ 2018-06-14 13:47     ` Juri Lelli
  0 siblings, 0 replies; 23+ messages in thread
From: Juri Lelli @ 2018-06-14 13:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 14/06/18 09:35, Steven Rostedt wrote:
> On Wed, 13 Jun 2018 14:17:08 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:

[...]

> > +/*
> > + * Call with hotplug lock held
> > + */
> > +void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > +			     struct sched_domain_attr *dattr_new)
> > +{
> > +	lockdep_assert_cpus_held();
> 
> Is the above assert really necessary? The assert will happen in
> partition_sched_domains_locked() anyway.

Indeed, it seems of little use.

Thanks,

- Juri


* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 13:42     ` Juri Lelli
@ 2018-06-14 13:47       ` Steven Rostedt
  2018-06-14 13:50         ` Juri Lelli
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2018-06-14 13:47 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thu, 14 Jun 2018 15:42:34 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> On 14/06/18 09:33, Steven Rostedt wrote:
> > On Wed, 13 Jun 2018 14:17:07 +0200
> > Juri Lelli <juri.lelli@redhat.com> wrote:
> >   
> > > From: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > 
> > > The comment above function partition_sched_domains() clearly states that
> > > the cpu_hotplug_lock should be held, but doesn't mandate that one abide
> > > by it.
> > > 
> > > Add an explicit check backing that comment, so as to make it impossible
> > > for anyone to miss the requirement.
> > > 
> > > Suggested-by: Juri Lelli <juri.lelli@redhat.com>
> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > [modified changelog]
> > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>  
> > 
> > Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> > 
> > -- Steve
> >   
> > > ---
> > >  kernel/sched/topology.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index 61a1125c1ae4..96eee22fafe8 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -1858,6 +1858,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > >  	int i, j, n;
> > >  	int new_topology;
> > >  
> > > +	lockdep_assert_cpus_held();
> > >  	mutex_lock(&sched_domains_mutex);
> > >  
> > >  	/* Always unregister in case we don't destroy any domains: */  
> 
> Thanks for reviewing Steve. I thought this should be fine (also looking
> at the comment before the function), but then got this from 0day
> 
> [    2.206129] .................................... done.
> [    2.206662] Using IPI No-Shortcut mode
> [    2.207084] sched_clock: Marking stable (2200014216, 0)->(2435550722, -235536506)
> [    2.208792] registered taskstats version 1
> [    2.209251] page_owner is disabled
> [    2.290164] WARNING: CPU: 0 PID: 13 at kernel/cpu.c:311 lockdep_assert_cpus_held+0x22/0x26
> [    2.291253] Modules linked in:
> [    2.291589] CPU: 0 PID: 13 Comm: cpuhp/0 Not tainted 4.17.0-rc6-00217-gebf44b4 #2
> [    2.292367] EIP: lockdep_assert_cpus_held+0x22/0x26
> [    2.292882] EFLAGS: 00210246 CPU: 0
> [    2.293255] EAX: 00000000 EBX: 00000000 ECX: c1b1fd98 EDX: 00200286
> [    2.293915] ESI: 00000000 EDI: c1b1fcb4 EBP: dd8a3e60 ESP: dd8a3e60
> [    2.294578]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [    2.295128] CR0: 80050033 CR2: 00000000 CR3: 01c9d000 CR4: 000006d0
> [    2.295770] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [    2.296409] DR6: fffe0ff0 DR7: 00000400
> [    2.296802] Call Trace:
> [    2.297066]  partition_sched_domains+0x1b/0x29f
> [    2.297557]  ? dl_cpu_busy+0x1b5/0x1c6
> [    2.297955]  sched_cpu_deactivate+0x8b/0xb6
> [    2.298403]  ? call_rcu_bh+0x17/0x17
> [    2.298786]  ? call_rcu_bh+0x17/0x17
> [    2.299170]  ? rcu_read_lock_bh_held+0x55/0x55
> [    2.299648]  ? sched_cpu_activate+0xcc/0xcc
> [    2.300076]  cpuhp_invoke_callback+0x19d/0x978
> [    2.300537]  cpuhp_thread_fun+0x125/0x16b
> [    2.300949]  smpboot_thread_fn+0x192/0x1a6
> [    2.301377]  kthread+0xfc/0x101
> [    2.301703]  ? sort_range+0x1d/0x1d
> [    2.302062]  ? kthread_flush_work_fn+0x12/0x12
> [    2.302524]  ret_from_fork+0x2e/0x40
> [    2.302891] Code: ff 83 c4 10 5b 5e 5f 5d c3 55 89 e5 e8 ca dc fe ff 83 3d 20 bc b4 c1 00 74 13 83 ca ff b8 98 fd b1 c1 e8 e2 4b 04 00 85 c0 75 02 <0f> 0b 5d c3 55 89 e5 56 53 e8 a2 dc fe ff 89 c6 3b 05 8c 40 bd 
> [    2.304975] irq event stamp: 308
> [    2.305335] hardirqs last  enabled at (307): [<c179d83c>] _raw_spin_unlock_irqrestore+0x4a/0x68
> [    2.306195] hardirqs last disabled at (308): [<c179ed38>] common_exception+0x46/0x6e
> [    2.306969] softirqs last  enabled at (0): [<c10448e6>] copy_process+0x582/0x16c5
> [    2.307720] softirqs last disabled at (0): [<00000000>]   (null)
> [    2.308328] ---[ end trace 7197acb79221f1b9 ]---
> 
> Guess early boot is "special"? :-/

I'm guessing the comment is wrong. I just added the first three patches
and was going to try to trigger the first "bug" I had before. Let me
check if I have lockdep enabled and see if I see the same thing.

-- Steve



* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 13:47       ` Steven Rostedt
@ 2018-06-14 13:50         ` Juri Lelli
  2018-06-14 13:58           ` Quentin Perret
  0 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-14 13:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 14/06/18 09:47, Steven Rostedt wrote:
> On Thu, 14 Jun 2018 15:42:34 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > On 14/06/18 09:33, Steven Rostedt wrote:
> > > On Wed, 13 Jun 2018 14:17:07 +0200
> > > Juri Lelli <juri.lelli@redhat.com> wrote:
> > >   
> > > > From: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > > 
> > > > The comment above function partition_sched_domains() clearly states that
> > > > the cpu_hotplug_lock should be held, but doesn't mandate that one abide
> > > > by it.
> > > > 
> > > > Add an explicit check backing that comment, so as to make it impossible
> > > > for anyone to miss the requirement.
> > > > 
> > > > Suggested-by: Juri Lelli <juri.lelli@redhat.com>
> > > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > > [modified changelog]
> > > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>  
> > > 
> > > Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> > > 
> > > -- Steve
> > >   
> > > > ---
> > > >  kernel/sched/topology.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index 61a1125c1ae4..96eee22fafe8 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -1858,6 +1858,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > > >  	int i, j, n;
> > > >  	int new_topology;
> > > >  
> > > > +	lockdep_assert_cpus_held();
> > > >  	mutex_lock(&sched_domains_mutex);
> > > >  
> > > >  	/* Always unregister in case we don't destroy any domains: */  
> > 
> > Thanks for reviewing, Steve. I thought this should be fine (also looking
> > at the comment before the function), but then got this from 0day
> > 
> > [    2.206129] .................................... done.
> > [    2.206662] Using IPI No-Shortcut mode
> > [    2.207084] sched_clock: Marking stable (2200014216, 0)->(2435550722, -235536506)
> > [    2.208792] registered taskstats version 1
> > [    2.209251] page_owner is disabled
> > [    2.290164] WARNING: CPU: 0 PID: 13 at kernel/cpu.c:311 lockdep_assert_cpus_held+0x22/0x26
> > [    2.291253] Modules linked in:
> > [    2.291589] CPU: 0 PID: 13 Comm: cpuhp/0 Not tainted 4.17.0-rc6-00217-gebf44b4 #2
> > [    2.292367] EIP: lockdep_assert_cpus_held+0x22/0x26
> > [    2.292882] EFLAGS: 00210246 CPU: 0
> > [    2.293255] EAX: 00000000 EBX: 00000000 ECX: c1b1fd98 EDX: 00200286
> > [    2.293915] ESI: 00000000 EDI: c1b1fcb4 EBP: dd8a3e60 ESP: dd8a3e60
> > [    2.294578]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > [    2.295128] CR0: 80050033 CR2: 00000000 CR3: 01c9d000 CR4: 000006d0
> > [    2.295770] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > [    2.296409] DR6: fffe0ff0 DR7: 00000400
> > [    2.296802] Call Trace:
> > [    2.297066]  partition_sched_domains+0x1b/0x29f
> > [    2.297557]  ? dl_cpu_busy+0x1b5/0x1c6
> > [    2.297955]  sched_cpu_deactivate+0x8b/0xb6
> > [    2.298403]  ? call_rcu_bh+0x17/0x17
> > [    2.298786]  ? call_rcu_bh+0x17/0x17
> > [    2.299170]  ? rcu_read_lock_bh_held+0x55/0x55
> > [    2.299648]  ? sched_cpu_activate+0xcc/0xcc
> > [    2.300076]  cpuhp_invoke_callback+0x19d/0x978
> > [    2.300537]  cpuhp_thread_fun+0x125/0x16b
> > [    2.300949]  smpboot_thread_fn+0x192/0x1a6
> > [    2.301377]  kthread+0xfc/0x101
> > [    2.301703]  ? sort_range+0x1d/0x1d
> > [    2.302062]  ? kthread_flush_work_fn+0x12/0x12
> > [    2.302524]  ret_from_fork+0x2e/0x40
> > [    2.302891] Code: ff 83 c4 10 5b 5e 5f 5d c3 55 89 e5 e8 ca dc fe ff 83 3d 20 bc b4 c1 00 74 13 83 ca ff b8 98 fd b1 c1 e8 e2 4b 04 00 85 c0 75 02 <0f> 0b 5d c3 55 89 e5 56 53 e8 a2 dc fe ff 89 c6 3b 05 8c 40 bd 
> > [    2.304975] irq event stamp: 308
> > [    2.305335] hardirqs last  enabled at (307): [<c179d83c>] _raw_spin_unlock_irqrestore+0x4a/0x68
> > [    2.306195] hardirqs last disabled at (308): [<c179ed38>] common_exception+0x46/0x6e
> > [    2.306969] softirqs last  enabled at (0): [<c10448e6>] copy_process+0x582/0x16c5
> > [    2.307720] softirqs last disabled at (0): [<00000000>]   (null)
> > [    2.308328] ---[ end trace 7197acb79221f1b9 ]---
> > 
> > Guess early boot is "special"? :-/
> 
> I'm guessing the comment is wrong. I just added the first three patches
> and was going to try to trigger the first "bug" I had before. Let me
> check if I have lockdep enabled and see if I see the same thing.

Mmm, I had it enabled while testing and didn't get the splat.


* Re: [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-06-14 13:45   ` Steven Rostedt
@ 2018-06-14 13:51     ` Juri Lelli
  0 siblings, 0 replies; 23+ messages in thread
From: Juri Lelli @ 2018-06-14 13:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 14/06/18 09:45, Steven Rostedt wrote:
> On Wed, 13 Jun 2018 14:17:10 +0200
> Ju
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index b42037e6e81d..d26fd4795aa3 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -2409,6 +2409,22 @@ void __init cpuset_init_smp(void)
> >  	BUG_ON(!cpuset_migrate_mm_wq);
> >  }
> >  
> > +/**
> > + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> > + */
> > +int cpuset_lock(void)
> 
> Shouldn't this be called "cpuset_trylock()"? Otherwise one may think
> that it will always return with the cpuset_mutex locked.

Sure.

Thanks,

- Juri


* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 13:50         ` Juri Lelli
@ 2018-06-14 13:58           ` Quentin Perret
  2018-06-14 14:11             ` Juri Lelli
  0 siblings, 1 reply; 23+ messages in thread
From: Quentin Perret @ 2018-06-14 13:58 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thursday 14 Jun 2018 at 15:50:40 (+0200), Juri Lelli wrote:
> On 14/06/18 09:47, Steven Rostedt wrote:
> > On Thu, 14 Jun 2018 15:42:34 +0200
> > Juri Lelli <juri.lelli@redhat.com> wrote:
> > 
> > > On 14/06/18 09:33, Steven Rostedt wrote:
> > > > On Wed, 13 Jun 2018 14:17:07 +0200
> > > > Juri Lelli <juri.lelli@redhat.com> wrote:
> > > >   
> > > > > From: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > > > 
> > > > > The comment above function partition_sched_domains() clearly states that
> > > > > the cpu_hotplug_lock should be held, but doesn't mandate that one abide
> > > > > by it.
> > > > > 
> > > > > Add an explicit check backing that comment, so as to make it impossible
> > > > > for anyone to miss the requirement.
> > > > > 
> > > > > Suggested-by: Juri Lelli <juri.lelli@redhat.com>
> > > > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > > > [modified changelog]
> > > > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>  
> > > > 
> > > > Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> > > > 
> > > > -- Steve
> > > >   
> > > > > ---
> > > > >  kernel/sched/topology.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > > index 61a1125c1ae4..96eee22fafe8 100644
> > > > > --- a/kernel/sched/topology.c
> > > > > +++ b/kernel/sched/topology.c
> > > > > @@ -1858,6 +1858,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > > > >  	int i, j, n;
> > > > >  	int new_topology;
> > > > >  
> > > > > +	lockdep_assert_cpus_held();
> > > > >  	mutex_lock(&sched_domains_mutex);
> > > > >  
> > > > >  	/* Always unregister in case we don't destroy any domains: */  
> > > 
> > > Thanks for reviewing, Steve. I thought this should be fine (also looking
> > > at the comment before the function), but then got this from 0day
> > > 
> > > [    2.206129] .................................... done.
> > > [    2.206662] Using IPI No-Shortcut mode
> > > [    2.207084] sched_clock: Marking stable (2200014216, 0)->(2435550722, -235536506)
> > > [    2.208792] registered taskstats version 1
> > > [    2.209251] page_owner is disabled
> > > [    2.290164] WARNING: CPU: 0 PID: 13 at kernel/cpu.c:311 lockdep_assert_cpus_held+0x22/0x26
> > > [    2.291253] Modules linked in:
> > > [    2.291589] CPU: 0 PID: 13 Comm: cpuhp/0 Not tainted 4.17.0-rc6-00217-gebf44b4 #2
> > > [    2.292367] EIP: lockdep_assert_cpus_held+0x22/0x26
> > > [    2.292882] EFLAGS: 00210246 CPU: 0
> > > [    2.293255] EAX: 00000000 EBX: 00000000 ECX: c1b1fd98 EDX: 00200286
> > > [    2.293915] ESI: 00000000 EDI: c1b1fcb4 EBP: dd8a3e60 ESP: dd8a3e60
> > > [    2.294578]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > > [    2.295128] CR0: 80050033 CR2: 00000000 CR3: 01c9d000 CR4: 000006d0
> > > [    2.295770] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > > [    2.296409] DR6: fffe0ff0 DR7: 00000400
> > > [    2.296802] Call Trace:
> > > [    2.297066]  partition_sched_domains+0x1b/0x29f
> > > [    2.297557]  ? dl_cpu_busy+0x1b5/0x1c6
> > > [    2.297955]  sched_cpu_deactivate+0x8b/0xb6
> > > [    2.298403]  ? call_rcu_bh+0x17/0x17
> > > [    2.298786]  ? call_rcu_bh+0x17/0x17
> > > [    2.299170]  ? rcu_read_lock_bh_held+0x55/0x55
> > > [    2.299648]  ? sched_cpu_activate+0xcc/0xcc
> > > [    2.300076]  cpuhp_invoke_callback+0x19d/0x978
> > > [    2.300537]  cpuhp_thread_fun+0x125/0x16b
> > > [    2.300949]  smpboot_thread_fn+0x192/0x1a6
> > > [    2.301377]  kthread+0xfc/0x101
> > > [    2.301703]  ? sort_range+0x1d/0x1d
> > > [    2.302062]  ? kthread_flush_work_fn+0x12/0x12
> > > [    2.302524]  ret_from_fork+0x2e/0x40
> > > [    2.302891] Code: ff 83 c4 10 5b 5e 5f 5d c3 55 89 e5 e8 ca dc fe ff 83 3d 20 bc b4 c1 00 74 13 83 ca ff b8 98 fd b1 c1 e8 e2 4b 04 00 85 c0 75 02 <0f> 0b 5d c3 55 89 e5 56 53 e8 a2 dc fe ff 89 c6 3b 05 8c 40 bd 
> > > [    2.304975] irq event stamp: 308
> > > [    2.305335] hardirqs last  enabled at (307): [<c179d83c>] _raw_spin_unlock_irqrestore+0x4a/0x68
> > > [    2.306195] hardirqs last disabled at (308): [<c179ed38>] common_exception+0x46/0x6e
> > > [    2.306969] softirqs last  enabled at (0): [<c10448e6>] copy_process+0x582/0x16c5
> > > [    2.307720] softirqs last disabled at (0): [<00000000>]   (null)
> > > [    2.308328] ---[ end trace 7197acb79221f1b9 ]---
> > > 
> > > Guess early boot is "special"? :-/
> > 
> > I'm guessing the comment is wrong. I just added the first three patches
> > and was going to try to trigger the first "bug" I had before. Let me
> > check if I have lockdep enabled and see if I see the same thing.
> 
> Mmm, I had it enabled while testing and didn't get the splat.

Hmm not sure if this can help but I think that rebuild_sched_domains()
does _not_ take the hotplug lock before calling partition_sched_domains()
when CONFIG_CPUSETS=n. But it does take it for CONFIG_CPUSETS=y.

Not sure if this is actually the right thing to do, but maybe worth
investigating?

Thanks,
Quentin


* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 13:58           ` Quentin Perret
@ 2018-06-14 14:11             ` Juri Lelli
  2018-06-14 14:18               ` Quentin Perret
  0 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-14 14:11 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Steven Rostedt, peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 14/06/18 14:58, Quentin Perret wrote:

[...]

> Hmm not sure if this can help but I think that rebuild_sched_domains()
> does _not_ take the hotplug lock before calling partition_sched_domains()
> when CONFIG_CPUSETS=n. But it does take it for CONFIG_CPUSETS=y.

Did you mean cpuset_mutex?

cpu_hotplug_lock seems to be the problem here..

Can't seem to follow otherwise. :-(


* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 14:11             ` Juri Lelli
@ 2018-06-14 14:18               ` Quentin Perret
  2018-06-14 14:30                 ` Juri Lelli
  0 siblings, 1 reply; 23+ messages in thread
From: Quentin Perret @ 2018-06-14 14:18 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thursday 14 Jun 2018 at 16:11:18 (+0200), Juri Lelli wrote:
> On 14/06/18 14:58, Quentin Perret wrote:
> 
> [...]
> 
> > Hmm not sure if this can help but I think that rebuild_sched_domains()
> > does _not_ take the hotplug lock before calling partition_sched_domains()
> > when CONFIG_CPUSETS=n. But it does take it for CONFIG_CPUSETS=y.
> 
> Did you mean cpuset_mutex?

Nope, I really meant the cpu_hotplug_lock!

With CONFIG_CPUSETS=n, rebuild_sched_domains() calls
partition_sched_domains() directly:

https://elixir.bootlin.com/linux/latest/source/include/linux/cpuset.h#L255

But with CONFIG_CPUSETS=y, rebuild_sched_domains() calls
rebuild_sched_domains_locked(), which calls get_online_cpus(), which
calls cpus_read_lock(), which does percpu_down_read(&cpu_hotplug_lock).
And all that happens before calling partition_sched_domains().
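
Side by side, roughly (paraphrasing the sources, so modulo details):

/* include/linux/cpuset.h, CONFIG_CPUSETS=n: no locking whatsoever */
static inline void rebuild_sched_domains(void)
{
	partition_sched_domains(1, NULL, NULL);
}

/* kernel/cgroup/cpuset.c, CONFIG_CPUSETS=y: cpuset_mutex here, plus
 * the hotplug lock inside rebuild_sched_domains_locked() */
void rebuild_sched_domains(void)
{
	mutex_lock(&cpuset_mutex);
	rebuild_sched_domains_locked();	/* does get_online_cpus() */
	mutex_unlock(&cpuset_mutex);
}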

So yeah, the point I was trying to make is that there is an inconsistency
here, maybe for a good reason? Maybe related to the issue you're seeing?

Thanks,
Quentin


* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 14:18               ` Quentin Perret
@ 2018-06-14 14:30                 ` Juri Lelli
  2018-06-15  8:39                   ` Quentin Perret
  0 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-14 14:30 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Steven Rostedt, peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 14/06/18 15:18, Quentin Perret wrote:
> On Thursday 14 Jun 2018 at 16:11:18 (+0200), Juri Lelli wrote:
> > On 14/06/18 14:58, Quentin Perret wrote:
> > 
> > [...]
> > 
> > > Hmm not sure if this can help but I think that rebuild_sched_domains()
> > > does _not_ take the hotplug lock before calling partition_sched_domains()
> > > when CONFIG_CPUSETS=n. But it does take it for CONFIG_CPUSETS=y.
> > 
> > Did you mean cpuset_mutex?
> 
> Nope, I really meant the cpu_hotplug_lock!
> 
> With CONFIG_CPUSETS=n, rebuild_sched_domains() calls
> partition_sched_domains() directly:
> 
> https://elixir.bootlin.com/linux/latest/source/include/linux/cpuset.h#L255
> 
> But with CONFIG_CPUSETS=y, rebuild_sched_domains() calls
> rebuild_sched_domains_locked(), which calls get_online_cpus(), which
> calls cpus_read_lock(), which does percpu_down_read(&cpu_hotplug_lock).
> And all of that happens before partition_sched_domains() is called.

Ah, right!
 
> So yeah, the point I was trying to make is that there is an inconsistency
> here, maybe for a good reason? Maybe it's related to the issue you're seeing?

The config that came with the 0day splat was indeed CONFIG_CPUSETS=n.

So, in this case, IIUC, we hit the !doms_new branch of
partition_sched_domains(), which uses cpu_active_mask (and
cpu_possible_mask indirectly). Should this still be protected by the
hotplug lock, then?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-06-13 12:17 ` [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
  2018-06-14 13:45   ` Steven Rostedt
@ 2018-06-14 20:11   ` Steven Rostedt
  2018-06-15  7:01     ` Juri Lelli
  1 sibling, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2018-06-14 20:11 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Wed, 13 Jun 2018 14:17:10 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> +/**
> + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> + */
> +int cpuset_lock(void)
> +{
> +	return mutex_trylock(&cpuset_mutex);
> +}
> +
> +/**
> + * cpuset_unlock - Release the cpuset_mutex from another subsystem
> + */
> +void cpuset_unlock(void)
> +{
> +	mutex_unlock(&cpuset_mutex);
> +}
> +
>  /**
>   * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
>   * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca788f74259d..a5b0c6c25b44 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4218,6 +4218,14 @@ static int __sched_setscheduler(struct task_struct *p,
>  		if (attr->sched_flags & SCHED_FLAG_SUGOV)
>  			return -EINVAL;
>  
> +		/*
> +		 * Make sure we don't race with the cpuset subsystem where root
> +		 * domains can be rebuilt or modified while operations like DL
> +		 * admission checks are carried out.
> +		 */
> +		if (!cpuset_lock())
> +			return -EBUSY;
> +

How important is it for this to be a trylock? The reason I am asking is
that it is making my deadline test fail to get the resources. My test
creates a bunch of threads, and they try to get the deadline resources
as they are created. This happens in parallel. Thus, when one thread
gets the cpuset_lock, the others hit this path and return -EBUSY. At
first I thought I was somehow over-committing, but after adding a
trace_printk() in cpuset_lock(), this failure case made it obvious what
was happening:

   deadline_test-1376  [004]   107.179693: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017fceeed0, flags: 0x00000000
   deadline_test-1376  [004]   107.179697: bputs:                cpuset_lock: cpuset_mutex trylock
   deadline_test-1377  [003]   107.179763: sys_exit_futex:       0x0
   deadline_test-1377  [003]   107.179767: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017f4eded0, flags: 0x00000000
   deadline_test-1377  [003]   107.179771: bputs:                cpuset_lock: cpuset_mutex trylock
   deadline_test-1377  [003]   107.179771: bputs:                __sched_setscheduler: cpuset_lock
   deadline_test-1377  [003]   107.179773: sys_exit_sched_setattr: 0xfffffffffffffff0
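
(0xfffffffffffffff0 is -16, i.e. -EBUSY. For illustration, a minimal
sketch of this kind of test, hypothetical and not the actual
deadline_test source, assuming struct sched_attr and SCHED_DEADLINE
come from the installed uapi headers:)

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* SCHED_DEADLINE */
#include <linux/sched/types.h>	/* struct sched_attr */

static void *worker(void *arg)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 10 * 1000 * 1000,	/* 10ms */
		.sched_deadline	= 30 * 1000 * 1000,	/* 30ms */
		.sched_period	= 30 * 1000 * 1000,	/* 30ms */
	};

	(void)arg;
	/* No glibc wrapper, so raw syscall; pid 0 == current thread. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");	/* EBUSY with the trylock */
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	/* All threads race to do DL admission in parallel. */
	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}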

-- Steve

>  		retval = security_task_setscheduler(p);
>  		if (retval)
>  			return retval;


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-06-14 20:11   ` Steven Rostedt
@ 2018-06-15  7:01     ` Juri Lelli
  2018-06-15 13:07       ` Juri Lelli
  0 siblings, 1 reply; 23+ messages in thread
From: Juri Lelli @ 2018-06-15  7:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 14/06/18 16:11, Steven Rostedt wrote:
> On Wed, 13 Jun 2018 14:17:10 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > +/**
> > + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> > + */
> > +int cpuset_lock(void)
> > +{
> > +	return mutex_trylock(&cpuset_mutex);
> > +}
> > +
> > +/**
> > + * cpuset_unlock - Release the cpuset_mutex from another subsystem
> > + */
> > +void cpuset_unlock(void)
> > +{
> > +	mutex_unlock(&cpuset_mutex);
> > +}
> > +
> >  /**
> >   * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
> >   * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ca788f74259d..a5b0c6c25b44 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4218,6 +4218,14 @@ static int __sched_setscheduler(struct task_struct *p,
> >  		if (attr->sched_flags & SCHED_FLAG_SUGOV)
> >  			return -EINVAL;
> >  
> > +		/*
> > +		 * Make sure we don't race with the cpuset subsystem where root
> > +		 * domains can be rebuilt or modified while operations like DL
> > +		 * admission checks are carried out.
> > +		 */
> > +		if (!cpuset_lock())
> > +			return -EBUSY;
> > +
> 
> How important is it for this to be a trylock? The reason I am asking is
> that it is making my deadline test fail to get the resources. My test
> creates a bunch of threads, and they try to get the deadline resources
> as they are created. This happens in parallel. Thus, when one thread
> gets the cpuset_lock, the others hit this path and return -EBUSY. At
> first I thought I was somehow over-committing, but after adding a
> trace_printk() in cpuset_lock(), this failure case made it obvious what
> was happening:
> 
>    deadline_test-1376  [004]   107.179693: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017fceeed0, flags: 0x00000000
>    deadline_test-1376  [004]   107.179697: bputs:                cpuset_lock: cpuset_mutex trylock
>    deadline_test-1377  [003]   107.179763: sys_exit_futex:       0x0
>    deadline_test-1377  [003]   107.179767: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017f4eded0, flags: 0x00000000
>    deadline_test-1377  [003]   107.179771: bputs:                cpuset_lock: cpuset_mutex trylock
>    deadline_test-1377  [003]   107.179771: bputs:                __sched_setscheduler: cpuset_lock
>    deadline_test-1377  [003]   107.179773: sys_exit_sched_setattr: 0xfffffffffffffff0

Right, not nice.

So, even if we can rework sched_setattr() and do_sched_setscheduler() to
end up calling __sched_setscheduler() outside of RCU read_lock sections
(as Peter suggested), it seems that we might still get into trouble, e.g.:

[   39.890366] ======================================================
[   39.892741] WARNING: possible circular locking dependency detected
[   39.894723] 4.17.0-rc6-before+ #169 Not tainted
[   39.896380] ------------------------------------------------------
[   39.898369] bash/2206 is trying to acquire lock:
[   39.899862] 0000000083b596bc (cpu_hotplug_lock.rw_sem){++++}, at: rebuild_sched_domains_locked+0x29/0x660
[   39.902931]
[   39.902931] but task is already holding lock:
[   39.904790] 000000004fc7a567 (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
[   39.907263]
[   39.907263] which lock already depends on the new lock.
[   39.907263]
[   39.909936]
[   39.909936] the existing dependency chain (in reverse order) is:
[   39.912479]
[   39.912479] -> #2 (cpuset_mutex){+.+.}:
[   39.914249]        __sched_setscheduler+0xc6/0x880
[   39.915779]        _sched_setscheduler+0x77/0xa0
[   39.917270]        __kthread_create_on_node+0x122/0x1a0
[   39.918948]        kthread_create_on_node+0x46/0x70
[   39.920512]        init_rescuer.part.8+0x49/0xa0
[   39.921688]        workqueue_init+0x1ec/0x2d5
[   39.922659]        kernel_init_freeable+0x23b/0x5b9
[   39.923756]        kernel_init+0xa/0x110
[   39.924656]        ret_from_fork+0x3a/0x50
[   39.925570]
[   39.925570] -> #1 (wq_pool_mutex){+.+.}:
[   39.926591]        apply_workqueue_attrs+0x20/0x50
[   39.927554]        __alloc_workqueue_key+0x1c7/0x490
[   39.928700]        workqueue_init_early+0x257/0x33f
[   39.929685]        start_kernel+0x2fe/0x531
[   39.930530]        secondary_startup_64+0xa5/0xb0
[   39.931336]
[   39.931336] -> #0 (cpu_hotplug_lock.rw_sem){++++}:
[   39.932195]        cpus_read_lock+0x39/0xa0
[   39.932772]        rebuild_sched_domains_locked+0x29/0x660
[   39.933513]        update_flag+0x1fd/0x210
[   39.934075]        cpuset_write_u64+0xe7/0x140
[   39.934708]        cgroup_file_write+0x174/0x230
[   39.935350]        kernfs_fop_write+0x113/0x1a0
[   39.935991]        __vfs_write+0x36/0x180
[   39.936473]        vfs_write+0xc5/0x1b0
[   39.936991]        ksys_write+0x55/0xc0
[   39.937504]        do_syscall_64+0x73/0x4d0
[   39.938061]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   39.938803]
[   39.938803] other info that might help us debug this:
[   39.938803]
[   39.939708] Chain exists of:
[   39.939708]   cpu_hotplug_lock.rw_sem --> wq_pool_mutex --> cpuset_mutex
[   39.939708]
[   39.940960]  Possible unsafe locking scenario:
[   39.940960]
[   39.941607]        CPU0                    CPU1
[   39.942056]        ----                    ----
[   39.942523]   lock(cpuset_mutex);
[   39.942858]                                lock(wq_pool_mutex);
[   39.943436]                                lock(cpuset_mutex);
[   39.944028]   lock(cpu_hotplug_lock.rw_sem);
[   39.944449]
[   39.944449]  *** DEADLOCK ***
[   39.944449]
[   39.945113] 4 locks held by bash/2206:
[   39.945485]  #0: 000000003cc231db (sb_writers#10){.+.+}, at: vfs_write+0x193/0x1b0
[   39.946246]  #1: 00000000fdc2c059 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[   39.946980]  #2: 00000000e12e3a5d (kn->count#11){.+.+}, at: kernfs_fop_write+0xeb/0x1a0
[   39.947736]  #3: 000000004fc7a567 (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
[   39.948463]
[   39.948463] stack backtrace:
[   39.948871] CPU: 0 PID: 2206 Comm: bash Not tainted 4.17.0-rc6-before+ #169
[   39.949489] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   39.950271] Call Trace:
[   39.950502]  dump_stack+0x67/0x9b
[   39.950810]  print_circular_bug.isra.24+0x1c8/0x2b0
[   39.951252]  __lock_acquire+0x1813/0x18e0
[   39.951618]  ? __lock_acquire+0x1219/0x18e0
[   39.952016]  ? lock_acquire+0x8e/0x240
[   39.952357]  lock_acquire+0x8e/0x240
[   39.952683]  ? rebuild_sched_domains_locked+0x29/0x660
[   39.953179]  cpus_read_lock+0x39/0xa0
[   39.953516]  ? rebuild_sched_domains_locked+0x29/0x660
[   39.953981]  rebuild_sched_domains_locked+0x29/0x660
[   39.954431]  ? mark_held_locks+0x56/0x80
[   39.954790]  ? _raw_spin_unlock_irq+0x29/0x50
[   39.955186]  update_flag+0x1fd/0x210
[   39.955541]  cpuset_write_u64+0xe7/0x140
[   39.955900]  cgroup_file_write+0x174/0x230
[   39.956274]  kernfs_fop_write+0x113/0x1a0
[   39.956640]  __vfs_write+0x36/0x180
[   39.956963]  ? rcu_read_lock_sched_held+0x74/0x80
[   39.957482]  ? rcu_sync_lockdep_assert+0x2e/0x60
[   39.957969]  ? __sb_start_write+0x154/0x1f0
[   39.958415]  ? __sb_start_write+0x16a/0x1f0
[   39.958877]  vfs_write+0xc5/0x1b0
[   39.959310]  ksys_write+0x55/0xc0
[   39.959696]  do_syscall_64+0x73/0x4d0
[   39.960054]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   39.960533]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   39.961071] RIP: 0033:0x7f534866d780
[   39.961477] RSP: 002b:00007fff54dfa398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   39.962380] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f534866d780
[   39.963166] RDX: 0000000000000002 RSI: 0000000000d97408 RDI: 0000000000000001
[   39.964008] RBP: 0000000000d97408 R08: 000000000000000a R09: 00007f5348f6d700
[   39.964780] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f53489237a0
[   39.965588] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000

This happens the moment one disables sched_load_balance at the root level.

I'll try harder to find alternatives, but suggestions are welcome! :-)

Best,

- Juri

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock
  2018-06-14 14:30                 ` Juri Lelli
@ 2018-06-15  8:39                   ` Quentin Perret
  0 siblings, 0 replies; 23+ messages in thread
From: Quentin Perret @ 2018-06-15  8:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thursday 14 Jun 2018 at 16:30:37 (+0200), Juri Lelli wrote:
> On 14/06/18 15:18, Quentin Perret wrote:
> > On Thursday 14 Jun 2018 at 16:11:18 (+0200), Juri Lelli wrote:
> > > On 14/06/18 14:58, Quentin Perret wrote:
> > > 
> > > [...]
> > > 
> > > > Hmm not sure if this can help but I think that rebuild_sched_domains()
> > > > does _not_ take the hotplug lock before calling partition_sched_domains()
> > > > when CONFIG_CPUSETS=n. But it does take it for CONFIG_CPUSETS=y.
> > > 
> > > Did you mean cpuset_mutex?
> > 
> > Nope, I really meant the cpu_hotplug_lock!
> > 
> > With CONFIG_CPUSETS=n, rebuild_sched_domains() calls
> > partition_sched_domains() directly:
> > 
> > https://elixir.bootlin.com/linux/latest/source/include/linux/cpuset.h#L255
> > 
> > But with CONFIG_CPUSETS=y, rebuild_sched_domains() calls
> > rebuild_sched_domains_locked(), which calls get_online_cpus(), which
> > calls cpus_read_lock(), which does percpu_down_read(&cpu_hotplug_lock).
> > And all of that happens before partition_sched_domains() is called.
> 
> Ah, right!
>  
> > So yeah, the point I was trying to make is that there is an inconsistency
> > here, maybe for a good reason? Maybe it's related to the issue you're seeing?
> 
> The config that came with the 0day splat was indeed CONFIG_CPUSETS=n.
> 
> So, in this case, IIUC, we hit the !doms_new branch of
> partition_sched_domains(), which uses cpu_active_mask (and
> cpu_possible_mask indirectly). Should this still be protected by the
> hotplug lock, then?

Hmm, I'm not sure... But looking at your call trace, it seems that the
issue happens when sched_cpu_deactivate() is called (not sure why this
is called during boot, BTW?), which calls cpuset_update_active_cpus().

And again, for CONFIG_CPUSETS=n, that defaults to a raw call to
partition_sched_domains(), but with ndoms_new=1, and no lock taken.
I'm still not sure whether this is done like that for a good reason, or
if this is actually an issue that this patch caught nicely...
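
(For reference, the CONFIG_CPUSETS=n stub in include/linux/cpuset.h is
roughly:)

static inline void cpuset_update_active_cpus(void)
{
	partition_sched_domains(1, NULL, NULL);
}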

Quentin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-06-15  7:01     ` Juri Lelli
@ 2018-06-15 13:07       ` Juri Lelli
  0 siblings, 0 replies; 23+ messages in thread
From: Juri Lelli @ 2018-06-15 13:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 15/06/18 09:01, Juri Lelli wrote:

[...]

> I'll try harder to find alternatives, but suggestions are welcome! :-)

I wonder if something like the following might actually work. IIUC the
cpuset.c comment [1], callback_lock is the one to actually take if one
only needs to query cpusets.

[1] https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L266
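
Note that the diff below also converts callback_lock from spinlock_t to
raw_spinlock_t, since cpuset_lock() ends up being taken inside
task_rq_lock(), i.e. under a raw spinlock with interrupts disabled;
presumably a non-raw spinlock could not safely nest there, at least on
PREEMPT_RT, where spinlock_t can sleep.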

--->8---
 include/linux/cpuset.h |  4 +--
 kernel/cgroup/cpuset.c | 72 +++++++++++++++++++++++++-------------------------
 kernel/sched/core.c    | 24 +++++++----------
 3 files changed, 48 insertions(+), 52 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a1970862ab8e..4bbb3f5a3020 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,7 +55,7 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
-extern int cpuset_lock(void);
+extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
@@ -178,7 +178,7 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
-static inline int cpuset_lock(void) { return 1; }
+static inline void cpuset_lock(void) { }
 
 static inline void cpuset_unlock(void) { }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d26fd4795aa3..d5a0b4ec31af 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -288,7 +288,7 @@ static struct cpuset top_cpuset = {
  */
 
 static DEFINE_MUTEX(cpuset_mutex);
-static DEFINE_SPINLOCK(callback_lock);
+static DEFINE_RAW_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
 
@@ -921,9 +921,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cpumask_copy(cp->effective_cpus, new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -988,9 +988,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->cpus_allowed as a temp variable */
 	update_cpumasks_hier(cs, trialcs->cpus_allowed);
@@ -1174,9 +1174,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -1244,9 +1244,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
@@ -1337,9 +1337,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
 		rebuild_sched_domains_locked();
@@ -1754,7 +1754,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	cpuset_filetype_t type = seq_cft(sf)->private;
 	int ret = 0;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -1773,7 +1773,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		ret = -EINVAL;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	return ret;
 }
 
@@ -1988,12 +1988,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
 	cpuset_inc();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -2020,12 +2020,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	}
 	rcu_read_unlock();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	return 0;
@@ -2064,7 +2064,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
 	mutex_lock(&cpuset_mutex);
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
 		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -2075,7 +2075,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -2173,12 +2173,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 {
 	bool is_empty;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, new_cpus);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/*
 	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
@@ -2215,10 +2215,10 @@ hotplug_update_tasks(struct cpuset *cs,
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
 		update_tasks_cpumask(cs);
@@ -2311,21 +2311,21 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		update_tasks_nodemask(&top_cpuset);
 	}
 
@@ -2412,9 +2412,9 @@ void __init cpuset_init_smp(void)
 /**
 * cpuset_lock - Grab the cpuset_mutex from another subsystem
  */
-int cpuset_lock(void)
+void cpuset_lock(void)
 {
-	return mutex_trylock(&cpuset_mutex);
+	raw_spin_lock(&callback_lock);
 }
 
 /**
@@ -2422,7 +2422,7 @@ int cpuset_lock(void)
  */
 void cpuset_unlock(void)
 {
-	mutex_unlock(&cpuset_mutex);
+	raw_spin_unlock(&callback_lock);
 }
 
 /**
@@ -2440,11 +2440,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_cpus(task_cs(tsk), pmask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 }
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
@@ -2492,11 +2492,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	nodemask_t mask;
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &mask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
 }
@@ -2588,14 +2588,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 
 	rcu_read_lock();
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 	rcu_read_unlock();
 
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a5b0c6c25b44..9c5285cc082c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4218,14 +4218,6 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (attr->sched_flags & SCHED_FLAG_SUGOV)
 			return -EINVAL;
 
-		/*
-		 * Make sure we don't race with the cpuset subsystem where root
-		 * domains can be rebuilt or modified while operations like DL
-		 * admission checks are carried out.
-		 */
-		if (!cpuset_lock())
-			return -EBUSY;
-
 		retval = security_task_setscheduler(p);
 		if (retval)
 			return retval;
@@ -4241,6 +4233,13 @@ static int __sched_setscheduler(struct task_struct *p,
 	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
+	/*
+	 * Make sure we don't race with the cpuset subsystem where root
+	 * domains can be rebuilt or modified while operations like DL
+	 * admission checks are carried out.
+	 */
+	cpuset_lock();
+
 	/*
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
@@ -4302,9 +4301,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Re-check policy now with rq lock held: */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
+		cpuset_unlock();
 		task_rq_unlock(rq, p, &rf);
-		if (user)
-			cpuset_unlock();
 		goto recheck;
 	}
 
@@ -4361,9 +4359,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	/* Avoid rq from going away on us: */
 	preempt_disable();
+	cpuset_unlock();
 	task_rq_unlock(rq, p, &rf);
-	if (user)
-		cpuset_unlock();
 
 	if (pi)
 		rt_mutex_adjust_pi(p);
@@ -4375,9 +4372,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	return 0;
 
 unlock:
+	cpuset_unlock();
 	task_rq_unlock(rq, p, &rf);
-	if (user)
-		cpuset_unlock();
 	return retval;
 }

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-06-15 13:07 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-13 12:17 [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2018-06-13 12:17 ` [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock Juri Lelli
2018-06-14 13:33   ` Steven Rostedt
2018-06-14 13:42     ` Juri Lelli
2018-06-14 13:47       ` Steven Rostedt
2018-06-14 13:50         ` Juri Lelli
2018-06-14 13:58           ` Quentin Perret
2018-06-14 14:11             ` Juri Lelli
2018-06-14 14:18               ` Quentin Perret
2018-06-14 14:30                 ` Juri Lelli
2018-06-15  8:39                   ` Quentin Perret
2018-06-13 12:17 ` [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
2018-06-14 13:35   ` Steven Rostedt
2018-06-14 13:47     ` Juri Lelli
2018-06-13 12:17 ` [PATCH v4 3/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
2018-06-14 13:42   ` Steven Rostedt
2018-06-13 12:17 ` [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
2018-06-14 13:45   ` Steven Rostedt
2018-06-14 13:51     ` Juri Lelli
2018-06-14 20:11   ` Steven Rostedt
2018-06-15  7:01     ` Juri Lelli
2018-06-15 13:07       ` Juri Lelli
2018-06-13 12:17 ` [PATCH v4 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
