[PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting
@ 2019-06-28  8:06 Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

Hi,

v8 of a series of patches, originally authored by Mathieu, with the intent
of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
As originally reported by Steve [1], when hotplug and/or (certain)
cpuset reconfiguration operations take place, DEADLINE bandwidth
accounting information is lost since root domains are destroyed and
recreated.

Mathieu's approach is based on restoring bandwidth accounting info on
the newly created root domains by iterating through the (DEADLINE) tasks
belonging to the configured cpuset(s).

Apart from the usual rebase on top of cgroup/for-next, this version
brings in an important difference w.r.t. v7: cpuset_mutex conversion to
percpu_rwsem (as suggested by Peter) to deal with a performance
regression caused by the fact that grabbing the (v7) callback_lock raw
spinlock from sched_setscheduler() was in fact a bottleneck. The
conversion required some lock order changes and an rcu related mod.

So, this is (unfortunately) more than a small update. Hope it still
makes sense, though.

Set also available at

 https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v8

Thanks,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966

Juri Lelli (6):
  cpuset: Rebuild root domain deadline accounting information
  sched/deadline: Fix bandwidth accounting at all levels after offline
    migration
  cgroup/cpuset: convert cpuset_mutex to percpu_rwsem
  cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  sched/core: Prevent race condition between cpuset and
    __sched_setscheduler()
  rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region

Mathieu Poirier (2):
  sched/topology: Adding function partition_sched_domains_locked()
  sched/core: Streamlining calls to task_rq_unlock()

 include/linux/cgroup.h         |   1 +
 include/linux/cpuset.h         |  13 ++-
 include/linux/sched.h          |   5 +
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |  10 ++
 kernel/cgroup/cgroup.c         |   2 +-
 kernel/cgroup/cpuset.c         | 163 +++++++++++++++++++++++++--------
 kernel/rcu/tree.c              |   6 +-
 kernel/sched/core.c            |  57 ++++++++----
 kernel/sched/deadline.c        |  63 +++++++++++++
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  30 +++++-
 12 files changed, 290 insertions(+), 71 deletions(-)

-- 
2.17.2

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v8 1/8] sched/topology: Adding function partition_sched_domains_locked()
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Introducing function partition_sched_domains_locked() by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 17 +++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cfc0a89a7159..d7166f8c0215 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -161,6 +161,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -213,6 +217,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..362c383ec4bd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2159,16 +2159,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	bool __maybe_unused has_eas = false;
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -2251,6 +2251,15 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v8 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler().  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location need to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..acd6a9fe85bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4222,8 +4222,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4239,8 +4239,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4253,8 +4253,8 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4289,8 +4289,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4346,6 +4346,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v8 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/cgroup.h         |  1 +
 include/linux/sched.h          |  5 +++
 include/linux/sched/deadline.h |  8 +++++
 kernel/cgroup/cgroup.c         |  2 +-
 kernel/cgroup/cpuset.c         | 64 +++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 30 ++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 13 ++++++-
 8 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3745ecdad925..107b8d5943bc 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -150,6 +150,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
+void cgroup_enable_task_cg_lists(void);
 void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..f74738953e70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -281,6 +281,11 @@ struct vtime {
 	u64			gtime;
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f582414e15ba..d356905044a2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1879,7 +1879,7 @@ static int cgroup_reconfigure(struct fs_context *fc)
  */
 static bool use_task_css_set_links __read_mostly;
 
-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
 {
 	struct task_struct *p, *g;
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 95da64cc8732..48d29a6112cb 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -45,6 +45,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -947,6 +948,67 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	lockdep_assert_held(&cpuset_mutex);
+	lockdep_assert_cpus_held();
+	lockdep_assert_held(&sched_domains_mutex);
+
+	cgroup_enable_task_cg_lists();
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -984,7 +1046,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43901fa3f269..4cedcf8d6b03 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2283,6 +2283,36 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+
+	raw_spin_unlock(&dl_b->lock);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1ada0be..8607ceb11e8a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,9 +783,6 @@ struct root_domain {
 	struct perf_domain __rcu *pd;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 362c383ec4bd..9fc6ad3c341f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2193,8 +2193,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
-			    dattrs_equal(dattr_cur, i, dattr_new, j))
+			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v8 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (2 preceding siblings ...)
  2019-06-28  8:06 ` [PATCH v8 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up, for this_bw, running_bw and total_bw, when replenishment
timer fires and task is migrated (dl_task_offline_migration()).

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4cedcf8d6b03..f0166ab8c6b4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -529,6 +529,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
 static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
 {
 	struct rq *later_rq = NULL;
+	struct dl_bw *dl_b;
 
 	later_rq = find_lock_later_rq(p, rq);
 	if (!later_rq) {
@@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 		double_lock_balance(rq, later_rq);
 	}
 
+	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
+		/*
+		 * Inactive timer is armed (or callback is running, but
+		 * waiting for us to release rq locks). In any case, when it
+		 * will file (or continue), it will see running_bw of this
+		 * task migrated to later_rq (and correctly handle it).
+		 */
+		sub_running_bw(&p->dl, &rq->dl);
+		sub_rq_bw(&p->dl, &rq->dl);
+
+		add_rq_bw(&p->dl, &later_rq->dl);
+		add_running_bw(&p->dl, &later_rq->dl);
+	} else {
+		sub_rq_bw(&p->dl, &rq->dl);
+		add_rq_bw(&p->dl, &later_rq->dl);
+	}
+
+	/*
+	 * And we finally need to fixup root_domain(s) bandwidth accounting,
+	 * since p is still hanging out in the old (now moved to default) root
+	 * domain.
+	 */
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
+	dl_b = &later_rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
 	set_task_cpu(p, later_rq->cpu);
 	double_unlock_balance(later_rq, rq);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v8 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (3 preceding siblings ...)
  2019-06-28  8:06 ` [PATCH v8 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-06-28 12:45   ` Peter Zijlstra
  2019-06-28  8:06 ` [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

Holding cpuset_mutex means that cpusets are stable (only the holder can
make changes) and this is required for fixing a synchronization issue
between cpusets and scheduler core.  However, grabbing cpuset_mutex from
setscheduler() hotpath (as implemented in a later patch) is a no-go, as
it would create a bottleneck for tasks concurrently calling
setscheduler().

Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that
setscheduler() will then be able to read lock it and avoid concurrency
issues.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/cgroup/cpuset.c | 68 ++++++++++++++++++++++--------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 48d29a6112cb..a7c0c8d8f132 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -333,7 +333,7 @@ static struct cpuset top_cpuset = {
  * guidelines for accessing subsystem state in kernel/cgroup.c
  */
 
-static DEFINE_MUTEX(cpuset_mutex);
+DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
 static DEFINE_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
@@ -966,7 +966,7 @@ static void rebuild_root_domains(void)
 	struct cpuset *cs = NULL;
 	struct cgroup_subsys_state *pos_css;
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 	lockdep_assert_cpus_held();
 	lockdep_assert_held(&sched_domains_mutex);
 
@@ -1026,7 +1026,7 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	int ndoms;
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 	get_online_cpus();
 
 	/*
@@ -1058,9 +1058,9 @@ static void rebuild_sched_domains_locked(void)
 
 void rebuild_sched_domains(void)
 {
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	rebuild_sched_domains_locked();
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /**
@@ -1166,7 +1166,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	bool part_error = false;	/* Partition error? */
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 
 	/*
 	 * The parent must be a partition root.
@@ -2154,7 +2154,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_read(&cpuset_rwsem);
 
 	/* allow moving tasks into an empty cpuset if on default hierarchy */
 	ret = -ENOSPC;
@@ -2178,7 +2178,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cs->attach_in_progress++;
 	ret = 0;
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_read(&cpuset_rwsem);
 	return ret;
 }
 
@@ -2188,9 +2188,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_first(tset, &css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_read(&cpuset_rwsem);
 	css_cs(css)->attach_in_progress--;
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_read(&cpuset_rwsem);
 }
 
 /*
@@ -2213,7 +2213,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* prepare for attach */
 	if (cs == &top_cpuset)
@@ -2267,7 +2267,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	if (!cs->attach_in_progress)
 		wake_up(&cpuset_attach_wq);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2298,7 +2298,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = 0;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
 		goto out_unlock;
@@ -2334,7 +2334,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return retval;
 }
 
@@ -2345,7 +2345,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = -ENODEV;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2358,7 +2358,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return retval;
 }
 
@@ -2397,7 +2397,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	kernfs_break_active_protection(of->kn);
 	flush_work(&cpuset_hotplug_work);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2421,7 +2421,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
 	free_cpuset(trialcs);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
 	flush_workqueue(cpuset_migrate_mm_wq);
@@ -2552,13 +2552,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		return -EINVAL;
 
 	css_get(&cs->css);
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
 	retval = update_prstate(cs, val);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	css_put(&cs->css);
 	return retval ?: nbytes;
 }
@@ -2764,7 +2764,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (!parent)
 		return 0;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	set_bit(CS_ONLINE, &cs->flags);
 	if (is_spread_page(parent))
@@ -2815,7 +2815,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return 0;
 }
 
@@ -2834,7 +2834,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	if (is_partition_root(cs))
 		update_prstate(cs, 0);
@@ -2853,7 +2853,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)
@@ -2865,7 +2865,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
@@ -2878,7 +2878,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 	}
 
 	spin_unlock_irq(&callback_lock);
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /*
@@ -2922,6 +2922,8 @@ int __init cpuset_init(void)
 {
 	int err = 0;
 
+	BUG_ON(percpu_init_rwsem(&cpuset_rwsem));
+
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
@@ -2997,7 +2999,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
 		   nodes_empty(cs->mems_allowed);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/*
 	 * Move tasks to the nearest ancestor with execution resources,
@@ -3007,7 +3009,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	if (is_empty)
 		remove_tasks_in_empty_cpuset(cs);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 }
 
 static void
@@ -3057,14 +3059,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 retry:
 	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/*
 	 * We have raced with task attaching. We wait until attaching
 	 * is finished, so we won't attach a task to an empty cpuset.
 	 */
 	if (cs->attach_in_progress) {
-		mutex_unlock(&cpuset_mutex);
+		percpu_up_write(&cpuset_rwsem);
 		goto retry;
 	}
 
@@ -3132,7 +3134,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /**
@@ -3162,7 +3164,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	if (on_dfl && !alloc_cpumasks(NULL, &tmp))
 		ptmp = &tmp;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* fetch the available cpus/mems and find out which changed how */
 	cpumask_copy(&new_cpus, cpu_active_mask);
@@ -3212,7 +3214,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		update_tasks_nodemask(&top_cpuset);
 	}
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/* if cpus or mems changed, we need to propagate to descendants */
 	if (cpus_updated || mems_updated) {
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (4 preceding siblings ...)
  2019-06-28  8:06 ` [PATCH v8 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-06-28 13:03   ` Peter Zijlstra
  2019-06-28  8:06 ` [PATCH v8 7/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
  2019-06-28  8:06 ` [PATCH v8 8/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region Juri Lelli
  7 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

cpuset_rwsem is going to be acquired from sched_setscheduler() with a
following patch. There are however paths (e.g., spawn_ksoftirqd) in
which sched_scheduler() is eventually called while holding hotplug lock;
this creates a dependecy between hotplug lock (to be always acquired
first) and cpuset_rwsem (to be always acquired after hotplug lock).

Fix paths which currently take the two locks in the wrong order (after
a following patch is applied).

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/cpuset.h |  8 ++++----
 kernel/cgroup/cpuset.c | 22 +++++++++++++++++-----
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..7f1478c26a33 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -40,14 +40,14 @@ static inline bool cpusets_enabled(void)
 
 static inline void cpuset_inc(void)
 {
-	static_branch_inc(&cpusets_pre_enable_key);
-	static_branch_inc(&cpusets_enabled_key);
+	static_branch_inc_cpuslocked(&cpusets_pre_enable_key);
+	static_branch_inc_cpuslocked(&cpusets_enabled_key);
 }
 
 static inline void cpuset_dec(void)
 {
-	static_branch_dec(&cpusets_enabled_key);
-	static_branch_dec(&cpusets_pre_enable_key);
+	static_branch_dec_cpuslocked(&cpusets_enabled_key);
+	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
 extern int cpuset_init(void);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a7c0c8d8f132..d92b351f89e3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1026,8 +1026,8 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	int ndoms;
 
+	lockdep_assert_cpus_held();
 	percpu_rwsem_assert_held(&cpuset_rwsem);
-	get_online_cpus();
 
 	/*
 	 * We have raced with CPU hotplug. Don't do anything to avoid
@@ -1036,19 +1036,17 @@ static void rebuild_sched_domains_locked(void)
 	 */
 	if (!top_cpuset.nr_subparts_cpus &&
 	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	if (top_cpuset.nr_subparts_cpus &&
 	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
 	partition_and_rebuild_sched_domains(ndoms, doms, attr);
-out:
-	put_online_cpus();
 }
 #else /* !CONFIG_SMP */
 static void rebuild_sched_domains_locked(void)
@@ -1058,9 +1056,11 @@ static void rebuild_sched_domains_locked(void)
 
 void rebuild_sched_domains(void)
 {
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	rebuild_sched_domains_locked();
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 /**
@@ -2298,6 +2298,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = 0;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
@@ -2335,6 +2336,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	}
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2345,6 +2347,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = -ENODEV;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2359,6 +2362,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	}
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2397,6 +2401,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	kernfs_break_active_protection(of->kn);
 	flush_work(&cpuset_hotplug_work);
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2422,6 +2427,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	free_cpuset(trialcs);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
 	flush_workqueue(cpuset_migrate_mm_wq);
@@ -2552,6 +2558,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		return -EINVAL;
 
 	css_get(&cs->css);
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2559,6 +2566,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 	retval = update_prstate(cs, val);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	css_put(&cs->css);
 	return retval ?: nbytes;
 }
@@ -2764,6 +2772,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (!parent)
 		return 0;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 
 	set_bit(CS_ONLINE, &cs->flags);
@@ -2816,6 +2825,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	spin_unlock_irq(&callback_lock);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return 0;
 }
 
@@ -2834,6 +2844,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 
 	if (is_partition_root(cs))
@@ -2854,6 +2865,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	clear_bit(CS_ONLINE, &cs->flags);
 
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v8 7/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (5 preceding siblings ...)
  2019-06-28  8:06 ` [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-07-01 19:11   ` Peter Zijlstra
  2019-06-28  8:06 ` [PATCH v8 8/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region Juri Lelli
  7 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

No synchronisation mechanism exists between the cpuset subsystem and
calls to function __sched_setscheduler(). As such, it is possible that
new root domains are created on the cpuset side while a deadline
acceptance test is carried out in __sched_setscheduler(), leading to a
potential oversell of CPU bandwidth.

Grab cpuset_rwsem read lock from core scheduler, so to prevent
situations such as the one described above from happening.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v7->v8: use a percpu_rwsem read lock to avoid hotpath bottleneck issues
---
 include/linux/cpuset.h |  5 +++++
 kernel/cgroup/cpuset.c | 11 +++++++++++
 kernel/sched/core.c    | 33 ++++++++++++++++++++++++++-------
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 7f1478c26a33..04c20de66afc 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_lock(void);
+extern void cpuset_read_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,9 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_lock(void) { }
+static inline void cpuset_read_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d92b351f89e3..f47b5c5c3c8b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -334,6 +334,17 @@ static struct cpuset top_cpuset = {
  */
 
 DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
+
+void cpuset_read_lock(void)
+{
+	percpu_down_read(&cpuset_rwsem);
+}
+
+void cpuset_read_unlock(void)
+{
+	percpu_up_read(&cpuset_rwsem);
+}
+
 static DEFINE_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index acd6a9fe85bc..18e151c6b48d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4208,6 +4208,9 @@ static int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
+	if (pi)
+		cpuset_read_lock();
+
 	/*
 	 * Make sure no PI-waiters arrive (or leave) while we are
 	 * changing the priority of the task:
@@ -4280,6 +4283,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
 		task_rq_unlock(rq, p, &rf);
+		if (pi)
+			cpuset_read_unlock();
 		goto recheck;
 	}
 
@@ -4338,8 +4343,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_disable();
 	task_rq_unlock(rq, p, &rf);
 
-	if (pi)
+	if (pi) {
+		cpuset_read_unlock();
 		rt_mutex_adjust_pi(p);
+	}
 
 	/* Run balance callbacks after we've adjusted the PI chain: */
 	balance_callback(rq);
@@ -4349,6 +4356,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
+	if (pi)
+		cpuset_read_unlock();
 	return retval;
 }
 
@@ -4433,10 +4442,15 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setscheduler(p, policy, &lparam);
+	if (!p) {
+		rcu_read_unlock();
+		goto exit;
+	}
+	get_task_struct(p);
 	rcu_read_unlock();
-
+	retval = sched_setscheduler(p, policy, &lparam);
+	put_task_struct(p);
+exit:
 	return retval;
 }
 
@@ -4564,10 +4578,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setattr(p, &attr);
+	if (!p) {
+		rcu_read_unlock();
+		goto exit;
+	}
+	get_task_struct(p);
 	rcu_read_unlock();
-
+	retval = sched_setattr(p, &attr);
+	put_task_struct(p);
+exit:
 	return retval;
 }
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v8 8/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region
  2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (6 preceding siblings ...)
  2019-06-28  8:06 ` [PATCH v8 7/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-06-28  8:06 ` Juri Lelli
  2019-07-01 19:13   ` Peter Zijlstra
  7 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2019-06-28  8:06 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving sched_setscheduler_nocheck() call outside of
the atomic region, as it doesn't actually require to be guarded by
rcu_node lock.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/rcu/tree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 980ca3ca643f..32ea75acba14 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3123,13 +3123,13 @@ static int __init rcu_spawn_gp_kthread(void)
 	t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__))
 		return 0;
+	if (kthread_prio)
+		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
 	rnp = rcu_get_root();
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	rcu_state.gp_kthread = t;
-	if (kthread_prio) {
+	if (kthread_prio)
 		sp.sched_priority = kthread_prio;
-		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
-	}
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	wake_up_process(t);
 	rcu_spawn_nocb_kthreads();
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem
  2019-06-28  8:06 ` [PATCH v8 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
@ 2019-06-28 12:45   ` Peter Zijlstra
  2019-06-28 14:31     ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-06-28 12:45 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Fri, Jun 28, 2019 at 10:06:15AM +0200, Juri Lelli wrote:
> @@ -2154,7 +2154,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>  	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
>  	cs = css_cs(css);
>  
> -	mutex_lock(&cpuset_mutex);
> +	percpu_down_read(&cpuset_rwsem);
>  
>  	/* allow moving tasks into an empty cpuset if on default hierarchy */
>  	ret = -ENOSPC;
> @@ -2178,7 +2178,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>  	cs->attach_in_progress++;
>  	ret = 0;
>  out_unlock:
> -	mutex_unlock(&cpuset_mutex);
> +	percpu_up_read(&cpuset_rwsem);
>  	return ret;
>  }
>  
> @@ -2188,9 +2188,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>  
>  	cgroup_taskset_first(tset, &css);
>  
> -	mutex_lock(&cpuset_mutex);
> +	percpu_down_read(&cpuset_rwsem);
>  	css_cs(css)->attach_in_progress--;
> -	mutex_unlock(&cpuset_mutex);
> +	percpu_up_read(&cpuset_rwsem);
>  }

These are the only percpu_down_read()s introduced in this patch; are we
sure this is correct? Specifically, what serializes
->attach_in_progress?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-06-28  8:06 ` [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
@ 2019-06-28 13:03   ` Peter Zijlstra
  2019-07-01  6:52     ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-06-28 13:03 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	Prateek Sood

On Fri, Jun 28, 2019 at 10:06:16AM +0200, Juri Lelli wrote:
> cpuset_rwsem is going to be acquired from sched_setscheduler() with a
> following patch. There are however paths (e.g., spawn_ksoftirqd) in
> which sched_scheduler() is eventually called while holding hotplug lock;
> this creates a dependecy between hotplug lock (to be always acquired
> first) and cpuset_rwsem (to be always acquired after hotplug lock).
> 
> Fix paths which currently take the two locks in the wrong order (after
> a following patch is applied).
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

This all reminds me of this:

  https://lkml.kernel.org/r/1510755615-25906-1-git-send-email-prsood@codeaurora.org

Which sadly got reverted again. If we do this now (I've always been a
proponent), then we can make that rebuild synchronous again, which
should also help here IIRC.

> ---
>  include/linux/cpuset.h |  8 ++++----
>  kernel/cgroup/cpuset.c | 22 +++++++++++++++++-----
>  2 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 934633a05d20..7f1478c26a33 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -40,14 +40,14 @@ static inline bool cpusets_enabled(void)
>  
>  static inline void cpuset_inc(void)
>  {
> -	static_branch_inc(&cpusets_pre_enable_key);
> -	static_branch_inc(&cpusets_enabled_key);
> +	static_branch_inc_cpuslocked(&cpusets_pre_enable_key);
> +	static_branch_inc_cpuslocked(&cpusets_enabled_key);
>  }
>  
>  static inline void cpuset_dec(void)
>  {
> -	static_branch_dec(&cpusets_enabled_key);
> -	static_branch_dec(&cpusets_pre_enable_key);
> +	static_branch_dec_cpuslocked(&cpusets_enabled_key);
> +	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
>  }
>  
>  extern int cpuset_init(void);
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a7c0c8d8f132..d92b351f89e3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1026,8 +1026,8 @@ static void rebuild_sched_domains_locked(void)
>  	cpumask_var_t *doms;
>  	int ndoms;
>  
> +	lockdep_assert_cpus_held();
>  	percpu_rwsem_assert_held(&cpuset_rwsem);
> -	get_online_cpus();
>  
>  	/*
>  	 * We have raced with CPU hotplug. Don't do anything to avoid
> @@ -1036,19 +1036,17 @@ static void rebuild_sched_domains_locked(void)
>  	 */
>  	if (!top_cpuset.nr_subparts_cpus &&
>  	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> -		goto out;
> +		return;
>  
>  	if (top_cpuset.nr_subparts_cpus &&
>  	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> -		goto out;
> +		return;
>  
>  	/* Generate domain masks and attrs */
>  	ndoms = generate_sched_domains(&doms, &attr);
>  
>  	/* Have scheduler rebuild the domains */
>  	partition_and_rebuild_sched_domains(ndoms, doms, attr);
> -out:
> -	put_online_cpus();
>  }
>  #else /* !CONFIG_SMP */
>  static void rebuild_sched_domains_locked(void)
> @@ -1058,9 +1056,11 @@ static void rebuild_sched_domains_locked(void)
>  
>  void rebuild_sched_domains(void)
>  {
> +	get_online_cpus();
>  	percpu_down_write(&cpuset_rwsem);
>  	rebuild_sched_domains_locked();
>  	percpu_up_write(&cpuset_rwsem);
> +	put_online_cpus();
>  }
>  
>  /**
> @@ -2298,6 +2298,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
>  	cpuset_filetype_t type = cft->private;
>  	int retval = 0;
>  
> +	get_online_cpus();
>  	percpu_down_write(&cpuset_rwsem);
>  	if (!is_cpuset_online(cs)) {
>  		retval = -ENODEV;
> @@ -2335,6 +2336,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
>  	}
>  out_unlock:
>  	percpu_up_write(&cpuset_rwsem);
> +	put_online_cpus();
>  	return retval;
>  }
>  
> @@ -2345,6 +2347,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
>  	cpuset_filetype_t type = cft->private;
>  	int retval = -ENODEV;
>  
> +	get_online_cpus();
>  	percpu_down_write(&cpuset_rwsem);
>  	if (!is_cpuset_online(cs))
>  		goto out_unlock;
> @@ -2359,6 +2362,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
>  	}
>  out_unlock:
>  	percpu_up_write(&cpuset_rwsem);
> +	put_online_cpus();
>  	return retval;
>  }
>  
> @@ -2397,6 +2401,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>  	kernfs_break_active_protection(of->kn);
>  	flush_work(&cpuset_hotplug_work);
>  
> +	get_online_cpus();
>  	percpu_down_write(&cpuset_rwsem);
>  	if (!is_cpuset_online(cs))
>  		goto out_unlock;
> @@ -2422,6 +2427,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>  	free_cpuset(trialcs);
>  out_unlock:
>  	percpu_up_write(&cpuset_rwsem);
> +	put_online_cpus();
>  	kernfs_unbreak_active_protection(of->kn);
>  	css_put(&cs->css);
>  	flush_workqueue(cpuset_migrate_mm_wq);
> @@ -2552,6 +2558,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
>  		return -EINVAL;
>  
>  	css_get(&cs->css);
> +	get_online_cpus();
>  	percpu_down_write(&cpuset_rwsem);
>  	if (!is_cpuset_online(cs))
>  		goto out_unlock;
> @@ -2559,6 +2566,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
>  	retval = update_prstate(cs, val);
>  out_unlock:
>  	percpu_up_write(&cpuset_rwsem);
> +	put_online_cpus();
>  	css_put(&cs->css);
>  	return retval ?: nbytes;
>  }
> @@ -2764,6 +2772,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
>  	if (!parent)
>  		return 0;
>  
> +	get_online_cpus();
>  	percpu_down_write(&cpuset_rwsem);
>  
>  	set_bit(CS_ONLINE, &cs->flags);
> @@ -2816,6 +2825,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
>  	spin_unlock_irq(&callback_lock);
>  out_unlock:
>  	percpu_up_write(&cpuset_rwsem);
> +	put_online_cpus();
>  	return 0;
>  }
>  
> @@ -2834,6 +2844,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
>  {
>  	struct cpuset *cs = css_cs(css);
>  
> +	get_online_cpus();
>  	percpu_down_write(&cpuset_rwsem);
>  
>  	if (is_partition_root(cs))
> @@ -2854,6 +2865,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
>  	clear_bit(CS_ONLINE, &cs->flags);
>  
>  	percpu_up_write(&cpuset_rwsem);
> +	put_online_cpus();
>  }
>  
>  static void cpuset_css_free(struct cgroup_subsys_state *css)
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem
  2019-06-28 12:45   ` Peter Zijlstra
@ 2019-06-28 14:31     ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2019-06-28 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

Hi,

On 28/06/19 14:45, Peter Zijlstra wrote:
> On Fri, Jun 28, 2019 at 10:06:15AM +0200, Juri Lelli wrote:
> > @@ -2154,7 +2154,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> >  	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> >  	cs = css_cs(css);
> >  
> > -	mutex_lock(&cpuset_mutex);
> > +	percpu_down_read(&cpuset_rwsem);
> >  
> >  	/* allow moving tasks into an empty cpuset if on default hierarchy */
> >  	ret = -ENOSPC;
> > @@ -2178,7 +2178,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> >  	cs->attach_in_progress++;
> >  	ret = 0;
> >  out_unlock:
> > -	mutex_unlock(&cpuset_mutex);
> > +	percpu_up_read(&cpuset_rwsem);
> >  	return ret;
> >  }
> >  
> > @@ -2188,9 +2188,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
> >  
> >  	cgroup_taskset_first(tset, &css);
> >  
> > -	mutex_lock(&cpuset_mutex);
> > +	percpu_down_read(&cpuset_rwsem);
> >  	css_cs(css)->attach_in_progress--;
> > -	mutex_unlock(&cpuset_mutex);
> > +	percpu_up_read(&cpuset_rwsem);
> >  }
> 
> These are the only percpu_down_read()s introduced in this patch; are we
> sure this is correct? Specifically, what serializes
> ->attach_in_progress?

No, I think it's wrong, sorry. I'll change to the write variant in next
version.

Thanks,

Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-06-28 13:03   ` Peter Zijlstra
@ 2019-07-01  6:52     ` Juri Lelli
  2019-07-01  8:27       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2019-07-01  6:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	Prateek Sood

Hi,

On 28/06/19 15:03, Peter Zijlstra wrote:
> On Fri, Jun 28, 2019 at 10:06:16AM +0200, Juri Lelli wrote:
> > cpuset_rwsem is going to be acquired from sched_setscheduler() with a
> > following patch. There are however paths (e.g., spawn_ksoftirqd) in
> > which sched_scheduler() is eventually called while holding hotplug lock;
> > this creates a dependecy between hotplug lock (to be always acquired
> > first) and cpuset_rwsem (to be always acquired after hotplug lock).
> > 
> > Fix paths which currently take the two locks in the wrong order (after
> > a following patch is applied).
> > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> 
> This all reminds me of this:
> 
>   https://lkml.kernel.org/r/1510755615-25906-1-git-send-email-prsood@codeaurora.org
> 
> Which sadly got reverted again. If we do this now (I've always been a
> proponent), then we can make that rebuild synchronous again, which
> should also help here IIRC.

Why was that reverted? Perf regression of some type?

Thanks,

Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-07-01  6:52     ` Juri Lelli
@ 2019-07-01  8:27       ` Peter Zijlstra
  2019-07-01 14:51         ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-01  8:27 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	Prateek Sood

On Mon, Jul 01, 2019 at 08:52:33AM +0200, Juri Lelli wrote:
> Hi,
> 
> On 28/06/19 15:03, Peter Zijlstra wrote:
> > On Fri, Jun 28, 2019 at 10:06:16AM +0200, Juri Lelli wrote:
> > > cpuset_rwsem is going to be acquired from sched_setscheduler() with a
> > > following patch. There are however paths (e.g., spawn_ksoftirqd) in
> > > which sched_scheduler() is eventually called while holding hotplug lock;
> > > this creates a dependecy between hotplug lock (to be always acquired
> > > first) and cpuset_rwsem (to be always acquired after hotplug lock).
> > > 
> > > Fix paths which currently take the two locks in the wrong order (after
> > > a following patch is applied).
> > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > 
> > This all reminds me of this:
> > 
> >   https://lkml.kernel.org/r/1510755615-25906-1-git-send-email-prsood@codeaurora.org
> > 
> > Which sadly got reverted again. If we do this now (I've always been a
> > proponent), then we can make that rebuild synchronous again, which
> > should also help here IIRC.
> 
> Why was that reverted? Perf regression of some type?

IIRC TJ figured it wasn't strictly required to fix the lock invertion at
that time and they sorted it differently. If I (re)read the thread
correctly the other day, he didn't have fundamental objections against
it, but wanted the simpler fix.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-07-01  8:27       ` Peter Zijlstra
@ 2019-07-01 14:51         ` Tejun Heo
  2019-07-04  8:49           ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2019-07-01 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	Prateek Sood

Hello,

On Mon, Jul 01, 2019 at 10:27:31AM +0200, Peter Zijlstra wrote:
> IIRC TJ figured it wasn't strictly required to fix the lock invertion at
> that time and they sorted it differently. If I (re)read the thread
> correctly the other day, he didn't have fundamental objections against
> it, but wanted the simpler fix.

Yeah I've got no objections to the change itself, it just wasn't
needed at the time.  We've had multiple issues there tho, so please
keep an eye open after the changes get merged.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 7/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-06-28  8:06 ` [PATCH v8 7/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-07-01 19:11   ` Peter Zijlstra
  2019-07-02  7:01     ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-01 19:11 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Fri, Jun 28, 2019 at 10:06:17AM +0200, Juri Lelli wrote:
> No synchronisation mechanism exists between the cpuset subsystem and
> calls to function __sched_setscheduler(). As such, it is possible that
> new root domains are created on the cpuset side while a deadline
> acceptance test is carried out in __sched_setscheduler(), leading to a
> potential oversell of CPU bandwidth.
> 
> Grab cpuset_rwsem read lock from core scheduler, so to prevent
> situations such as the one described above from happening.
> 

ISTR there being a funny vs normalize_rt_tasks(); maybe mention that?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 8/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region
  2019-06-28  8:06 ` [PATCH v8 8/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region Juri Lelli
@ 2019-07-01 19:13   ` Peter Zijlstra
  2019-07-02  7:01     ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-01 19:13 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Fri, Jun 28, 2019 at 10:06:18AM +0200, Juri Lelli wrote:
> sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
> called from an invalid (atomic) context by rcu_spawn_gp_kthread().
> 
> Fix that by simply moving sched_setscheduler_nocheck() call outside of
> the atomic region, as it doesn't actually require to be guarded by
> rcu_node lock.

Maybe move this earlier in the series such that the bug doesn't manifest
in bisection?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 7/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-07-01 19:11   ` Peter Zijlstra
@ 2019-07-02  7:01     ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2019-07-02  7:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 01/07/19 21:11, Peter Zijlstra wrote:
> On Fri, Jun 28, 2019 at 10:06:17AM +0200, Juri Lelli wrote:
> > No synchronisation mechanism exists between the cpuset subsystem and
> > calls to function __sched_setscheduler(). As such, it is possible that
> > new root domains are created on the cpuset side while a deadline
> > acceptance test is carried out in __sched_setscheduler(), leading to a
> > potential oversell of CPU bandwidth.
> > 
> > Grab cpuset_rwsem read lock from core scheduler, so to prevent
> > situations such as the one described above from happening.
> > 
> 
> ISTR there being a funny vs normalize_rt_tasks(); maybe mention that?

Yep. I'll add a comment about it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 8/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region
  2019-07-01 19:13   ` Peter Zijlstra
@ 2019-07-02  7:01     ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2019-07-02  7:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 01/07/19 21:13, Peter Zijlstra wrote:
> On Fri, Jun 28, 2019 at 10:06:18AM +0200, Juri Lelli wrote:
> > sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
> > called from an invalid (atomic) context by rcu_spawn_gp_kthread().
> > 
> > Fix that by simply moving sched_setscheduler_nocheck() call outside of
> > the atomic region, as it doesn't actually require to be guarded by
> > rcu_node lock.
> 
> Maybe move this earlier in the series such that the bug doesn't manifest
> in bisection?

OK.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-07-01 14:51         ` Tejun Heo
@ 2019-07-04  8:49           ` Juri Lelli
  2019-07-12 14:04             ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2019-07-04  8:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, mingo, rostedt, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	cgroups, Prateek Sood

Hi,

On 01/07/19 07:51, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 01, 2019 at 10:27:31AM +0200, Peter Zijlstra wrote:
> > IIRC TJ figured it wasn't strictly required to fix the lock invertion at
> > that time and they sorted it differently. If I (re)read the thread
> > correctly the other day, he didn't have fundamental objections against
> > it, but wanted the simpler fix.
> 
> Yeah I've got no objections to the change itself, it just wasn't
> needed at the time.  We've had multiple issues there tho, so please
> keep an eye open after the changes get merged.

Should I take this as an indication that you had a look at the set and
(apart from Peter's comments) you are OK with them?

If that's the case I will send a v9 out soon. Otherwise I'd kindly ask
you to please have a look.

Thanks!

Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-07-04  8:49           ` Juri Lelli
@ 2019-07-12 14:04             ` Juri Lelli
  2019-07-16 15:36               ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2019-07-12 14:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, mingo, rostedt, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	cgroups, Prateek Sood

On 04/07/19 10:49, Juri Lelli wrote:
> Hi,
> 
> On 01/07/19 07:51, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Jul 01, 2019 at 10:27:31AM +0200, Peter Zijlstra wrote:
> > > IIRC TJ figured it wasn't strictly required to fix the lock invertion at
> > > that time and they sorted it differently. If I (re)read the thread
> > > correctly the other day, he didn't have fundamental objections against
> > > it, but wanted the simpler fix.
> > 
> > Yeah I've got no objections to the change itself, it just wasn't
> > needed at the time.  We've had multiple issues there tho, so please
> > keep an eye open after the changes get merged.
> 
> Should I take this as an indication that you had a look at the set and
> (apart from Peter's comments) you are OK with them?
> 
> If that's the case I will send a v9 out soon. Otherwise I'd kindly ask
> you to please have a look.

Gentle ping.

Thanks,

Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-07-12 14:04             ` Juri Lelli
@ 2019-07-16 15:36               ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2019-07-16 15:36 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, mingo, rostedt, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	cgroups, Prateek Sood

On Fri, Jul 12, 2019 at 04:04:09PM +0200, Juri Lelli wrote:
> > Should I take this as an indication that you had a look at the set and
> > (apart from Peter's comments) you are OK with them?
> > 
> > If that's the case I will send a v9 out soon. Otherwise I'd kindly ask
> > you to please have a look.
> 
> Gentle ping.

Sorry about the late response.  Yes, I have no objection to the change
but can you please cc Waiman Long on the next round?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-07-16 15:36 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-28  8:06 [PATCH v8 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2019-06-28  8:06 ` [PATCH v8 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
2019-06-28  8:06 ` [PATCH v8 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
2019-06-28  8:06 ` [PATCH v8 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
2019-06-28  8:06 ` [PATCH v8 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
2019-06-28  8:06 ` [PATCH v8 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
2019-06-28 12:45   ` Peter Zijlstra
2019-06-28 14:31     ` Juri Lelli
2019-06-28  8:06 ` [PATCH v8 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
2019-06-28 13:03   ` Peter Zijlstra
2019-07-01  6:52     ` Juri Lelli
2019-07-01  8:27       ` Peter Zijlstra
2019-07-01 14:51         ` Tejun Heo
2019-07-04  8:49           ` Juri Lelli
2019-07-12 14:04             ` Juri Lelli
2019-07-16 15:36               ` Tejun Heo
2019-06-28  8:06 ` [PATCH v8 7/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
2019-07-01 19:11   ` Peter Zijlstra
2019-07-02  7:01     ` Juri Lelli
2019-06-28  8:06 ` [PATCH v8 8/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region Juri Lelli
2019-07-01 19:13   ` Peter Zijlstra
2019-07-02  7:01     ` Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).