linux-kernel.vger.kernel.org archive mirror
* [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting
@ 2019-07-19 13:59 Juri Lelli
  2019-07-19 13:59 ` [PATCH v9 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups,
	Juri Lelli

Hi,

This is v9 of a series of patches, originally authored by Mathieu, with the
intent of fixing a long-standing issue in SCHED_DEADLINE bandwidth accounting.
As originally reported by Steve [1], when hotplug and/or (certain)
cpuset reconfiguration operations take place, DEADLINE bandwidth
accounting information is lost since root domains are destroyed and
recreated.
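
For reference (and not as part of the series itself): the accounting in
question is the per-root-domain sum of the bandwidths reserved via
sched_setattr(). A minimal userspace sketch that creates such a reservation
follows; glibc has no wrapper for the syscall, so the attr struct is defined
locally. Running it and then offlining/onlining a CPU (e.g. via
/sys/devices/system/cpu/cpuN/online) is one way to exercise the
lost-accounting scenario described above. Values and error handling are
illustrative only; SCHED_DEADLINE typically requires root.

/* dl_reserve.c - create a SCHED_DEADLINE reservation */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;         /* DEADLINE parameters, in ns */
        uint64_t sched_deadline;
        uint64_t sched_period;
};

#define SCHED_DEADLINE 6

int main(void)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                .sched_policy   = SCHED_DEADLINE,
                .sched_runtime  = 10 * 1000 * 1000,     /* 10 ms */
                .sched_deadline = 30 * 1000 * 1000,     /* 30 ms */
                .sched_period   = 30 * 1000 * 1000,     /* 30 ms */
        };

        /* admission control adds runtime/period to the root domain */
        if (syscall(__NR_sched_setattr, 0, &attr, 0)) {
                perror("sched_setattr");
                return 1;
        }

        pause();        /* keep the reservation alive */
        return 0;
}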

Mathieu's approach is based on restoring bandwidth accounting info on
the newly created root domains by iterating through the (DEADLINE) tasks
belonging to the configured cpuset(s).

Apart from the usual rebase on top of cgroup/for-next, this version:

 - makes cpuset_{can,cancel}_attach grab cpuset_rwsem for write (5/8 - Peter)
 - moves v8 8/8 to 7/8 for bisectability (Peter)
 - adds a comment in the changelog regarding normalize_rt_tasks() (8/8 - Peter)

The set is also available at

 https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v9

Thanks,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966

Juri Lelli (6):
  cpuset: Rebuild root domain deadline accounting information
  sched/deadline: Fix bandwidth accounting at all levels after offline
    migration
  cgroup/cpuset: convert cpuset_mutex to percpu_rwsem
  cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  rcu/tree: Setschedule gp kthread to SCHED_FIFO outside of atomic region
  sched/core: Prevent race condition between cpuset and
    __sched_setscheduler()

Mathieu Poirier (2):
  sched/topology: Adding function partition_sched_domains_locked()
  sched/core: Streamlining calls to task_rq_unlock()

 include/linux/cgroup.h         |   1 +
 include/linux/cpuset.h         |  13 ++-
 include/linux/sched.h          |   5 +
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |  10 ++
 kernel/cgroup/cgroup.c         |   2 +-
 kernel/cgroup/cpuset.c         | 163 +++++++++++++++++++++++++--------
 kernel/rcu/tree.c              |   6 +-
 kernel/sched/core.c            |  57 ++++++++----
 kernel/sched/deadline.c        |  63 +++++++++++++
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  30 +++++-
 12 files changed, 290 insertions(+), 71 deletions(-)

-- 
2.17.2



* [PATCH v9 1/8] sched/topology: Adding function partition_sched_domains_locked()
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
@ 2019-07-19 13:59 ` Juri Lelli
  2019-07-25 16:19   ` [tip:sched/core] sched/topology: Add partition_sched_domains_locked() tip-bot for Mathieu Poirier
  2019-07-19 13:59 ` [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Introduce partition_sched_domains_locked() by moving the mutex locking
code out of the original function.  That way the work done by
partition_sched_domains_locked() can be reused without dropping the
mutex lock.

No change of functionality is introduced by this patch.
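
For illustration (this usage is not part of the patch itself; a later patch
in the series adds such a caller in the cpuset code), the intended split is:

        /* existing callers keep using the wrapper, which takes the mutex: */
        partition_sched_domains(ndoms, doms, attr);

        /*
         * a caller that must do more work against the freshly built domains
         * takes sched_domains_mutex itself and uses the _locked() variant:
         */
        mutex_lock(&sched_domains_mutex);
        partition_sched_domains_locked(ndoms, doms, attr);
        /* ... additional work under the same critical section ... */
        mutex_unlock(&sched_domains_mutex);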

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 17 +++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cfc0a89a7159..d7166f8c0215 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -161,6 +161,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -213,6 +217,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..362c383ec4bd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2159,16 +2159,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	bool __maybe_unused has_eas = false;
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -2251,6 +2251,15 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
-- 
2.17.2



* [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-07-19 13:59 ` [PATCH v9 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2019-07-19 13:59 ` Juri Lelli
  2019-07-22  8:21   ` Dietmar Eggemann
  2019-07-25 16:20   ` [tip:sched/core] sched/core: Streamle " tip-bot for Mathieu Poirier
  2019-07-19 13:59 ` [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler().  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location needs to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.
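
As an aside, the resulting control flow follows the usual single-unlock-label
pattern; a schematic sketch (the check and names here are placeholders, not
the actual conditions in __sched_setscheduler()):

        rq = task_rq_lock(p, &rf);

        if (some_check_fails) {
                retval = -EINVAL;
                goto unlock;
        }
        /* ... more checks, each one just sets retval and jumps ... */

        task_rq_unlock(rq, p, &rf);
        return 0;

unlock:
        task_rq_unlock(rq, p, &rf);
        return retval;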

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..acd6a9fe85bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4222,8 +4222,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4239,8 +4239,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4253,8 +4253,8 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4289,8 +4289,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4346,6 +4346,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,
-- 
2.17.2



* [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-07-19 13:59 ` [PATCH v9 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
  2019-07-19 13:59 ` [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2019-07-19 13:59 ` Juri Lelli
  2019-07-25 13:53   ` Ingo Molnar
                     ` (3 more replies)
  2019-07-19 13:59 ` [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
                   ` (5 subsequent siblings)
  8 siblings, 4 replies; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups,
	Juri Lelli

When the topology of root domains is modified by cpuset or CPU hotplug
operations, information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information: it cycles through the deadline tasks held in
cpusets and adds their current load to the root domain they are
associated with.
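
As background (these facts come from the existing deadline code, they are not
introduced here): the per-task bandwidth that gets re-added is the fixed-point
runtime/period ratio computed at admission time, and the rebuild restores the
per-root-domain sum of those values:

        /* set up by __setparam_dl() when the task is admitted: */
        dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
        /* to_ratio(period, runtime) == (runtime << BW_SHIFT) / period */

        /*
         * Invariant restored by the rebuild, for every root domain rd:
         *
         *   rd->dl_bw.total_bw == sum of p->dl.dl_bw over all DEADLINE
         *                         tasks p with task_rq(p)->rd == rd
         */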

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/cgroup.h         |  1 +
 include/linux/sched.h          |  5 +++
 include/linux/sched/deadline.h |  8 +++++
 kernel/cgroup/cgroup.c         |  2 +-
 kernel/cgroup/cpuset.c         | 64 +++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 30 ++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 13 ++++++-
 8 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3745ecdad925..107b8d5943bc 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -150,6 +150,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
+void cgroup_enable_task_cg_lists(void);
 void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..f74738953e70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -281,6 +281,11 @@ struct vtime {
 	u64			gtime;
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f582414e15ba..d356905044a2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1879,7 +1879,7 @@ static int cgroup_reconfigure(struct fs_context *fc)
  */
 static bool use_task_css_set_links __read_mostly;
 
-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
 {
 	struct task_struct *p, *g;
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 95da64cc8732..48d29a6112cb 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -45,6 +45,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -947,6 +948,67 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	lockdep_assert_held(&cpuset_mutex);
+	lockdep_assert_cpus_held();
+	lockdep_assert_held(&sched_domains_mutex);
+
+	cgroup_enable_task_cg_lists();
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -984,7 +1046,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43901fa3f269..4cedcf8d6b03 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2283,6 +2283,36 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+
+	raw_spin_unlock(&dl_b->lock);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1ada0be..8607ceb11e8a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,9 +783,6 @@ struct root_domain {
 	struct perf_domain __rcu *pd;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 362c383ec4bd..9fc6ad3c341f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2193,8 +2193,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
-			    dattrs_equal(dattr_cur, i, dattr_new, j))
+			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.17.2



* [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (2 preceding siblings ...)
  2019-07-19 13:59 ` [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
@ 2019-07-19 13:59 ` Juri Lelli
  2019-07-22 11:07   ` Dietmar Eggemann
  2019-07-25 16:21   ` [tip:sched/core] " tip-bot for Juri Lelli
  2019-07-19 13:59 ` [PATCH v9 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups,
	Juri Lelli

If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up for this_bw, running_bw and total_bw when the replenishment
timer fires and the task is migrated (dl_task_offline_migration()).
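
As a reading aid (a summary, not new code), the three counters live at two
different levels and are touched by different helpers; the add_*()/__dl_add()
counterparts do the same on the destination rq and root domain:

        /* per-rq accounting, fields of struct dl_rq: */
        sub_rq_bw(&p->dl, &rq->dl);             /* this_bw    -= p->dl.dl_bw */
        sub_running_bw(&p->dl, &rq->dl);        /* running_bw -= p->dl.dl_bw */

        /* per-root-domain accounting, struct dl_bw embedded in the rd: */
        __dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
                                                /* total_bw   -= p->dl.dl_bw */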

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4cedcf8d6b03..f0166ab8c6b4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -529,6 +529,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
 static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
 {
 	struct rq *later_rq = NULL;
+	struct dl_bw *dl_b;
 
 	later_rq = find_lock_later_rq(p, rq);
 	if (!later_rq) {
@@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 		double_lock_balance(rq, later_rq);
 	}
 
+	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
+		/*
+		 * Inactive timer is armed (or callback is running, but
+		 * waiting for us to release rq locks). In any case, when it
+		 * will file (or continue), it will see running_bw of this
+		 * task migrated to later_rq (and correctly handle it).
+		 */
+		sub_running_bw(&p->dl, &rq->dl);
+		sub_rq_bw(&p->dl, &rq->dl);
+
+		add_rq_bw(&p->dl, &later_rq->dl);
+		add_running_bw(&p->dl, &later_rq->dl);
+	} else {
+		sub_rq_bw(&p->dl, &rq->dl);
+		add_rq_bw(&p->dl, &later_rq->dl);
+	}
+
+	/*
+	 * And we finally need to fixup root_domain(s) bandwidth accounting,
+	 * since p is still hanging out in the old (now moved to default) root
+	 * domain.
+	 */
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
+	dl_b = &later_rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
 	set_task_cpu(p, later_rq->cpu);
 	double_unlock_balance(later_rq, rq);
 
-- 
2.17.2



* [PATCH v9 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (3 preceding siblings ...)
  2019-07-19 13:59 ` [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
@ 2019-07-19 13:59 ` Juri Lelli
  2019-07-25 16:22   ` [tip:sched/core] cgroup/cpuset: Convert " tip-bot for Juri Lelli
  2019-07-19 13:59 ` [PATCH v9 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups,
	Juri Lelli

Holding cpuset_mutex means that cpusets are stable (only the holder can
make changes), and this is required for fixing a synchronization issue
between cpusets and the scheduler core.  However, grabbing cpuset_mutex
from the setscheduler() hot path (as implemented in a later patch) is a
no-go, as it would create a bottleneck for tasks concurrently calling
setscheduler().

Convert cpuset_mutex into a percpu_rwsem (cpuset_rwsem), so that
setscheduler() will be able to take it for read and avoid such
concurrency issues.
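
The intended locking pattern, sketched (the read-side user is only added by
the last patch of the series, so this is anticipation, not current code):

        /* setscheduler() hot path, later patch: cheap, scales with readers */
        percpu_down_read(&cpuset_rwsem);
        /* ... DEADLINE admission test against stable cpusets ... */
        percpu_up_read(&cpuset_rwsem);

        /* cpuset reconfiguration, this patch: rare, pays the write-side cost */
        percpu_down_write(&cpuset_rwsem);
        rebuild_sched_domains_locked();
        percpu_up_write(&cpuset_rwsem);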

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---
v8 -> v9:
 - make cpuset_{can,cancel}_attach grab cpuset_rwsem for write (Peter)
---
 kernel/cgroup/cpuset.c | 68 ++++++++++++++++++++++--------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 48d29a6112cb..85491d09f3d3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -333,7 +333,7 @@ static struct cpuset top_cpuset = {
  * guidelines for accessing subsystem state in kernel/cgroup.c
  */
 
-static DEFINE_MUTEX(cpuset_mutex);
+DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
 static DEFINE_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
@@ -966,7 +966,7 @@ static void rebuild_root_domains(void)
 	struct cpuset *cs = NULL;
 	struct cgroup_subsys_state *pos_css;
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 	lockdep_assert_cpus_held();
 	lockdep_assert_held(&sched_domains_mutex);
 
@@ -1026,7 +1026,7 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	int ndoms;
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 	get_online_cpus();
 
 	/*
@@ -1058,9 +1058,9 @@ static void rebuild_sched_domains_locked(void)
 
 void rebuild_sched_domains(void)
 {
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	rebuild_sched_domains_locked();
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /**
@@ -1166,7 +1166,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	bool part_error = false;	/* Partition error? */
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 
 	/*
 	 * The parent must be a partition root.
@@ -2154,7 +2154,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* allow moving tasks into an empty cpuset if on default hierarchy */
 	ret = -ENOSPC;
@@ -2178,7 +2178,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cs->attach_in_progress++;
 	ret = 0;
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return ret;
 }
 
@@ -2188,9 +2188,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_first(tset, &css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	css_cs(css)->attach_in_progress--;
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /*
@@ -2213,7 +2213,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* prepare for attach */
 	if (cs == &top_cpuset)
@@ -2267,7 +2267,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	if (!cs->attach_in_progress)
 		wake_up(&cpuset_attach_wq);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2298,7 +2298,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = 0;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
 		goto out_unlock;
@@ -2334,7 +2334,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return retval;
 }
 
@@ -2345,7 +2345,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = -ENODEV;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2358,7 +2358,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return retval;
 }
 
@@ -2397,7 +2397,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	kernfs_break_active_protection(of->kn);
 	flush_work(&cpuset_hotplug_work);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2421,7 +2421,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
 	free_cpuset(trialcs);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
 	flush_workqueue(cpuset_migrate_mm_wq);
@@ -2552,13 +2552,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		return -EINVAL;
 
 	css_get(&cs->css);
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
 	retval = update_prstate(cs, val);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	css_put(&cs->css);
 	return retval ?: nbytes;
 }
@@ -2764,7 +2764,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (!parent)
 		return 0;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	set_bit(CS_ONLINE, &cs->flags);
 	if (is_spread_page(parent))
@@ -2815,7 +2815,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return 0;
 }
 
@@ -2834,7 +2834,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	if (is_partition_root(cs))
 		update_prstate(cs, 0);
@@ -2853,7 +2853,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)
@@ -2865,7 +2865,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
@@ -2878,7 +2878,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 	}
 
 	spin_unlock_irq(&callback_lock);
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /*
@@ -2922,6 +2922,8 @@ int __init cpuset_init(void)
 {
 	int err = 0;
 
+	BUG_ON(percpu_init_rwsem(&cpuset_rwsem));
+
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
@@ -2997,7 +2999,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
 		   nodes_empty(cs->mems_allowed);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/*
 	 * Move tasks to the nearest ancestor with execution resources,
@@ -3007,7 +3009,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	if (is_empty)
 		remove_tasks_in_empty_cpuset(cs);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 }
 
 static void
@@ -3057,14 +3059,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 retry:
 	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/*
 	 * We have raced with task attaching. We wait until attaching
 	 * is finished, so we won't attach a task to an empty cpuset.
 	 */
 	if (cs->attach_in_progress) {
-		mutex_unlock(&cpuset_mutex);
+		percpu_up_write(&cpuset_rwsem);
 		goto retry;
 	}
 
@@ -3132,7 +3134,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /**
@@ -3162,7 +3164,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	if (on_dfl && !alloc_cpumasks(NULL, &tmp))
 		ptmp = &tmp;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* fetch the available cpus/mems and find out which changed how */
 	cpumask_copy(&new_cpus, cpu_active_mask);
@@ -3212,7 +3214,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		update_tasks_nodemask(&top_cpuset);
 	}
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/* if cpus or mems changed, we need to propagate to descendants */
 	if (cpus_updated || mems_updated) {
-- 
2.17.2



* [PATCH v9 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (4 preceding siblings ...)
  2019-07-19 13:59 ` [PATCH v9 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
@ 2019-07-19 13:59 ` Juri Lelli
  2019-07-25 16:23   ` [tip:sched/core] " tip-bot for Juri Lelli
  2019-07-19 13:59 ` [PATCH v9 7/8] rcu/tree: Setschedule gp kthread to SCHED_FIFO outside of atomic region Juri Lelli
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups,
	Juri Lelli

cpuset_rwsem is going to be acquired from sched_setscheduler() by a
following patch. There are however paths (e.g., spawn_ksoftirqd) in
which sched_setscheduler() is eventually called while holding the
hotplug lock; this creates a dependency between the hotplug lock (to be
always acquired first) and cpuset_rwsem (to be always acquired after the
hotplug lock).

Fix paths which currently take the two locks in the wrong order (after
a following patch is applied).
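
After this change every write-side cpuset path nests the two locks in the
same fixed order (sketch):

        get_online_cpus();                      /* hotplug lock first */
        percpu_down_write(&cpuset_rwsem);       /* cpuset_rwsem second */
        /* ... cpuset update, possibly rebuilding sched domains ... */
        percpu_up_write(&cpuset_rwsem);
        put_online_cpus();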

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/cpuset.h |  8 ++++----
 kernel/cgroup/cpuset.c | 22 +++++++++++++++++-----
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..7f1478c26a33 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -40,14 +40,14 @@ static inline bool cpusets_enabled(void)
 
 static inline void cpuset_inc(void)
 {
-	static_branch_inc(&cpusets_pre_enable_key);
-	static_branch_inc(&cpusets_enabled_key);
+	static_branch_inc_cpuslocked(&cpusets_pre_enable_key);
+	static_branch_inc_cpuslocked(&cpusets_enabled_key);
 }
 
 static inline void cpuset_dec(void)
 {
-	static_branch_dec(&cpusets_enabled_key);
-	static_branch_dec(&cpusets_pre_enable_key);
+	static_branch_dec_cpuslocked(&cpusets_enabled_key);
+	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
 extern int cpuset_init(void);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 85491d09f3d3..93414ea63252 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1026,8 +1026,8 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	int ndoms;
 
+	lockdep_assert_cpus_held();
 	percpu_rwsem_assert_held(&cpuset_rwsem);
-	get_online_cpus();
 
 	/*
 	 * We have raced with CPU hotplug. Don't do anything to avoid
@@ -1036,19 +1036,17 @@ static void rebuild_sched_domains_locked(void)
 	 */
 	if (!top_cpuset.nr_subparts_cpus &&
 	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	if (top_cpuset.nr_subparts_cpus &&
 	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
 	partition_and_rebuild_sched_domains(ndoms, doms, attr);
-out:
-	put_online_cpus();
 }
 #else /* !CONFIG_SMP */
 static void rebuild_sched_domains_locked(void)
@@ -1058,9 +1056,11 @@ static void rebuild_sched_domains_locked(void)
 
 void rebuild_sched_domains(void)
 {
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	rebuild_sched_domains_locked();
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 /**
@@ -2298,6 +2298,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = 0;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
@@ -2335,6 +2336,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	}
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2345,6 +2347,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = -ENODEV;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2359,6 +2362,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	}
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2397,6 +2401,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	kernfs_break_active_protection(of->kn);
 	flush_work(&cpuset_hotplug_work);
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2422,6 +2427,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	free_cpuset(trialcs);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
 	flush_workqueue(cpuset_migrate_mm_wq);
@@ -2552,6 +2558,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		return -EINVAL;
 
 	css_get(&cs->css);
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2559,6 +2566,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 	retval = update_prstate(cs, val);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	css_put(&cs->css);
 	return retval ?: nbytes;
 }
@@ -2764,6 +2772,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (!parent)
 		return 0;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 
 	set_bit(CS_ONLINE, &cs->flags);
@@ -2816,6 +2825,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	spin_unlock_irq(&callback_lock);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return 0;
 }
 
@@ -2834,6 +2844,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 
 	if (is_partition_root(cs))
@@ -2854,6 +2865,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	clear_bit(CS_ONLINE, &cs->flags);
 
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)
-- 
2.17.2



* [PATCH v9 7/8] rcu/tree: Setschedule gp kthread to SCHED_FIFO outside of atomic region
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (5 preceding siblings ...)
  2019-07-19 13:59 ` [PATCH v9 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
@ 2019-07-19 13:59 ` Juri Lelli
  2019-07-25 16:24   ` [tip:sched/core] rcu/tree: Call setschedule() " tip-bot for Juri Lelli
  2019-07-19 14:00 ` [PATCH v9 8/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
  2019-07-24  8:45 ` [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Dietmar Eggemann
  8 siblings, 1 reply; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 13:59 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups,
	Juri Lelli

sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving the sched_setscheduler_nocheck() call outside
of the atomic region, as it doesn't actually need to be guarded by the
rcu_node lock.
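
The rule being followed, as an illustrative sketch (not the literal diff):
sched_setscheduler_nocheck() may sleep once it has to take cpuset_rwsem, so
the call, and the setup of its argument, belong before the IRQs-off rcu_node
section:

        /* sleeping call, and its argument setup, outside the atomic region */
        sp.sched_priority = kthread_prio;
        sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);

        raw_spin_lock_irqsave_rcu_node(rnp, flags);     /* atomic from here on */
        rcu_state.gp_kthread = t;
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);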

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/rcu/tree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 980ca3ca643f..32ea75acba14 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3123,13 +3123,13 @@ static int __init rcu_spawn_gp_kthread(void)
 	t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__))
 		return 0;
+	if (kthread_prio)
+		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
 	rnp = rcu_get_root();
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	rcu_state.gp_kthread = t;
-	if (kthread_prio) {
+	if (kthread_prio)
 		sp.sched_priority = kthread_prio;
-		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
-	}
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	wake_up_process(t);
 	rcu_spawn_nocb_kthreads();
-- 
2.17.2



* [PATCH v9 8/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (6 preceding siblings ...)
  2019-07-19 13:59 ` [PATCH v9 7/8] rcu/tree: Setschedule gp kthread to SCHED_FIFO outside of atomic region Juri Lelli
@ 2019-07-19 14:00 ` Juri Lelli
  2019-07-25 16:24   ` [tip:sched/core] " tip-bot for Juri Lelli
  2019-07-24  8:45 ` [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Dietmar Eggemann
  8 siblings, 1 reply; 42+ messages in thread
From: Juri Lelli @ 2019-07-19 14:00 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, dietmar.eggemann, cgroups,
	Juri Lelli

No synchronisation mechanism exists between the cpuset subsystem and
calls to function __sched_setscheduler(). As such, it is possible that
new root domains are created on the cpuset side while a deadline
acceptance test is carried out in __sched_setscheduler(), leading to a
potential oversell of CPU bandwidth.

Grab the cpuset_rwsem read lock from the core scheduler to prevent
situations such as the one described above from happening.

The only exception is normalize_rt_tasks(), which needs to work under
tasklist_lock and therefore can't grab cpuset_rwsem. We are fine with
this, as that function is only called by sysrq and, if it gets
triggered, DEADLINE guarantees are already out of the window anyway.
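
Schematically (condensed from the diff below), the admission path becomes:

        if (pi)
                cpuset_read_lock();     /* percpu_down_read(&cpuset_rwsem) */

        rq = task_rq_lock(p, &rf);
        /*
         * sched_dl_overflow() now runs against root domains that cannot be
         * rebuilt underneath us, so rd->dl_bw reflects reality.
         */
        /* ... */
        task_rq_unlock(rq, p, &rf);

        if (pi)
                cpuset_read_unlock();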

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v8 -> v9:
 - Add comment in changelog regarding normalize_rt_tasks() (Peter)
---
 include/linux/cpuset.h |  5 +++++
 kernel/cgroup/cpuset.c | 11 +++++++++++
 kernel/sched/core.c    | 33 ++++++++++++++++++++++++++-------
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 7f1478c26a33..04c20de66afc 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_lock(void);
+extern void cpuset_read_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,9 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_lock(void) { }
+static inline void cpuset_read_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 93414ea63252..d16e160a3b35 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -334,6 +334,17 @@ static struct cpuset top_cpuset = {
  */
 
 DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
+
+void cpuset_read_lock(void)
+{
+	percpu_down_read(&cpuset_rwsem);
+}
+
+void cpuset_read_unlock(void)
+{
+	percpu_up_read(&cpuset_rwsem);
+}
+
 static DEFINE_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index acd6a9fe85bc..18e151c6b48d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4208,6 +4208,9 @@ static int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
+	if (pi)
+		cpuset_read_lock();
+
 	/*
 	 * Make sure no PI-waiters arrive (or leave) while we are
 	 * changing the priority of the task:
@@ -4280,6 +4283,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
 		task_rq_unlock(rq, p, &rf);
+		if (pi)
+			cpuset_read_unlock();
 		goto recheck;
 	}
 
@@ -4338,8 +4343,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_disable();
 	task_rq_unlock(rq, p, &rf);
 
-	if (pi)
+	if (pi) {
+		cpuset_read_unlock();
 		rt_mutex_adjust_pi(p);
+	}
 
 	/* Run balance callbacks after we've adjusted the PI chain: */
 	balance_callback(rq);
@@ -4349,6 +4356,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
+	if (pi)
+		cpuset_read_unlock();
 	return retval;
 }
 
@@ -4433,10 +4442,15 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setscheduler(p, policy, &lparam);
+	if (!p) {
+		rcu_read_unlock();
+		goto exit;
+	}
+	get_task_struct(p);
 	rcu_read_unlock();
-
+	retval = sched_setscheduler(p, policy, &lparam);
+	put_task_struct(p);
+exit:
 	return retval;
 }
 
@@ -4564,10 +4578,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setattr(p, &attr);
+	if (!p) {
+		rcu_read_unlock();
+		goto exit;
+	}
+	get_task_struct(p);
 	rcu_read_unlock();
-
+	retval = sched_setattr(p, &attr);
+	put_task_struct(p);
+exit:
 	return retval;
 }
 
-- 
2.17.2



* Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-19 13:59 ` [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2019-07-22  8:21   ` Dietmar Eggemann
  2019-07-22  8:32     ` Juri Lelli
  2019-07-25 16:20   ` [tip:sched/core] sched/core: Streamle " tip-bot for Mathieu Poirier
  1 sibling, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2019-07-22  8:21 UTC (permalink / raw)
  To: Juri Lelli, peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, cgroups

On 7/19/19 3:59 PM, Juri Lelli wrote:
> From: Mathieu Poirier <mathieu.poirier@linaro.org>

[...]

> @@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
>  			 */
>  			if (!cpumask_subset(span, &p->cpus_allowed) ||

This doesn't apply cleanly on v5.3-rc1 anymore due to commit
3bd3706251ee ("sched/core: Provide a pointer to the valid CPU mask").

>  			    rq->rd->dl_bw.bw == 0) {
> -				task_rq_unlock(rq, p, &rf);
> -				return -EPERM;
> +				retval = -EPERM;
> +				goto unlock;
>  			}
>  		}
>  #endif

[...]


* Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-22  8:21   ` Dietmar Eggemann
@ 2019-07-22  8:32     ` Juri Lelli
  2019-07-22  9:08       ` Dietmar Eggemann
  2019-07-23 10:31       ` Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Juri Lelli @ 2019-07-22  8:32 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	cgroups

On 22/07/19 10:21, Dietmar Eggemann wrote:
> On 7/19/19 3:59 PM, Juri Lelli wrote:
> > From: Mathieu Poirier <mathieu.poirier@linaro.org>
> 
> [...]
> 
> > @@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
> >  			 */
> >  			if (!cpumask_subset(span, &p->cpus_allowed) ||
> 
> This doesn't apply cleanly on v5.3-rc1 anymore due to commit
> 3bd3706251ee ("sched/core: Provide a pointer to the valid CPU mask").
> 
> >  			    rq->rd->dl_bw.bw == 0) {
> > -				task_rq_unlock(rq, p, &rf);
> > -				return -EPERM;
> > +				retval = -EPERM;
> > +				goto unlock;
> >  			}
> >  		}
> >  #endif

Thanks for reporting. The set is based on cgroup/for-next (as of last
week), though. I can of course rebase on tip/sched/core or mainline if
needed.

Best,

Juri



* Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-22  8:32     ` Juri Lelli
@ 2019-07-22  9:08       ` Dietmar Eggemann
  2019-07-23 10:32         ` Peter Zijlstra
  2019-07-23 10:31       ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2019-07-22  9:08 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	cgroups

On 7/22/19 10:32 AM, Juri Lelli wrote:
> On 22/07/19 10:21, Dietmar Eggemann wrote:
>> On 7/19/19 3:59 PM, Juri Lelli wrote:
>>> From: Mathieu Poirier <mathieu.poirier@linaro.org>
>>
>> [...]
>>
>>> @@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
>>>  			 */
>>>  			if (!cpumask_subset(span, &p->cpus_allowed) ||
>>
>> This doesn't apply cleanly on v5.3-rc1 anymore due to commit
>> 3bd3706251ee ("sched/core: Provide a pointer to the valid CPU mask").
>>
>>>  			    rq->rd->dl_bw.bw == 0) {
>>> -				task_rq_unlock(rq, p, &rf);
>>> -				return -EPERM;
>>> +				retval = -EPERM;
>>> +				goto unlock;
>>>  			}
>>>  		}
>>>  #endif
> 
> Thanks for reporting. The set is based on cgroup/for-next (as of last
> week), though. I can of course rebase on tip/sched/core or mainline if
> needed.

Not sure. There is another little issue on 3/8, since uclamp is in
v5.3-rc1 as well: commit 69842cba9ace8 ("sched/uclamp: Add CPU's clamp
buckets refcounting").


* Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-07-19 13:59 ` [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
@ 2019-07-22 11:07   ` Dietmar Eggemann
  2019-07-22 12:28     ` Juri Lelli
  2019-07-25 16:21   ` [tip:sched/core] " tip-bot for Juri Lelli
  1 sibling, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2019-07-22 11:07 UTC (permalink / raw)
  To: Juri Lelli, peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, cgroups

On 7/19/19 3:59 PM, Juri Lelli wrote:

[...]

> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
>  		double_lock_balance(rq, later_rq);
>  	}
>  
> +	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
> +		/*
> +		 * Inactive timer is armed (or callback is running, but
> +		 * waiting for us to release rq locks). In any case, when it
> +		 * will file (or continue), it will see running_bw of this

s/file/fire ?

> +		 * task migrated to later_rq (and correctly handle it).

Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
setting dl_se->dl_non_contending = 0 ?

[...]


* Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-07-22 11:07   ` Dietmar Eggemann
@ 2019-07-22 12:28     ` Juri Lelli
  2019-07-22 13:21       ` Dietmar Eggemann
  0 siblings, 1 reply; 42+ messages in thread
From: Juri Lelli @ 2019-07-22 12:28 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	cgroups

On 22/07/19 13:07, Dietmar Eggemann wrote:
> On 7/19/19 3:59 PM, Juri Lelli wrote:
> 
> [...]
> 
> > @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
> >  		double_lock_balance(rq, later_rq);
> >  	}
> >  
> > +	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
> > +		/*
> > +		 * Inactive timer is armed (or callback is running, but
> > +		 * waiting for us to release rq locks). In any case, when it
> > +		 * will file (or continue), it will see running_bw of this
> 
> s/file/fire ?

Yep.

> > +		 * task migrated to later_rq (and correctly handle it).
> 
> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
> setting dl_se->dl_non_contending = 0 ?

No, this is related to inactive_task_timer() callback. Since the task is
migrated (by this function calling set_task_cpu()) because a CPU hotplug
operation happened, we need to reflect this w.r.t. running_bw, or
inactive_task_timer() might sub from the new CPU and cause running_bw to
underflow.

Thanks,

Juri


* Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-07-22 12:28     ` Juri Lelli
@ 2019-07-22 13:21       ` Dietmar Eggemann
  2019-07-22 13:35         ` Juri Lelli
  0 siblings, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2019-07-22 13:21 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	cgroups

On 7/22/19 2:28 PM, Juri Lelli wrote:
> On 22/07/19 13:07, Dietmar Eggemann wrote:
>> On 7/19/19 3:59 PM, Juri Lelli wrote:
>>
>> [...]
>>
>>> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
>>>  		double_lock_balance(rq, later_rq);
>>>  	}
>>>  
>>> +	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
>>> +		/*
>>> +		 * Inactive timer is armed (or callback is running, but
>>> +		 * waiting for us to release rq locks). In any case, when it
>>> +		 * will file (or continue), it will see running_bw of this
>>
>> s/file/fire ?
> 
> Yep.
> 
>>> +		 * task migrated to later_rq (and correctly handle it).
>>
>> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
>> setting dl_se->dl_non_contending = 0 ?
> 
> No, this is related to inactive_task_timer() callback. Since the task is
> migrated (by this function calling set_task_cpu()) because a CPU hotplug
> operation happened, we need to reflect this w.r.t. running_bw, or
> inactive_task_timer() might sub from the new CPU and cause running_bw to
> underflow.

I was more referring to the '... it will see running_bw of thus task
migrated to later_rq ...) and specifically to the HOW the timer
callback can detect this. I should have made this clearer.
inactive_task_timer() checks if (dl_se->dl_non_contending == 0) so I
thought I have to find the place where dl_se->dl_non_contending is set 0?





* Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-07-22 13:21       ` Dietmar Eggemann
@ 2019-07-22 13:35         ` Juri Lelli
  2019-07-22 14:29           ` Dietmar Eggemann
  0 siblings, 1 reply; 42+ messages in thread
From: Juri Lelli @ 2019-07-22 13:35 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	cgroups

On 22/07/19 15:21, Dietmar Eggemann wrote:
> On 7/22/19 2:28 PM, Juri Lelli wrote:
> > On 22/07/19 13:07, Dietmar Eggemann wrote:
> >> On 7/19/19 3:59 PM, Juri Lelli wrote:
> >>
> >> [...]
> >>
> >>> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
> >>>  		double_lock_balance(rq, later_rq);
> >>>  	}
> >>>  
> >>> +	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
> >>> +		/*
> >>> +		 * Inactive timer is armed (or callback is running, but
> >>> +		 * waiting for us to release rq locks). In any case, when it
> >>> +		 * will file (or continue), it will see running_bw of this
> >>
> >> s/file/fire ?
> > 
> > Yep.
> > 
> >>> +		 * task migrated to later_rq (and correctly handle it).
> >>
> >> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
> >> setting dl_se->dl_non_contending = 0 ?
> > 
> > No, this is related to inactive_task_timer() callback. Since the task is
> > migrated (by this function calling set_task_cpu()) because a CPU hotplug
> > operation happened, we need to reflect this w.r.t. running_bw, or
> > inactive_task_timer() might sub from the new CPU and cause running_bw to
> > underflow.
> 
> I was more referring to the '... it will see running_bw of this task
> migrated to later_rq ...)' and specifically to HOW the timer
> callback can detect this.

Oh, it actually doesn't "actively" detect this condition. The problem is
that if it still sees dl_non_contending == 1, it will sub (from the
"new" rq to which task's running_bw hasn't been added - w/o this fix)
and cause the underflow.
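
As a concrete illustration, here is a toy userspace model of that sequence
(not kernel code; names and numbers are made up for the example). Without
the fix, nothing is added to the destination rq before the timer subtracts,
so the unsigned counter wraps:

/* Toy model of the running_bw underflow (illustrative only). */
#include <stdio.h>
#include <stdint.h>

struct toy_rq { uint64_t running_bw; };

static void toy_sub_running_bw(uint64_t bw, struct toy_rq *rq) { rq->running_bw -= bw; }
static void toy_add_running_bw(uint64_t bw, struct toy_rq *rq) { rq->running_bw += bw; }

int main(void)
{
	uint64_t task_bw = 120000;                          /* task's dl_bw          */
	struct toy_rq old_rq   = { .running_bw = task_bw }; /* outgoing (offline) rq */
	struct toy_rq later_rq = { .running_bw = 0 };       /* destination rq        */
	int fixed = 0;                                      /* flip to 1 for the fix */

	/* dl_task_offline_migration(): with the fix, running_bw follows the task. */
	if (fixed) {
		toy_sub_running_bw(task_bw, &old_rq);
		toy_add_running_bw(task_bw, &later_rq);
	}

	/*
	 * inactive_task_timer() fires later, still sees dl_non_contending == 1
	 * and subtracts from the rq the task now lives on, i.e. later_rq.
	 */
	toy_sub_running_bw(task_bw, &later_rq);

	printf("later_rq.running_bw = %llu%s\n",
	       (unsigned long long)later_rq.running_bw,
	       fixed ? "" : " (wrapped around: underflow)");
	return 0;
}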

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-07-22 13:35         ` Juri Lelli
@ 2019-07-22 14:29           ` Dietmar Eggemann
  0 siblings, 0 replies; 42+ messages in thread
From: Dietmar Eggemann @ 2019-07-22 14:29 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	cgroups

On 7/22/19 3:35 PM, Juri Lelli wrote:
> On 22/07/19 15:21, Dietmar Eggemann wrote:
>> On 7/22/19 2:28 PM, Juri Lelli wrote:
>>> On 22/07/19 13:07, Dietmar Eggemann wrote:
>>>> On 7/19/19 3:59 PM, Juri Lelli wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
>>>>>  		double_lock_balance(rq, later_rq);
>>>>>  	}
>>>>>  
>>>>> +	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
>>>>> +		/*
>>>>> +		 * Inactive timer is armed (or callback is running, but
>>>>> +		 * waiting for us to release rq locks). In any case, when it
>>>>> +		 * will file (or continue), it will see running_bw of this
>>>>
>>>> s/file/fire ?
>>>
>>> Yep.
>>>
>>>>> +		 * task migrated to later_rq (and correctly handle it).
>>>>
>>>> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
>>>> setting dl_se->dl_non_contending = 0 ?
>>>
>>> No, this is related to inactive_task_timer() callback. Since the task is
>>> migrated (by this function calling set_task_cpu()) because a CPU hotplug
>>> operation happened, we need to reflect this w.r.t. running_bw, or
>>> inactive_task_timer() might sub from the new CPU and cause running_bw to
>>> underflow.
>>
>> I was more referring to the '... it will see running_bw of this task
>> migrated to later_rq ...)' and specifically to HOW the timer
>> callback can detect this.
> 
> Oh, it actually doesn't "actively" detect this condition. The problem is
> that if it still sees dl_non_contending == 1, it will sub (from the
> "new" rq to which task's running_bw hasn't been added - w/o this fix)
> and cause the underflow.

I was wrong ... enqueue_task_dl() is called with ENQUEUE_REPLENISH which
doesn't call task_contending(). The comment makes sense to me now.
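
For reference, the flag dispatch behind that observation looks roughly like
the sketch below (a simplified, from-memory rendering of enqueue_dl_entity()
around this kernel version, not a verbatim quote): the ENQUEUE_REPLENISH leg
taken by dl_task_timer() never reaches task_contending(), so
dl_non_contending is left untouched on that path.

/* Simplified sketch of enqueue_dl_entity()'s flag dispatch (not verbatim). */
static void enqueue_dl_entity(struct sched_dl_entity *dl_se,
			      struct sched_dl_entity *pi_se, int flags)
{
	if (flags & ENQUEUE_WAKEUP) {
		/* Wakeup path: the task becomes contending again. */
		task_contending(dl_se, flags);
		update_dl_entity(dl_se, pi_se);
	} else if (flags & ENQUEUE_REPLENISH) {
		/* dl_task_timer() path: only refill runtime/deadline. */
		replenish_dl_entity(dl_se, pi_se);
	}

	__enqueue_dl_entity(dl_se);
}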



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-22  8:32     ` Juri Lelli
  2019-07-22  9:08       ` Dietmar Eggemann
@ 2019-07-23 10:31       ` Peter Zijlstra
  2019-07-23 13:11         ` Tejun Heo
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2019-07-23 10:31 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Dietmar Eggemann, mingo, rostedt, tj, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	longman, cgroups

On Mon, Jul 22, 2019 at 10:32:14AM +0200, Juri Lelli wrote:

> Thanks for reporting. The set is based on cgroup/for-next (as of last
> week), though. I can of course rebase on tip/sched/core or mainline if
> needed.

TJ; I would like to take these patches through the scheduler tree if you
don't mind. Afaict there's no real conflict vs cgroup/for-next (I
applied the patches and then did a pull of cgroup/for-next which
finished without complaints).


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-22  9:08       ` Dietmar Eggemann
@ 2019-07-23 10:32         ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2019-07-23 10:32 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Juri Lelli, mingo, rostedt, tj, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	longman, cgroups

On Mon, Jul 22, 2019 at 11:08:17AM +0200, Dietmar Eggemann wrote:

> Not sure, there is another little issue on 3/8 since uclamp is in
> v5.3-rc1 as well commit 69842cba9ace8 ("sched/uclamp: Add CPU's clamp
> buckets refcounting").

Also, 8/8, but all conflicts are trivial and I've fixed them up.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-23 10:31       ` Peter Zijlstra
@ 2019-07-23 13:11         ` Tejun Heo
  2019-07-23 14:58           ` Juri Lelli
  0 siblings, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2019-07-23 13:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Dietmar Eggemann, mingo, rostedt, linux-kernel,
	luca.abeni, claudio, tommaso.cucinotta, bristot, mathieu.poirier,
	lizefan, longman, cgroups

On Tue, Jul 23, 2019 at 12:31:31PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 22, 2019 at 10:32:14AM +0200, Juri Lelli wrote:
> 
> > Thanks for reporting. The set is based on cgroup/for-next (as of last
> > week), though. I can of course rebase on tip/sched/core or mainline if
> > needed.
> 
> TJ; I would like to take these patches through the scheduler tree if you
> don't mind. Afaict there's no real conflict vs cgroup/for-next (I
> applied the patches and then did a pull of cgroup/for-next which
> finished without complaints).

Yeah, for sure, please go ahead.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()
  2019-07-23 13:11         ` Tejun Heo
@ 2019-07-23 14:58           ` Juri Lelli
  0 siblings, 0 replies; 42+ messages in thread
From: Juri Lelli @ 2019-07-23 14:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Dietmar Eggemann, mingo, rostedt, linux-kernel,
	luca.abeni, claudio, tommaso.cucinotta, bristot, mathieu.poirier,
	lizefan, longman, cgroups

On 23/07/19 06:11, Tejun Heo wrote:
> On Tue, Jul 23, 2019 at 12:31:31PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 22, 2019 at 10:32:14AM +0200, Juri Lelli wrote:
> > 
> > > Thanks for reporting. The set is based on cgroup/for-next (as of last
> > > week), though. I can of course rebase on tip/sched/core or mainline if
> > > needed.
> > 
> > TJ; I would like to take these patches through the scheduler tree if you
> > don't mind. Afaict there's no real conflict vs cgroup/for-next (I
> > applied the patches and then did a pull of cgroup/for-next which
> > finished without complaints).
> 
> Yeah, for sure, please go ahead.
> 
> Thanks.

Thanks!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting
  2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (7 preceding siblings ...)
  2019-07-19 14:00 ` [PATCH v9 8/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-07-24  8:45 ` Dietmar Eggemann
  8 siblings, 0 replies; 42+ messages in thread
From: Dietmar Eggemann @ 2019-07-24  8:45 UTC (permalink / raw)
  To: Juri Lelli, peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, longman, cgroups

On 7/19/19 3:59 PM, Juri Lelli wrote:
> Hi,
> 
> v9 of a series of patches, originally authored by Mathieu, with the intent
> of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> As originally reported by Steve [1], when hotplug and/or (certain)
> cpuset reconfiguration operations take place, DEADLINE bandwidth
> accounting information is lost since root domains are destroyed and
> recreated.
> 
> Mathieu's approach is based on restoring bandwidth accounting info on
> the newly created root domains by iterating through the (DEADLINE) tasks
> belonging to the configured cpuset(s).
> 
> Apart from the usual rebase on top of cgroup/for-next, this version
> 
>  - make cpuset_{can,cancel}_attach grab cpuset_rwsem for write (5/8 - Peter)
>  - moves v8 8/8 to 7/8 for bisectability (Peter)
>  - adds comment in changelog regarding normalize_rt_tasks() (8/8 - Peter)
> 
> Set also available at
> 
>  https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v9

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Test description:

Juno-r0 (Arm64 big.LITTLE [L b b L L L]) with 6 DL tasks
(12000/100000/100000).

Rt-app runs the DL workload for 10min.

After rt-app is launched, start a CPU hotplug stress test (random CPU
hotplug in/out, except for CPU1 (original capacity 1024, big)) during the
entire rt-app run.

Tests ran with 1 and 2 (exclusive cpusets ([0,3-5], [1-2])) root domains.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-07-19 13:59 ` [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
@ 2019-07-25 13:53   ` Ingo Molnar
  2019-07-25 13:56     ` Ingo Molnar
  2019-07-25 16:21   ` [tip:sched/core] cpusets: " tip-bot for Mathieu Poirier
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2019-07-25 13:53 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	dietmar.eggemann, cgroups


* Juri Lelli <juri.lelli@redhat.com> wrote:

> When the topology of root domains is modified by CPUset or CPUhotplug
> operations information about the current deadline bandwidth held in the
> root domain is lost.
> 
> This patch addresses the issue by recalculating the lost deadline
> bandwidth information by circling through the deadline tasks held in
> CPUsets and adding their current load to the root domain they are
> associated with.
> 
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

Was this commit written by Mathieu? If yes then it's missing a From line. 
If not then the Signed-off-by should probably be changed to Acked-by or 
Reviewed-by?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-07-25 13:53   ` Ingo Molnar
@ 2019-07-25 13:56     ` Ingo Molnar
  2019-07-25 14:07       ` Juri Lelli
  2019-07-25 16:00       ` Mathieu Poirier
  0 siblings, 2 replies; 42+ messages in thread
From: Ingo Molnar @ 2019-07-25 13:56 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	dietmar.eggemann, cgroups


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > When the topology of root domains is modified by CPUset or CPUhotplug
> > operations information about the current deadline bandwidth held in the
> > root domain is lost.
> > 
> > This patch addresses the issue by recalculating the lost deadline
> > bandwidth information by circling through the deadline tasks held in
> > CPUsets and adding their current load to the root domain they are
> > associated with.
> > 
> > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> 
> Was this commit written by Mathieu? If yes then it's missing a From line. 
> If not then the Signed-off-by should probably be changed to Acked-by or 
> Reviewed-by?

So for now I'm assuming that the original patch was written by Mathieu, 
with modifications by you. So I added his From line and extended the SOB 
chain with the additional information that you modified the patch:

    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    [ Various additional modifications. ]
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Let me know if that's not accurate!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-07-25 13:56     ` Ingo Molnar
@ 2019-07-25 14:07       ` Juri Lelli
  2019-07-25 16:00       ` Mathieu Poirier
  1 sibling, 0 replies; 42+ messages in thread
From: Juri Lelli @ 2019-07-25 14:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	dietmar.eggemann, cgroups

Hi,

On 25/07/19 15:56, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > 
> > * Juri Lelli <juri.lelli@redhat.com> wrote:
> > 
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > > 
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > > 
> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > 
> > Was this commit written by Mathieu? If yes then it's missing a From line. 
> > If not then the Signed-off-by should probably be changed to Acked-by or 
> > Reviewed-by?
> 
> So for now I'm assuming that the original patch was written by Mathieu, 
> with modifications by you. So I added his From line and extended the SOB 
> chain with the additional information that you modified the patch:
> 
>     Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>     Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
>     Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
>     [ Various additional modifications. ]
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> Let me know if that's not accurate!

Looks good to me, thanks. And sorry for the confusion.

Best,

Juri

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-07-25 13:56     ` Ingo Molnar
  2019-07-25 14:07       ` Juri Lelli
@ 2019-07-25 16:00       ` Mathieu Poirier
  1 sibling, 0 replies; 42+ messages in thread
From: Mathieu Poirier @ 2019-07-25 16:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Juri Lelli, Peter Zijlstra, Ingo Molnar, Steven Rostedt, tj,
	Linux Kernel Mailing List, luca.abeni, Claudio Scordino,
	Tommaso Cucinotta, Daniel Bristot de Oliveira, Li Zefan, longman,
	Dietmar Eggemann, cgroups

On Thu, 25 Jul 2019 at 07:56, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> >
> > * Juri Lelli <juri.lelli@redhat.com> wrote:
> >
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > >
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > >
> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> >
> > Was this commit written by Mathieu? If yes then it's missing a From line.
> > If not then the Signed-off-by should probably be changed to Acked-by or
> > Reviewed-by?
>
> So for now I'm assuming that the original patch was written by Mathieu,
> with modifications by you. So I added his From line and extended the SOB
> chain with the additional information that you modified the patch:
>
>     Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>     Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
>     Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
>     [ Various additional modifications. ]
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
>
> Let me know if that's not accurate!

You are correct - thanks,
Mathieu

>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [tip:sched/core] sched/topology: Add partition_sched_domains_locked()
  2019-07-19 13:59 ` [PATCH v9 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2019-07-25 16:19   ` tip-bot for Mathieu Poirier
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Mathieu Poirier @ 2019-07-25 16:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, peterz, tglx, mingo, tj, dietmar.eggemann,
	linux-kernel, mathieu.poirier

Commit-ID:  c22645f4c8f021fb1c5e7189eb1f968132cc0844
Gitweb:     https://git.kernel.org/tip/c22645f4c8f021fb1c5e7189eb1f968132cc0844
Author:     Mathieu Poirier <mathieu.poirier@linaro.org>
AuthorDate: Fri, 19 Jul 2019 15:59:53 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:51:57 +0200

sched/topology: Add partition_sched_domains_locked()

Introduce the partition_sched_domains_locked() function by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 17 +++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7863bb62d2ab..f341163fedc9 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -150,6 +150,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -194,6 +198,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4eea2c9bc732..5a174ae6ecf3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2169,16 +2169,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	bool __maybe_unused has_eas = false;
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -2261,6 +2261,15 @@ match3:
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:sched/core] sched/core: Streamline calls to task_rq_unlock()
  2019-07-19 13:59 ` [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
  2019-07-22  8:21   ` Dietmar Eggemann
@ 2019-07-25 16:20   ` tip-bot for Mathieu Poirier
  1 sibling, 0 replies; 42+ messages in thread
From: tip-bot for Mathieu Poirier @ 2019-07-25 16:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, dietmar.eggemann, linux-kernel, mathieu.poirier,
	torvalds, tglx, rostedt, hpa, tj

Commit-ID:  4b211f2b129dd1f6a6956bbc76e2f232c1ec3ad8
Gitweb:     https://git.kernel.org/tip/4b211f2b129dd1f6a6956bbc76e2f232c1ec3ad8
Author:     Mathieu Poirier <mathieu.poirier@linaro.org>
AuthorDate: Fri, 19 Jul 2019 15:59:54 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:51:57 +0200

sched/core: Streamline calls to task_rq_unlock()

Calls to task_rq_unlock() are done several times in the
__sched_setscheduler() function.  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location needs to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-3-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b22e55cebe8..1af3d2dc6b29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4712,8 +4712,8 @@ recheck:
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4731,8 +4731,8 @@ recheck:
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4745,8 +4745,8 @@ change:
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4761,8 +4761,8 @@ change:
 			 */
 			if (!cpumask_subset(span, p->cpus_ptr) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4781,8 +4781,8 @@ change:
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4840,6 +4840,10 @@ change:
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:sched/core] cpusets: Rebuild root domain deadline accounting information
  2019-07-19 13:59 ` [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
  2019-07-25 13:53   ` Ingo Molnar
@ 2019-07-25 16:21   ` tip-bot for Mathieu Poirier
  2019-11-04 18:57   ` [PATCH v9 3/8] cpuset: " Michal Koutný
  2022-12-16 23:35   ` Qais Yousef
  3 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Mathieu Poirier @ 2019-07-25 16:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, hpa, juri.lelli, linux-kernel, mingo, tglx, torvalds,
	mathieu.poirier, dietmar.eggemann

Commit-ID:  f9a25f776d780bfa3279f0b6e5f5cf3224997976
Gitweb:     https://git.kernel.org/tip/f9a25f776d780bfa3279f0b6e5f5cf3224997976
Author:     Mathieu Poirier <mathieu.poirier@linaro.org>
AuthorDate: Fri, 19 Jul 2019 15:59:55 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:55:01 +0200

cpusets: Rebuild root domain deadline accounting information

When the topology of root domains is modified by CPUset or CPUhotplug
operations, information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/cgroup.h         |  1 +
 include/linux/sched.h          |  5 ++++
 include/linux/sched/deadline.h |  8 ++++++
 kernel/cgroup/cgroup.c         |  2 +-
 kernel/cgroup/cpuset.c         | 64 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 30 ++++++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 13 ++++++++-
 8 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f6b048902d6c..3ba3e6da13a6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -150,6 +150,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
+void cgroup_enable_task_cg_lists(void);
 void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..b94ad92dfbe6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -295,6 +295,11 @@ enum uclamp_id {
 	UCLAMP_CNT
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 753afbca549f..4b5bc452176c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1891,7 +1891,7 @@ static int cgroup_reconfigure(struct fs_context *fc)
  */
 static bool use_task_css_set_links __read_mostly;
 
-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
 {
 	struct task_struct *p, *g;
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5aa37531ce76..846cbdb68566 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -45,6 +45,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -894,6 +895,67 @@ done:
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	lockdep_assert_held(&cpuset_mutex);
+	lockdep_assert_cpus_held();
+	lockdep_assert_held(&sched_domains_mutex);
+
+	cgroup_enable_task_cg_lists();
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -931,7 +993,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ef5b9f6b1d42..0f9d2180be23 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2283,6 +2283,36 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+
+	raw_spin_unlock(&dl_b->lock);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 16126efd14ed..7583faddba33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -778,9 +778,6 @@ struct root_domain {
 	struct perf_domain __rcu *pd;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5a174ae6ecf3..8f83e8e3ea9a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2203,8 +2203,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
-			    dattrs_equal(dattr_cur, i, dattr_new, j))
+			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:sched/core] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-07-19 13:59 ` [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
  2019-07-22 11:07   ` Dietmar Eggemann
@ 2019-07-25 16:21   ` tip-bot for Juri Lelli
  1 sibling, 0 replies; 42+ messages in thread
From: tip-bot for Juri Lelli @ 2019-07-25 16:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, juri.lelli, peterz, torvalds,
	dietmar.eggemann, tglx, hpa

Commit-ID:  59d06cea1198d665ba11f7e8c5f45b00ff2e4812
Gitweb:     https://git.kernel.org/tip/59d06cea1198d665ba11f7e8c5f45b00ff2e4812
Author:     Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 19 Jul 2019 15:59:56 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:55:02 +0200

sched/deadline: Fix bandwidth accounting at all levels after offline migration

If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up for this_bw, running_bw and total_bw when the replenishment
timer fires and the task is migrated (dl_task_offline_migration()).

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f9d2180be23..039dde2b1dac 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -529,6 +529,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
 static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
 {
 	struct rq *later_rq = NULL;
+	struct dl_bw *dl_b;
 
 	later_rq = find_lock_later_rq(p, rq);
 	if (!later_rq) {
@@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 		double_lock_balance(rq, later_rq);
 	}
 
+	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
+		/*
+		 * Inactive timer is armed (or callback is running, but
+		 * waiting for us to release rq locks). In any case, when it
+		 * will fire (or continue), it will see running_bw of this
+		 * task migrated to later_rq (and correctly handle it).
+		 */
+		sub_running_bw(&p->dl, &rq->dl);
+		sub_rq_bw(&p->dl, &rq->dl);
+
+		add_rq_bw(&p->dl, &later_rq->dl);
+		add_running_bw(&p->dl, &later_rq->dl);
+	} else {
+		sub_rq_bw(&p->dl, &rq->dl);
+		add_rq_bw(&p->dl, &later_rq->dl);
+	}
+
+	/*
+	 * And we finally need to fixup root_domain(s) bandwidth accounting,
+	 * since p is still hanging out in the old (now moved to default) root
+	 * domain.
+	 */
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
+	dl_b = &later_rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
 	set_task_cpu(p, later_rq->cpu);
 	double_unlock_balance(later_rq, rq);
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:sched/core] cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem
  2019-07-19 13:59 ` [PATCH v9 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
@ 2019-07-25 16:22   ` tip-bot for Juri Lelli
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Juri Lelli @ 2019-07-25 16:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dietmar.eggemann, peterz, linux-kernel, tglx, mingo, hpa,
	juri.lelli, torvalds

Commit-ID:  1243dc518c9da467da6635313a2dbb41b8ffc275
Gitweb:     https://git.kernel.org/tip/1243dc518c9da467da6635313a2dbb41b8ffc275
Author:     Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 19 Jul 2019 15:59:57 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:55:02 +0200

cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem

Holding cpuset_mutex means that cpusets are stable (only the holder can
make changes) and this is required for fixing a synchronization issue
between cpusets and scheduler core.  However, grabbing cpuset_mutex from
setscheduler() hotpath (as implemented in a later patch) is a no-go, as
it would create a bottleneck for tasks concurrently calling
setscheduler().

Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that
setscheduler() will then be able to read lock it and avoid concurrency
issues.
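
As an illustrative aside (not part of this patch), the read-side pairing
this conversion enables for the setscheduler() path is the usual percpu
r/w semaphore pattern sketched below; in practice it would go through a
small helper, since cpuset_rwsem is local to kernel/cgroup/cpuset.c:

	/* Sketch only: read side around the DL admission test. */
	percpu_down_read(&cpuset_rwsem);
	/* ... cpuset topology is stable while the check runs ... */
	percpu_up_read(&cpuset_rwsem);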

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-6-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/cgroup/cpuset.c | 68 ++++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 846cbdb68566..e1a8d168c5e9 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -333,7 +333,7 @@ static struct cpuset top_cpuset = {
  * guidelines for accessing subsystem state in kernel/cgroup.c
  */
 
-static DEFINE_MUTEX(cpuset_mutex);
+DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
 static DEFINE_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
@@ -913,7 +913,7 @@ static void rebuild_root_domains(void)
 	struct cpuset *cs = NULL;
 	struct cgroup_subsys_state *pos_css;
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 	lockdep_assert_cpus_held();
 	lockdep_assert_held(&sched_domains_mutex);
 
@@ -973,7 +973,7 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	int ndoms;
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 	get_online_cpus();
 
 	/*
@@ -1005,9 +1005,9 @@ static void rebuild_sched_domains_locked(void)
 
 void rebuild_sched_domains(void)
 {
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	rebuild_sched_domains_locked();
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /**
@@ -1113,7 +1113,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	bool part_error = false;	/* Partition error? */
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 
 	/*
 	 * The parent must be a partition root.
@@ -2101,7 +2101,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* allow moving tasks into an empty cpuset if on default hierarchy */
 	ret = -ENOSPC;
@@ -2125,7 +2125,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cs->attach_in_progress++;
 	ret = 0;
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return ret;
 }
 
@@ -2135,9 +2135,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_first(tset, &css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	css_cs(css)->attach_in_progress--;
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /*
@@ -2160,7 +2160,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* prepare for attach */
 	if (cs == &top_cpuset)
@@ -2214,7 +2214,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	if (!cs->attach_in_progress)
 		wake_up(&cpuset_attach_wq);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2245,7 +2245,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = 0;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
 		goto out_unlock;
@@ -2281,7 +2281,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return retval;
 }
 
@@ -2292,7 +2292,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = -ENODEV;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2305,7 +2305,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return retval;
 }
 
@@ -2344,7 +2344,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	kernfs_break_active_protection(of->kn);
 	flush_work(&cpuset_hotplug_work);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2368,7 +2368,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
 	free_cpuset(trialcs);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
 	flush_workqueue(cpuset_migrate_mm_wq);
@@ -2499,13 +2499,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		return -EINVAL;
 
 	css_get(&cs->css);
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
 	retval = update_prstate(cs, val);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	css_put(&cs->css);
 	return retval ?: nbytes;
 }
@@ -2711,7 +2711,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (!parent)
 		return 0;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	set_bit(CS_ONLINE, &cs->flags);
 	if (is_spread_page(parent))
@@ -2762,7 +2762,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return 0;
 }
 
@@ -2781,7 +2781,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	if (is_partition_root(cs))
 		update_prstate(cs, 0);
@@ -2800,7 +2800,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)
@@ -2812,7 +2812,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
@@ -2825,7 +2825,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 	}
 
 	spin_unlock_irq(&callback_lock);
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /*
@@ -2867,6 +2867,8 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 
 int __init cpuset_init(void)
 {
+	BUG_ON(percpu_init_rwsem(&cpuset_rwsem));
+
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
@@ -2938,7 +2940,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
 		   nodes_empty(cs->mems_allowed);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/*
 	 * Move tasks to the nearest ancestor with execution resources,
@@ -2948,7 +2950,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	if (is_empty)
 		remove_tasks_in_empty_cpuset(cs);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 }
 
 static void
@@ -2998,14 +3000,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 retry:
 	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/*
 	 * We have raced with task attaching. We wait until attaching
 	 * is finished, so we won't attach a task to an empty cpuset.
 	 */
 	if (cs->attach_in_progress) {
-		mutex_unlock(&cpuset_mutex);
+		percpu_up_write(&cpuset_rwsem);
 		goto retry;
 	}
 
@@ -3073,7 +3075,7 @@ update_tasks:
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /**
@@ -3103,7 +3105,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	if (on_dfl && !alloc_cpumasks(NULL, &tmp))
 		ptmp = &tmp;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* fetch the available cpus/mems and find out which changed how */
 	cpumask_copy(&new_cpus, cpu_active_mask);
@@ -3153,7 +3155,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		update_tasks_nodemask(&top_cpuset);
 	}
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/* if cpus or mems changed, we need to propagate to descendants */
 	if (cpus_updated || mems_updated) {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:sched/core] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
  2019-07-19 13:59 ` [PATCH v9 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
@ 2019-07-25 16:23   ` tip-bot for Juri Lelli
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Juri Lelli @ 2019-07-25 16:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: juri.lelli, linux-kernel, dietmar.eggemann, mingo, hpa, tglx,
	torvalds, peterz

Commit-ID:  d74b27d63a8bebe2fe634944e4ebdc7b10db7a39
Gitweb:     https://git.kernel.org/tip/d74b27d63a8bebe2fe634944e4ebdc7b10db7a39
Author:     Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 19 Jul 2019 15:59:58 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:55:03 +0200

cgroup/cpuset: Change cpuset_rwsem and hotplug lock order

cpuset_rwsem is going to be acquired from sched_setscheduler() with a
following patch. There are however paths (e.g., spawn_ksoftirqd) in
which sched_setscheduler() is eventually called while holding the hotplug lock;
this creates a dependency between the hotplug lock (to be always acquired
first) and cpuset_rwsem (to be always acquired after hotplug lock).

Fix paths which currently take the two locks in the wrong order (after
a following patch is applied).

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-7-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/cpuset.h |  8 ++++----
 kernel/cgroup/cpuset.c | 22 +++++++++++++++++-----
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..7f1478c26a33 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -40,14 +40,14 @@ static inline bool cpusets_enabled(void)
 
 static inline void cpuset_inc(void)
 {
-	static_branch_inc(&cpusets_pre_enable_key);
-	static_branch_inc(&cpusets_enabled_key);
+	static_branch_inc_cpuslocked(&cpusets_pre_enable_key);
+	static_branch_inc_cpuslocked(&cpusets_enabled_key);
 }
 
 static inline void cpuset_dec(void)
 {
-	static_branch_dec(&cpusets_enabled_key);
-	static_branch_dec(&cpusets_pre_enable_key);
+	static_branch_dec_cpuslocked(&cpusets_enabled_key);
+	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
 extern int cpuset_init(void);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e1a8d168c5e9..5c5014caa23c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -973,8 +973,8 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	int ndoms;
 
+	lockdep_assert_cpus_held();
 	percpu_rwsem_assert_held(&cpuset_rwsem);
-	get_online_cpus();
 
 	/*
 	 * We have raced with CPU hotplug. Don't do anything to avoid
@@ -983,19 +983,17 @@ static void rebuild_sched_domains_locked(void)
 	 */
 	if (!top_cpuset.nr_subparts_cpus &&
 	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	if (top_cpuset.nr_subparts_cpus &&
 	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
 	partition_and_rebuild_sched_domains(ndoms, doms, attr);
-out:
-	put_online_cpus();
 }
 #else /* !CONFIG_SMP */
 static void rebuild_sched_domains_locked(void)
@@ -1005,9 +1003,11 @@ static void rebuild_sched_domains_locked(void)
 
 void rebuild_sched_domains(void)
 {
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	rebuild_sched_domains_locked();
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 /**
@@ -2245,6 +2245,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = 0;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
@@ -2282,6 +2283,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	}
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2292,6 +2294,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = -ENODEV;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2306,6 +2309,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	}
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2344,6 +2348,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	kernfs_break_active_protection(of->kn);
 	flush_work(&cpuset_hotplug_work);
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2369,6 +2374,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	free_cpuset(trialcs);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
 	flush_workqueue(cpuset_migrate_mm_wq);
@@ -2499,6 +2505,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		return -EINVAL;
 
 	css_get(&cs->css);
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -2506,6 +2513,7 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 	retval = update_prstate(cs, val);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	css_put(&cs->css);
 	return retval ?: nbytes;
 }
@@ -2711,6 +2719,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (!parent)
 		return 0;
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 
 	set_bit(CS_ONLINE, &cs->flags);
@@ -2763,6 +2772,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	spin_unlock_irq(&callback_lock);
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return 0;
 }
 
@@ -2781,6 +2791,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
+	get_online_cpus();
 	percpu_down_write(&cpuset_rwsem);
 
 	if (is_partition_root(cs))
@@ -2801,6 +2812,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	clear_bit(CS_ONLINE, &cs->flags);
 
 	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:sched/core] rcu/tree: Call setscheduler() on gp kthread to SCHED_FIFO outside of atomic region
  2019-07-19 13:59 ` [PATCH v9 7/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region Juri Lelli
@ 2019-07-25 16:24   ` tip-bot for Juri Lelli
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Juri Lelli @ 2019-07-25 16:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: juri.lelli, hpa, mingo, peterz, linux-kernel, torvalds, tglx,
	dietmar.eggemann

Commit-ID:  1a763fd7c6335e3122c1cc09576ef6c99ada4267
Gitweb:     https://git.kernel.org/tip/1a763fd7c6335e3122c1cc09576ef6c99ada4267
Author:     Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 19 Jul 2019 15:59:59 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:55:03 +0200

rcu/tree: Call setscheduler() on gp kthread to SCHED_FIFO outside of atomic region

sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving the sched_setscheduler_nocheck() call outside
of the atomic region, as it doesn't actually need to be guarded by the
rcu_node lock.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-8-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/rcu/tree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..eb764c24bc4d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3234,13 +3234,13 @@ static int __init rcu_spawn_gp_kthread(void)
 	t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__))
 		return 0;
+	if (kthread_prio)
+		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
 	rnp = rcu_get_root();
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	rcu_state.gp_kthread = t;
-	if (kthread_prio) {
+	if (kthread_prio)
 		sp.sched_priority = kthread_prio;
-		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
-	}
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	wake_up_process(t);
 	rcu_spawn_nocb_kthreads();
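
The problem being fixed, in a nutshell (a simplified sketch of the
pre-patch shape, not the verbatim source): once
sched_setscheduler_nocheck() has to take cpuset_rwsem, a sleeping lock,
it can no longer be called under a raw spinlock with interrupts
disabled:

	raw_spin_lock_irqsave_rcu_node(rnp, flags);	/* atomic context */
	rcu_state.gp_kthread = t;
	sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);	/* may now sleep on cpuset_rwsem */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

The hunk above therefore performs the priority change before the
rcu_node lock is taken.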

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:sched/core] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-07-19 14:00 ` [PATCH v9 8/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-07-25 16:24   ` tip-bot for Juri Lelli
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Juri Lelli @ 2019-07-25 16:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, peterz, linux-kernel, tglx, mingo, juri.lelli,
	dietmar.eggemann

Commit-ID:  710da3c8ea7dfbd327920afd3831d8c82c42789d
Gitweb:     https://git.kernel.org/tip/710da3c8ea7dfbd327920afd3831d8c82c42789d
Author:     Juri Lelli <juri.lelli@redhat.com>
AuthorDate: Fri, 19 Jul 2019 16:00:00 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:55:04 +0200

sched/core: Prevent race condition between cpuset and __sched_setscheduler()

No synchronisation mechanism exists between the cpuset subsystem and
calls to function __sched_setscheduler(). As such, it is possible that
new root domains are created on the cpuset side while a deadline
acceptance test is carried out in __sched_setscheduler(), leading to a
potential oversell of CPU bandwidth.

Grab the cpuset_rwsem read lock from the core scheduler so as to
prevent situations such as the one described above from happening.

The only exception is normalize_rt_tasks(), which needs to work under
tasklist_lock and therefore can't grab cpuset_rwsem. We are fine with
this, as the function is only called by sysrq and, if that gets
triggered, DEADLINE guarantees are already out the window anyway.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/cpuset.h |  5 +++++
 kernel/cgroup/cpuset.c | 11 +++++++++++
 kernel/sched/core.c    | 20 +++++++++++++++++---
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 7f1478c26a33..04c20de66afc 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_lock(void);
+extern void cpuset_read_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,9 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_lock(void) { }
+static inline void cpuset_read_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5c5014caa23c..c52bc91f882b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -334,6 +334,17 @@ static struct cpuset top_cpuset = {
  */
 
 DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
+
+void cpuset_read_lock(void)
+{
+	percpu_down_read(&cpuset_rwsem);
+}
+
+void cpuset_read_unlock(void)
+{
+	percpu_up_read(&cpuset_rwsem);
+}
+
 static DEFINE_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1af3d2dc6b29..1bceb22dac18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4698,6 +4698,9 @@ recheck:
 			return retval;
 	}
 
+	if (pi)
+		cpuset_read_lock();
+
 	/*
 	 * Make sure no PI-waiters arrive (or leave) while we are
 	 * changing the priority of the task:
@@ -4772,6 +4775,8 @@ change:
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
 		task_rq_unlock(rq, p, &rf);
+		if (pi)
+			cpuset_read_unlock();
 		goto recheck;
 	}
 
@@ -4832,8 +4837,10 @@ change:
 	preempt_disable();
 	task_rq_unlock(rq, p, &rf);
 
-	if (pi)
+	if (pi) {
+		cpuset_read_unlock();
 		rt_mutex_adjust_pi(p);
+	}
 
 	/* Run balance callbacks after we've adjusted the PI chain: */
 	balance_callback(rq);
@@ -4843,6 +4850,8 @@ change:
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
+	if (pi)
+		cpuset_read_unlock();
 	return retval;
 }
 
@@ -4927,10 +4936,15 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setscheduler(p, policy, &lparam);
+	if (likely(p))
+		get_task_struct(p);
 	rcu_read_unlock();
 
+	if (likely(p)) {
+		retval = sched_setscheduler(p, policy, &lparam);
+		put_task_struct(p);
+	}
+
 	return retval;
 }
 

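The resulting nesting in __sched_setscheduler(), as introduced by the
hunks above, can be sketched like this (simplified, not the complete
function):

	if (pi)
		cpuset_read_lock();	/* keep root domains stable for the DL admission test */

	rq = task_rq_lock(p, &rf);
	/* ... policy checks and the deadline bandwidth admission test ... */
	task_rq_unlock(rq, p, &rf);

	if (pi) {
		cpuset_read_unlock();
		rt_mutex_adjust_pi(p);
	}
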
^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-07-19 13:59 ` [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
  2019-07-25 13:53   ` Ingo Molnar
  2019-07-25 16:21   ` [tip:sched/core] cpusets: " tip-bot for Mathieu Poirier
@ 2019-11-04 18:57   ` Michal Koutný
  2019-11-05 12:03     ` Michal Koutný
  2022-12-16 23:35   ` Qais Yousef
  3 siblings, 1 reply; 42+ messages in thread
From: Michal Koutný @ 2019-11-04 18:57 UTC (permalink / raw)
  To: Juri Lelli, mathieu.poirier
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, lizefan, longman, dietmar.eggemann,
	cgroups

Hello.

I came across a cgroup_enable_task_cg_lists() call from the cpuset
controller...

On Fri, Jul 19, 2019 at 03:59:55PM +0200, Juri Lelli <juri.lelli@redhat.com> wrote:
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> [...]
> +static void rebuild_root_domains(void)
> +{
> +	struct cpuset *cs = NULL;
> +	struct cgroup_subsys_state *pos_css;
> +
> +	lockdep_assert_held(&cpuset_mutex);
> +	lockdep_assert_cpus_held();
> +	lockdep_assert_held(&sched_domains_mutex);
> +
> +	cgroup_enable_task_cg_lists();
...and I wonder why it is necessary to call it at this place?

(IIUC, before the cpuset hierarchy is mounted anywhere there is merely
the top_cpuset, i.e. processing the top_cpuset alone is enough. And if
anyone wants to create any non-root cpusets, they have to mount the
hierarchy first, i.e. there is no need to call
cgroup_enable_task_cg_lists() manually. Also, if I'm not overlooking
anything, the race between hotplug and mount (more precisely, new
cpuset creation) should be synchronized by cpuset_mutex.)

Thanks,
Michal

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-11-04 18:57   ` [PATCH v9 3/8] cpuset: " Michal Koutný
@ 2019-11-05 12:03     ` Michal Koutný
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Koutný @ 2019-11-05 12:03 UTC (permalink / raw)
  To: Juri Lelli, mathieu.poirier
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, lizefan, longman, dietmar.eggemann,
	cgroups

On Mon, Nov 04, 2019 at 07:57:42PM +0100, Michal Koutný <mkoutny@suse.com> wrote:
> I came across a cgroup_enable_task_cg_lists() call from the cpuset
> controller...
>[...]
> > +	cgroup_enable_task_cg_lists();
> ...and I wonder why it is necessary to call it at this place?
Consider this resolved.

I realized the on-demand linking is being removed [1].

Michal

[1] https://lore.kernel.org/lkml/20191024190351.GD3622521@devbig004.ftw2.facebook.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2019-07-19 13:59 ` [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
                     ` (2 preceding siblings ...)
  2019-11-04 18:57   ` [PATCH v9 3/8] cpuset: " Michal Koutný
@ 2022-12-16 23:35   ` Qais Yousef
  2022-12-17  2:26     ` Waiman Long
  2022-12-19  8:07     ` Vincent Guittot
  3 siblings, 2 replies; 42+ messages in thread
From: Qais Yousef @ 2022-12-16 23:35 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, longman,
	dietmar.eggemann, cgroups, Vincent Guittot, Wei Wang, Rick Yiu,
	Quentin Perret

Hi

On 07/19/19 15:59, Juri Lelli wrote:
> When the topology of root domains is modified by CPUset or CPUhotplug
> operations information about the current deadline bandwidth held in the
> root domain is lost.
> 
> This patch addresses the issue by recalculating the lost deadline
> bandwidth information by circling through the deadline tasks held in
> CPUsets and adding their current load to the root domain they are
> associated with.
> 
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> ---

We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
consistently) on suspend/resume.

Do we actually need to rebuild_root_domain() if we're going through
a suspend/resume cycle?

ie: would something like the below make sense? We'd skip this logic if
cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
we're suspending/resuming.


Cheers

--
Qais Yousef


--->8---


From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
From: Qais Yousef <qyousef@layalina.io>
Date: Tue, 29 Nov 2022 19:01:52 +0000
Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume

Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
enabled rebuilding of the sched domains on cpuset and hotplug
operations to correct deadline accounting.

Rebuilding sched domains is a slow operation and we see 10+ ms delays
on suspend-resume because of it.

Since nothing is expected to change across a suspend-resume operation,
skip rebuilding the sched domains to regain some of the lost time.

Debugged-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/cgroup/cpuset.c  | 6 ++++++
 kernel/sched/deadline.c | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b474289c15b8..2ff68d625b7b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
+	if (cpuhp_tasks_frozen)
+		return;
+
 	css_task_iter_start(&cs->css, 0, &it);
 
 	while ((task = css_task_iter_next(&it)))
@@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
 	lockdep_assert_cpus_held();
 	lockdep_assert_held(&sched_domains_mutex);
 
+	if (cpuhp_tasks_frozen)
+		return;
+
 	rcu_read_lock();
 
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0d97d54276cc..42c1143a3956 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2575,6 +2575,9 @@ void dl_clear_root_domain(struct root_domain *rd)
 {
 	unsigned long flags;
 
+	if (cpuhp_tasks_frozen)
+		return;
+
 	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
 	rd->dl_bw.total_bw = 0;
 	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
-- 
2.25.1
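
For context, cpuhp_tasks_frozen is what lets hotplug callbacks tell a
suspend/resume-driven CPU offline/online apart from a regular one: it
is set from the tasks_frozen argument in the hotplug core, which is
only non-zero on the frozen-tasks (suspend/resume) path. Roughly (a
sketch from memory of kernel/cpu.c of that era, not verbatim source):

	static int _cpu_down(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
	{
		/* ... */
		cpuhp_tasks_frozen = tasks_frozen;	/* 1 via freeze_secondary_cpus(), 0 for normal hotplug */
		/* ... */
	}

_cpu_up() does the same on the online side.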


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2022-12-16 23:35   ` Qais Yousef
@ 2022-12-17  2:26     ` Waiman Long
  2022-12-17 22:10       ` Qais Yousef
  2022-12-19  8:07     ` Vincent Guittot
  1 sibling, 1 reply; 42+ messages in thread
From: Waiman Long @ 2022-12-17  2:26 UTC (permalink / raw)
  To: Qais Yousef, Juri Lelli
  Cc: peterz, mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	dietmar.eggemann, cgroups, Vincent Guittot, Wei Wang, Rick Yiu,
	Quentin Perret

On 12/16/22 18:35, Qais Yousef wrote:
> Hi
>
> On 07/19/19 15:59, Juri Lelli wrote:
>> When the topology of root domains is modified by CPUset or CPUhotplug
>> operations information about the current deadline bandwidth held in the
>> root domain is lost.
>>
>> This patch addresses the issue by recalculating the lost deadline
>> bandwidth information by circling through the deadline tasks held in
>> CPUsets and adding their current load to the root domain they are
>> associated with.
>>
>> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
>> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
>> ---
> We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> consistently) on suspend/resume.
>
> Do we actually need to rebuild_root_domain() if we're going through
> a suspend/resume cycle?
>
> ie: would something like the below make sense? We'd skip this logic if
> cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
> we're suspending/resuming.
>
>
> Cheers
>
> --
> Qais Yousef
>
>
> --->8---
>
>
>  From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
> From: Qais Yousef <qyousef@layalina.io>
> Date: Tue, 29 Nov 2022 19:01:52 +0000
> Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume
>
> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> enabled rebuilding of the sched domains on cpuset and hotplug
> operations to correct deadline accounting.
>
> Rebuilding sched domains is a slow operation and we see 10+ ms delays
> on suspend-resume because of it.
>
> Since nothing is expected to change across a suspend-resume operation,
> skip rebuilding the sched domains to regain some of the lost time.
>
> Debugged-by: Rick Yiu <rickyiu@google.com>
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>   kernel/cgroup/cpuset.c  | 6 ++++++
>   kernel/sched/deadline.c | 3 +++
>   2 files changed, 9 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b474289c15b8..2ff68d625b7b 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
>   	struct css_task_iter it;
>   	struct task_struct *task;
>   
> +	if (cpuhp_tasks_frozen)
> +		return;
> +
>   	css_task_iter_start(&cs->css, 0, &it);
>   
>   	while ((task = css_task_iter_next(&it)))
> @@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
>   	lockdep_assert_cpus_held();
>   	lockdep_assert_held(&sched_domains_mutex);
>   
> +	if (cpuhp_tasks_frozen)
> +		return;
> +
>   	rcu_read_lock();
>   
>   	/*

rebuild_root_domains() is the only caller of update_tasks_root_domain(). 
So the first hunk is redundant as update_tasks_root_domain() won't be 
called when cpuhp_tasks_frozen is set.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2022-12-17  2:26     ` Waiman Long
@ 2022-12-17 22:10       ` Qais Yousef
  0 siblings, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2022-12-17 22:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Juri Lelli, peterz, mingo, rostedt, tj, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	dietmar.eggemann, cgroups, Vincent Guittot, Wei Wang, Rick Yiu,
	Quentin Perret

On 12/16/22 21:26, Waiman Long wrote:
> On 12/16/22 18:35, Qais Yousef wrote:
> > Hi
> > 
> > On 07/19/19 15:59, Juri Lelli wrote:
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > > 
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > > 
> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > > ---
> > We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> > consistently) on suspend/resume.
> > 
> > Do we actually need to rebuild_root_domain() if we're going through
> > a suspend/resume cycle?
> > 
> > ie: would something like the below make sense? We'd skip this logic if
> > cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
> > we're suspending/resuming.
> > 
> > 
> > Cheers
> > 
> > --
> > Qais Yousef
> > 
> > 
> > --->8---
> > 
> > 
> >  From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
> > From: Qais Yousef <qyousef@layalina.io>
> > Date: Tue, 29 Nov 2022 19:01:52 +0000
> > Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume
> > 
> > Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> > enabled rebuilding of the sched domains on cpuset and hotplug
> > operations to correct deadline accounting.
> >
> > Rebuilding sched domains is a slow operation and we see 10+ ms delays
> > on suspend-resume because of it.
> >
> > Since nothing is expected to change across a suspend-resume operation,
> > skip rebuilding the sched domains to regain some of the lost time.
> > 
> > Debugged-by: Rick Yiu <rickyiu@google.com>
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >   kernel/cgroup/cpuset.c  | 6 ++++++
> >   kernel/sched/deadline.c | 3 +++
> >   2 files changed, 9 insertions(+)
> > 
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index b474289c15b8..2ff68d625b7b 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
> >   	struct css_task_iter it;
> >   	struct task_struct *task;
> > +	if (cpuhp_tasks_frozen)
> > +		return;
> > +
> >   	css_task_iter_start(&cs->css, 0, &it);
> >   	while ((task = css_task_iter_next(&it)))
> > @@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
> >   	lockdep_assert_cpus_held();
> >   	lockdep_assert_held(&sched_domains_mutex);
> > +	if (cpuhp_tasks_frozen)
> > +		return;
> > +
> >   	rcu_read_lock();
> >   	/*
> 
> rebuild_root_domains() is the only caller of update_tasks_root_domain(). So
> the first hunk is redundant as update_tasks_root_domain() won't be called
> when cpuhp_tasks_frozen is set.

True. I was overzealous and protected all of the functions.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2022-12-16 23:35   ` Qais Yousef
  2022-12-17  2:26     ` Waiman Long
@ 2022-12-19  8:07     ` Vincent Guittot
  2022-12-20 11:43       ` Qais Yousef
  1 sibling, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2022-12-19  8:07 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Juri Lelli, peterz, mingo, rostedt, tj, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	longman, dietmar.eggemann, cgroups, Wei Wang, Rick Yiu,
	Quentin Perret

On Sat, 17 Dec 2022 at 00:35, Qais Yousef <qyousef@layalina.io> wrote:
>
> Hi
>
> On 07/19/19 15:59, Juri Lelli wrote:
> > When the topology of root domains is modified by CPUset or CPUhotplug
> > operations information about the current deadline bandwidth held in the
> > root domain is lost.
> >
> > This patch addresses the issue by recalculating the lost deadline
> > bandwidth information by circling through the deadline tasks held in
> > CPUsets and adding their current load to the root domain they are
> > associated with.
> >
> > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > ---
>
> We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> consistently) on suspend/resume.
>
> Do we actually need to rebuild_root_domain() if we're going through
> a suspend/resume cycle?

During suspend to ram there are CPU hotplug operations, but if you use
suspend to idle you will skip the CPU hotplug operations and their
associated rebuild.

>
> ie: would something like the below make sense? We'd skip this logic if
> cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
> we're suspending/resuming.
>
>
> Cheers
>
> --
> Qais Yousef
>
>
> --->8---
>
>
> From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
> From: Qais Yousef <qyousef@layalina.io>
> Date: Tue, 29 Nov 2022 19:01:52 +0000
> Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume
>
> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> enabled rebuilding of the sched domains on cpuset and hotplug
> operations to correct deadline accounting.
>
> Rebuilding sched domains is a slow operation and we see 10+ ms delays
> on suspend-resume because of it.
>
> Since nothing is expected to change across a suspend-resume operation,
> skip rebuilding the sched domains to regain some of the lost time.
>
> Debugged-by: Rick Yiu <rickyiu@google.com>
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/cgroup/cpuset.c  | 6 ++++++
>  kernel/sched/deadline.c | 3 +++
>  2 files changed, 9 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b474289c15b8..2ff68d625b7b 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
>         struct css_task_iter it;
>         struct task_struct *task;
>
> +       if (cpuhp_tasks_frozen)
> +               return;
> +
>         css_task_iter_start(&cs->css, 0, &it);
>
>         while ((task = css_task_iter_next(&it)))
> @@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
>         lockdep_assert_cpus_held();
>         lockdep_assert_held(&sched_domains_mutex);
>
> +       if (cpuhp_tasks_frozen)
> +               return;
> +
>         rcu_read_lock();
>
>         /*
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0d97d54276cc..42c1143a3956 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2575,6 +2575,9 @@ void dl_clear_root_domain(struct root_domain *rd)
>  {
>         unsigned long flags;
>
> +       if (cpuhp_tasks_frozen)
> +               return;
> +
>         raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
>         rd->dl_bw.total_bw = 0;
>         raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2022-12-19  8:07     ` Vincent Guittot
@ 2022-12-20 11:43       ` Qais Yousef
  2023-01-15 16:41         ` Qais Yousef
  0 siblings, 1 reply; 42+ messages in thread
From: Qais Yousef @ 2022-12-20 11:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Juri Lelli, peterz, mingo, rostedt, tj, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	longman, dietmar.eggemann, cgroups, Wei Wang, Rick Yiu,
	Quentin Perret

On 12/19/22 09:07, Vincent Guittot wrote:
> On Sat, 17 Dec 2022 at 00:35, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > Hi
> >
> > On 07/19/19 15:59, Juri Lelli wrote:
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > >
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > >
> > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > > ---
> >
> > We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> > consistently) on suspend/resume.
> >
> > Do we actually need to rebuild_root_domain() if we're going through
> > a suspend/resume cycle?
> 
> During suspend to ram there are CPU hotplug operations, but if you use
> suspend to idle you will skip the CPU hotplug operations and their
> associated rebuild.

Thanks Vincent. I'll check on that - but what if we want to keep suspend to ram?
Do we really need to incur this hit?


Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information
  2022-12-20 11:43       ` Qais Yousef
@ 2023-01-15 16:41         ` Qais Yousef
  0 siblings, 0 replies; 42+ messages in thread
From: Qais Yousef @ 2023-01-15 16:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Juri Lelli, peterz, mingo, rostedt, tj, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	longman, dietmar.eggemann, cgroups, Wei Wang, Rick Yiu,
	Quentin Perret

On 12/20/22 11:43, Qais Yousef wrote:
> On 12/19/22 09:07, Vincent Guittot wrote:
> > On Sat, 17 Dec 2022 at 00:35, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > Hi
> > >
> > > On 07/19/19 15:59, Juri Lelli wrote:
> > > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > > operations information about the current deadline bandwidth held in the
> > > > root domain is lost.
> > > >
> > > > This patch addresses the issue by recalculating the lost deadline
> > > > bandwidth information by circling through the deadline tasks held in
> > > > CPUsets and adding their current load to the root domain they are
> > > > associated with.
> > > >
> > > > Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> > > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > > > ---
> > >
> > > We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> > > consistently) on suspend/resume.
> > >
> > > Do we actually need to rebuild_root_domain() if we're going through
> > > a suspend/resume cycle?
> > 
> > During suspend to ram there are CPU hotplug operations, but if you use
> > suspend to idle you will skip the CPU hotplug operations and their
> > associated rebuild.
> 
> Thanks Vincent. I'll check on that - but what if we want to keep suspend to ram?
> Do we really need to incur this hit?

Using s2idle is actually not an option. I'll prepare a v2 to address
Waiman's comment if I don't get more feedback in the next few days.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2023-01-15 16:41 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-19 13:59 [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2019-07-19 13:59 ` [PATCH v9 1/8] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
2019-07-25 16:19   ` [tip:sched/core] sched/topology: Add partition_sched_domains_locked() tip-bot for Mathieu Poirier
2019-07-19 13:59 ` [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
2019-07-22  8:21   ` Dietmar Eggemann
2019-07-22  8:32     ` Juri Lelli
2019-07-22  9:08       ` Dietmar Eggemann
2019-07-23 10:32         ` Peter Zijlstra
2019-07-23 10:31       ` Peter Zijlstra
2019-07-23 13:11         ` Tejun Heo
2019-07-23 14:58           ` Juri Lelli
2019-07-25 16:20   ` [tip:sched/core] sched/core: Streamle " tip-bot for Mathieu Poirier
2019-07-19 13:59 ` [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information Juri Lelli
2019-07-25 13:53   ` Ingo Molnar
2019-07-25 13:56     ` Ingo Molnar
2019-07-25 14:07       ` Juri Lelli
2019-07-25 16:00       ` Mathieu Poirier
2019-07-25 16:21   ` [tip:sched/core] cpusets: " tip-bot for Mathieu Poirier
2019-11-04 18:57   ` [PATCH v9 3/8] cpuset: " Michal Koutný
2019-11-05 12:03     ` Michal Koutný
2022-12-16 23:35   ` Qais Yousef
2022-12-17  2:26     ` Waiman Long
2022-12-17 22:10       ` Qais Yousef
2022-12-19  8:07     ` Vincent Guittot
2022-12-20 11:43       ` Qais Yousef
2023-01-15 16:41         ` Qais Yousef
2019-07-19 13:59 ` [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
2019-07-22 11:07   ` Dietmar Eggemann
2019-07-22 12:28     ` Juri Lelli
2019-07-22 13:21       ` Dietmar Eggemann
2019-07-22 13:35         ` Juri Lelli
2019-07-22 14:29           ` Dietmar Eggemann
2019-07-25 16:21   ` [tip:sched/core] " tip-bot for Juri Lelli
2019-07-19 13:59 ` [PATCH v9 5/8] cgroup/cpuset: convert cpuset_mutex to percpu_rwsem Juri Lelli
2019-07-25 16:22   ` [tip:sched/core] cgroup/cpuset: Convert " tip-bot for Juri Lelli
2019-07-19 13:59 ` [PATCH v9 6/8] cgroup/cpuset: Change cpuset_rwsem and hotplug lock order Juri Lelli
2019-07-25 16:23   ` [tip:sched/core] " tip-bot for Juri Lelli
2019-07-19 13:59 ` [PATCH v9 7/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region Juri Lelli
2019-07-25 16:24   ` [tip:sched/core] rcu/tree: Call setschedule() " tip-bot for Juri Lelli
2019-07-19 14:00 ` [PATCH v9 8/8] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
2019-07-25 16:24   ` [tip:sched/core] " tip-bot for Juri Lelli
2019-07-24  8:45 ` [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting Dietmar Eggemann
