* [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting
@ 2019-01-17  8:47 Juri Lelli
  2019-01-17  8:47 ` [PATCH v6 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Juri Lelli @ 2019-01-17  8:47 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

Hi,

v6 of a series of patches, originally authored by Mathieu, with the intent
of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
As originally reported by Steve [1], when hotplug and/or (certain)
cpuset reconfiguration operations take place, DEADLINE bandwidth
accounting information is lost since root domains are destroyed and
recreated.

Mathieu's approach is based on restoring bandwidth accounting info on
the newly created root domains by iterating through the (DEADLINE) tasks
belonging to the configured cpuset(s).
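
In condensed form the restore step looks roughly like the sketch below
(RCU, refcounting and subtree skipping are omitted here, see 5/5 for
the full version):

  cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
          struct css_task_iter it;
          struct task_struct *task;

          /* empty cpusets contribute no bandwidth */
          if (cpumask_empty(cs->effective_cpus))
                  continue;

          css_task_iter_start(&cs->css, 0, &it);
          while ((task = css_task_iter_next(&it)))
                  dl_add_task_root_domain(task);
          css_task_iter_end(&it);
  }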

Apart from some minor refactoring needed to rebase the set on top of
Waiman Long's cpuset for cgroup series (now mainline), two changes are
worth noting:

 - added some more descriptive comments about why callback_lock gives
   the holder read-only access to cpusets [Steve] 04/05
 - call cgroup_enable_task_cg_lists() if we need to traverse the list of
   tasks belonging to a particular cgroup (to rebuild its bandwidth),
   but the list is not yet ready (this can happen, for example, if CPUs
   are hotplugged during early boot stages) 05/05

Set also available at

 https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v6

Thanks,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966

Juri Lelli (1):
  cgroup/cpuset: make callback_lock raw

Mathieu Poirier (4):
  sched/topology: Adding function partition_sched_domains_locked()
  sched/core: Streamlining calls to task_rq_unlock()
  sched/core: Prevent race condition between cpuset and
    __sched_setscheduler()
  cpuset: Rebuild root domain deadline accounting information

 include/linux/cgroup.h         |   1 +
 include/linux/cpuset.h         |   6 ++
 include/linux/sched.h          |   5 ++
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |  10 +++
 kernel/cgroup/cgroup.c         |   2 +-
 kernel/cgroup/cpuset.c         | 159 +++++++++++++++++++++++++--------
 kernel/sched/core.c            |  34 ++++---
 kernel/sched/deadline.c        |  31 +++++++
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  32 +++++--
 11 files changed, 234 insertions(+), 57 deletions(-)

-- 
2.17.2



* [PATCH v6 1/5] sched/topology: Adding function partition_sched_domains_locked()
  2019-01-17  8:47 [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
@ 2019-01-17  8:47 ` Juri Lelli
  2019-01-17  8:47 ` [PATCH v6 2/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2019-01-17  8:47 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Introduce function partition_sched_domains_locked() by taking the
mutex locking code out of the original function.  That way the work
done by partition_sched_domains_locked() can be reused without
dropping the mutex lock.

No change of functionality is introduced by this patch.
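
A later patch in this series calls the new function with the mutex
already held; roughly (sketch):

  mutex_lock(&sched_domains_mutex);
  partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
  /* more rebuild work that must see a stable set of root domains */
  mutex_unlock(&sched_domains_mutex);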

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 17 +++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 6b9976180c1e..d69e7d56253a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -162,6 +162,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -206,6 +210,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8d7f15ba5916..dd94000cb378 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1929,15 +1929,15 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -2002,6 +2002,15 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
-- 
2.17.2



* [PATCH v6 2/5] sched/core: Streamlining calls to task_rq_unlock()
  2019-01-17  8:47 [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-01-17  8:47 ` [PATCH v6 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2019-01-17  8:47 ` Juri Lelli
  2019-01-17  8:47 ` [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2019-01-17  8:47 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler().  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location needs to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.
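
With this change every failure path funnels through a single label,
along the lines of (sketch):

  if (p == rq->stop) {
          retval = -EINVAL;
          goto unlock;
  }

  /* ... further checks, each setting retval and jumping to unlock ... */

  unlock:
          task_rq_unlock(rq, p, &rf);
          return retval;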

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 091e089063be..f5263383170e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4228,8 +4228,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4245,8 +4245,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4259,8 +4259,8 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4275,8 +4275,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4295,8 +4295,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4352,6 +4352,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,
-- 
2.17.2



* [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw
  2019-01-17  8:47 [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-01-17  8:47 ` [PATCH v6 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
  2019-01-17  8:47 ` [PATCH v6 2/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2019-01-17  8:47 ` Juri Lelli
  2019-02-04 11:55   ` Peter Zijlstra
  2019-02-04 12:02   ` Peter Zijlstra
  2019-01-17  8:47 ` [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Juri Lelli @ 2019-01-17  8:47 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

callback_lock grants the holder read-only access to cpusets.  To fix
a synchronization issue between cpusets and the scheduler core, it is
now required to make callback_lock available to core scheduler code.

Convert callback_lock to a raw_spin_lock, so that it will always be
safe to acquire it from atomic context.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v5->v6: minor changes required to rebase on top of Waiman Long's cpuset
for cgroup v2 set.
---
 kernel/cgroup/cpuset.c | 70 +++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f0decd8165e7..54b8ff5de7c8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -345,7 +345,7 @@ static struct cpuset top_cpuset = {
  */
 
 static DEFINE_MUTEX(cpuset_mutex);
-static DEFINE_SPINLOCK(callback_lock);
+static DEFINE_RAW_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
 
@@ -1220,7 +1220,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * Newly added CPUs will be removed from effective_cpus and
 	 * newly deleted ones will be added back to effective_cpus.
 	 */
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (adding) {
 		cpumask_or(parent->subparts_cpus,
 			   parent->subparts_cpus, tmp->addmask);
@@ -1239,7 +1239,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	}
 
 	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	return cmd == partcmd_update;
 }
@@ -1344,7 +1344,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		if (cp->nr_subparts_cpus &&
@@ -1375,7 +1375,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 					= cpumask_weight(cp->subparts_cpus);
 			}
 		}
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -1493,7 +1493,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			return -EINVAL;
 	}
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 
 	/*
@@ -1504,7 +1504,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			       cs->cpus_allowed);
 		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	update_cpumasks_hier(cs, &tmp);
 
@@ -1698,9 +1698,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -1768,9 +1768,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
@@ -1861,9 +1861,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
 		rebuild_sched_domains_locked();
@@ -2366,7 +2366,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	cpuset_filetype_t type = seq_cft(sf)->private;
 	int ret = 0;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -2388,7 +2388,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		ret = -EINVAL;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	return ret;
 }
 
@@ -2698,14 +2698,14 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
 	cpuset_inc();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
 		cs->use_parent_ecpus = true;
 		parent->child_ecpus_count++;
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -2732,12 +2732,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	}
 	rcu_read_unlock();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	return 0;
@@ -2790,7 +2790,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
 	mutex_lock(&cpuset_mutex);
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
 		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -2801,7 +2801,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -2902,12 +2902,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 {
 	bool is_empty;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, new_cpus);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/*
 	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
@@ -2944,10 +2944,10 @@ hotplug_update_tasks(struct cpuset *cs,
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
 		update_tasks_cpumask(cs);
@@ -3102,7 +3102,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		/*
@@ -3122,17 +3122,17 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 			}
 		}
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		update_tasks_nodemask(&top_cpuset);
 	}
 
@@ -3233,11 +3233,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_cpus(task_cs(tsk), pmask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 }
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
@@ -3285,11 +3285,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	nodemask_t mask;
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &mask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
 }
@@ -3381,14 +3381,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 
 	rcu_read_lock();
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 	rcu_read_unlock();
 
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
 
-- 
2.17.2



* [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-01-17  8:47 [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (2 preceding siblings ...)
  2019-01-17  8:47 ` [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
@ 2019-01-17  8:47 ` Juri Lelli
  2019-02-04 12:10   ` Peter Zijlstra
  2019-01-17  8:47 ` [PATCH v6 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
  2019-01-18 16:17 ` [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Tejun Heo
  5 siblings, 1 reply; 21+ messages in thread
From: Juri Lelli @ 2019-01-17  8:47 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

No synchronisation mechanism exists between the cpuset subsystem and calls
to function __sched_setscheduler(). As such, it is possible that new root
domains are created on the cpuset side while a deadline acceptance test
is carried out in __sched_setscheduler(), leading to a potential oversell
of CPU bandwidth.

Grab callback_lock from the core scheduler, so as to prevent situations
such as the one described above from happening.
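
With this patch the DL admission path in __sched_setscheduler() ends
up looking roughly like (sketch, full diff below):

  rq = task_rq_lock(p, &rf);
  cpuset_read_only_lock();        /* root domains are stable from here */

  /* ... policy checks and DL admission test against rq->rd->dl_bw ... */

  cpuset_read_only_unlock();
  task_rq_unlock(rq, p, &rf);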

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v5->v6: Added more descriptive comments about why callback_lock gives
the holder read-only access to cpusets.
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 25 ++++++++++++++++++++++++-
 kernel/sched/core.c    | 10 ++++++++++
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..8e5a8dd0622b 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_only_lock(void);
+extern void cpuset_read_only_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,10 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_only_lock(void) { }
+
+static inline void cpuset_read_only_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 54b8ff5de7c8..8ee1b24b6353 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -330,7 +330,8 @@ static struct cpuset top_cpuset = {
  * __alloc_pages().
  *
  * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * access to cpusets. Mind that callback_lock might be grabbed from other
+ * subsystems as well (via cpuset_read_only_lock()).
  *
  * Now, the task_struct fields mems_allowed and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
@@ -3218,6 +3219,28 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_read_only_lock - Grab the callback_lock from cpuset subsystem.
+ *
+ * Description: As described in full details the comment above cpuset_mutex
+ * and callback_lock definitions, holding callback_lock gives the holder
+ * read-only access to cpusets. Even though it might look counter-intuitive
+ * (as callback_lock is a spinlock), in fact a task must hold both
+ * callback_lock _and_ cpuset_mutex to modify cpusets (write access).
+ */
+void cpuset_read_only_lock(void)
+{
+	raw_spin_lock(&callback_lock);
+}
+
+/**
+ * cpuset_read_only_unlock - Release the callback_lock from cpuset subsystem.
+ */
+void cpuset_read_only_unlock(void)
+{
+	raw_spin_unlock(&callback_lock);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5263383170e..d928a42b8852 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4224,6 +4224,13 @@ static int __sched_setscheduler(struct task_struct *p,
 	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
+	/*
+	 * Make sure we don't race with the cpuset subsystem where root
+	 * domains can be rebuilt or modified while operations like DL
+	 * admission checks are carried out.
+	 */
+	cpuset_read_only_lock();
+
 	/*
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
@@ -4285,6 +4292,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Re-check policy now with rq lock held: */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
+		cpuset_read_only_unlock();
 		task_rq_unlock(rq, p, &rf);
 		goto recheck;
 	}
@@ -4342,6 +4350,7 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	/* Avoid rq from going away on us: */
 	preempt_disable();
+	cpuset_read_only_unlock();
 	task_rq_unlock(rq, p, &rf);
 
 	if (pi)
@@ -4354,6 +4363,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	return 0;
 
 unlock:
+	cpuset_read_only_unlock();
 	task_rq_unlock(rq, p, &rf);
 	return retval;
 }
-- 
2.17.2



* [PATCH v6 5/5] cpuset: Rebuild root domain deadline accounting information
  2019-01-17  8:47 [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (3 preceding siblings ...)
  2019-01-17  8:47 ` [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-01-17  8:47 ` Juri Lelli
  2019-01-18 16:17 ` [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Tejun Heo
  5 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2019-01-17  8:47 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

When the topology of root domains is modified by cpuset or CPU hotplug
operations, information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information: it iterates through the DEADLINE tasks held in
cpusets and adds their current bandwidth to the root domain they are
associated with.
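
The per-task step boils down to the following (a sketch of
dl_add_task_root_domain(), see the diff below for the actual locking):

  rq = task_rq_lock(p, &rf);
  if (dl_task(p)) {
          raw_spin_lock_irqsave(&rq->rd->dl_bw.lock, flags);
          rq->rd->dl_bw.total_bw += p->dl.dl_bw;
          raw_spin_unlock_irqrestore(&rq->rd->dl_bw.lock, flags);
  }
  task_rq_unlock(rq, p, &rf);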

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v5 -> v6: Since the list of tasks that belong to each cgroup is lazily
built only the first time cgroups are mounted, to be able to use it for
restoring deadline accounting information we need to make sure it is
also built when cgroup support is enabled but cgroups are not yet
mounted (early CPU hotplug operations can hit this condition).
---
 include/linux/cgroup.h         |  1 +
 include/linux/sched.h          |  5 +++
 include/linux/sched/deadline.h |  8 +++++
 kernel/cgroup/cgroup.c         |  2 +-
 kernel/cgroup/cpuset.c         | 64 +++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 31 ++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 15 ++++++--
 8 files changed, 122 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d12757a65b0..ffa97eea5dfd 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,6 +141,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
+void cgroup_enable_task_cg_lists(void);
 void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a51c13c2b1a0..cb3b14f6069b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -280,6 +280,11 @@ struct vtime {
 	u64			gtime;
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 879c9f191f66..bc51786a4ae6 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1802,7 +1802,7 @@ static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
  */
 static bool use_task_css_set_links __read_mostly;
 
-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
 {
 	struct task_struct *p, *g;
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8ee1b24b6353..338cd5149436 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -44,6 +44,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -934,6 +935,67 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	lockdep_assert_held(&cpuset_mutex);
+	lockdep_assert_cpus_held();
+	lockdep_assert_held(&sched_domains_mutex);
+
+	cgroup_enable_task_cg_lists();
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -971,7 +1033,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91e4202b0634..63d36958cbba 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2285,6 +2285,37 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+	dl_b->total_bw += p->dl.dl_bw;
+
+	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 618577fc9aa8..f34cc648c613 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -757,9 +757,6 @@ struct root_domain {
 	unsigned long		max_cpu_capacity;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dd94000cb378..dc3468b4b883 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1961,9 +1961,20 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	/* Destroy deleted domains: */
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
-			if (cpumask_equal(doms_cur[i], doms_new[j])
-			    && dattrs_equal(dattr_cur, i, dattr_new, j))
+			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
+			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.17.2



* Re: [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting
  2019-01-17  8:47 [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (4 preceding siblings ...)
  2019-01-17  8:47 ` [PATCH v6 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
@ 2019-01-18 16:17 ` Tejun Heo
  2019-01-18 16:46   ` Juri Lelli
  5 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2019-01-18 16:17 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thu, Jan 17, 2019 at 09:47:34AM +0100, Juri Lelli wrote:
> Hi,
> 
> v6 of a series of patches, originally authored by Mathieu, with the intent
> of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> As originally reported by Steve [1], when hotplug and/or (certain)
> cpuset reconfiguration operations take place, DEADLINE bandwidth
> accounting information is lost since root domains are destroyed and
> recreated.
> 
> Mathieu's approach is based on restoring bandwidth accounting info on
> the newly created root domains by iterating through the (DEADLINE) tasks
> belonging to the configured cpuset(s).
> 
> Apart from some minor refactoring needed to rebase the set on top of
> Waiman Long's cpuset for cgroup series (now mainline), two changes worth
> of notice:

Generally looks good to me but can you please ask Waiman to take a
look?

Thanks.

-- 
tejun


* Re: [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting
  2019-01-18 16:17 ` [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Tejun Heo
@ 2019-01-18 16:46   ` Juri Lelli
  2019-02-04  9:02     ` Juri Lelli
  0 siblings, 1 reply; 21+ messages in thread
From: Juri Lelli @ 2019-01-18 16:46 UTC (permalink / raw)
  To: longman
  Cc: peterz, mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	tj

On 18/01/19 08:17, Tejun Heo wrote:
> On Thu, Jan 17, 2019 at 09:47:34AM +0100, Juri Lelli wrote:
> > Hi,
> > 
> > v6 of a series of patches, originally authored by Mathieu, with the intent
> > of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> > As originally reported by Steve [1], when hotplug and/or (certain)
> > cpuset reconfiguration operations take place, DEADLINE bandwidth
> > accounting information is lost since root domains are destroyed and
> > recreated.
> > 
> > Mathieu's approach is based on restoring bandwidth accounting info on
> > the newly created root domains by iterating through the (DEADLINE) tasks
> > belonging to the configured cpuset(s).
> > 
> > Apart from some minor refactoring needed to rebase the set on top of
> > Waiman Long's cpuset for cgroup series (now mainline), two changes worth
> > of notice:
> 
> Generally looks good to me but can you please ask Waiman to take a
> look?

Argh! I should have cc-ed him in the first instance.

Thanks for reviewing.

Waiman, do you see anything wrong with this series? Thanks!

https://lore.kernel.org/lkml/20190117084739.17078-1-juri.lelli@redhat.com/


* Re: [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting
  2019-01-18 16:46   ` Juri Lelli
@ 2019-02-04  9:02     ` Juri Lelli
  2019-02-04 12:18       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Juri Lelli @ 2019-02-04  9:02 UTC (permalink / raw)
  To: longman
  Cc: peterz, mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	tj

On 18/01/19 17:46, Juri Lelli wrote:
> On 18/01/19 08:17, Tejun Heo wrote:
> > On Thu, Jan 17, 2019 at 09:47:34AM +0100, Juri Lelli wrote:
> > > Hi,
> > > 
> > > v6 of a series of patches, originally authored by Mathieu, with the intent
> > > of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> > > As originally reported by Steve [1], when hotplug and/or (certain)
> > > cpuset reconfiguration operations take place, DEADLINE bandwidth
> > > accounting information is lost since root domains are destroyed and
> > > recreated.
> > > 
> > > Mathieu's approach is based on restoring bandwidth accounting info on
> > > the newly created root domains by iterating through the (DEADLINE) tasks
> > > belonging to the configured cpuset(s).
> > > 
> > > Apart from some minor refactoring needed to rebase the set on top of
> > > Waiman Long's cpuset for cgroup series (now mainline), two changes worth
> > > of notice:
> > 
> > Generally looks good to me but can you please ask Waiman to take a
> > look?
> 
> Argh! I should have cc-ed him in the first instance.
> 
> Thanks for reviewing.
> 
> Waiman, do you see anything wrong with this series? Thanks!
> 
> https://lore.kernel.org/lkml/20190117084739.17078-1-juri.lelli@redhat.com/

Ping?

Thanks,

- Juri


* Re: [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw
  2019-01-17  8:47 ` [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
@ 2019-02-04 11:55   ` Peter Zijlstra
  2019-02-05  9:18     ` Juri Lelli
  2019-02-04 12:02   ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-02-04 11:55 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thu, Jan 17, 2019 at 09:47:37AM +0100, Juri Lelli wrote:
> @@ -2366,7 +2366,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
>  	cpuset_filetype_t type = seq_cft(sf)->private;
>  	int ret = 0;
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  
>  	switch (type) {
>  	case FILE_CPULIST:
> @@ -2388,7 +2388,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
>  		ret = -EINVAL;
>  	}
>  
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  	return ret;
>  }
>  
could we not convert this to use cpuset_mutex ?


* Re: [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw
  2019-01-17  8:47 ` [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
  2019-02-04 11:55   ` Peter Zijlstra
@ 2019-02-04 12:02   ` Peter Zijlstra
  2019-02-04 12:07     ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-02-04 12:02 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thu, Jan 17, 2019 at 09:47:37AM +0100, Juri Lelli wrote:
> @@ -3233,11 +3233,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>  {
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&callback_lock, flags);
> +	raw_spin_lock_irqsave(&callback_lock, flags);
>  	rcu_read_lock();
>  	guarantee_online_cpus(task_cs(tsk), pmask);
>  	rcu_read_unlock();
> -	spin_unlock_irqrestore(&callback_lock, flags);
> +	raw_spin_unlock_irqrestore(&callback_lock, flags);
>  }
>  
>  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> @@ -3285,11 +3285,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
>  	nodemask_t mask;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&callback_lock, flags);
> +	raw_spin_lock_irqsave(&callback_lock, flags);
>  	rcu_read_lock();
>  	guarantee_online_mems(task_cs(tsk), &mask);
>  	rcu_read_unlock();
> -	spin_unlock_irqrestore(&callback_lock, flags);
> +	raw_spin_unlock_irqrestore(&callback_lock, flags);
>  
>  	return mask;
>  }
> @@ -3381,14 +3381,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
>  		return true;
>  
>  	/* Not hardwall and node outside mems_allowed: scan up cpusets */
> -	spin_lock_irqsave(&callback_lock, flags);
> +	raw_spin_lock_irqsave(&callback_lock, flags);
>  
>  	rcu_read_lock();
>  	cs = nearest_hardwall_ancestor(task_cs(current));
>  	allowed = node_isset(node, cs->mems_allowed);
>  	rcu_read_unlock();
>  
> -	spin_unlock_irqrestore(&callback_lock, flags);
> +	raw_spin_unlock_irqrestore(&callback_lock, flags);
>  	return allowed;
>  }

These three appear to be a user-controlled O(n) (depth of cgroup tree).
Which is basically bad for raw_spinlock_t.

The Changelog should really have mentioned this; and ideally we'd
somehow avoid this.


* Re: [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw
  2019-02-04 12:02   ` Peter Zijlstra
@ 2019-02-04 12:07     ` Peter Zijlstra
  2019-02-05  9:18       ` Juri Lelli
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-02-04 12:07 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Mon, Feb 04, 2019 at 01:02:36PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 17, 2019 at 09:47:37AM +0100, Juri Lelli wrote:
> > @@ -3233,11 +3233,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> >  {
> >  	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&callback_lock, flags);
> > +	raw_spin_lock_irqsave(&callback_lock, flags);
> >  	rcu_read_lock();
> >  	guarantee_online_cpus(task_cs(tsk), pmask);
> >  	rcu_read_unlock();
> > -	spin_unlock_irqrestore(&callback_lock, flags);
> > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> >  }
> >  
> >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > @@ -3285,11 +3285,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> >  	nodemask_t mask;
> >  	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&callback_lock, flags);
> > +	raw_spin_lock_irqsave(&callback_lock, flags);
> >  	rcu_read_lock();
> >  	guarantee_online_mems(task_cs(tsk), &mask);
> >  	rcu_read_unlock();
> > -	spin_unlock_irqrestore(&callback_lock, flags);
> > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> >  
> >  	return mask;
> >  }
> > @@ -3381,14 +3381,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
> >  		return true;
> >  
> >  	/* Not hardwall and node outside mems_allowed: scan up cpusets */
> > -	spin_lock_irqsave(&callback_lock, flags);
> > +	raw_spin_lock_irqsave(&callback_lock, flags);
> >  
> >  	rcu_read_lock();
> >  	cs = nearest_hardwall_ancestor(task_cs(current));
> >  	allowed = node_isset(node, cs->mems_allowed);
> >  	rcu_read_unlock();
> >  
> > -	spin_unlock_irqrestore(&callback_lock, flags);
> > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> >  	return allowed;
> >  }
> 
> These three appear to be a user-controlled O(n) (depth of cgroup tree).
> Which is basically bad for raw_spinlock_t.
> 
> The Changelog should really have mentioned this; and ideally we'd
> somehow avoid this.

N/m avoiding it; we have this all over the place, just mention it..


* Re: [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-01-17  8:47 ` [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-02-04 12:10   ` Peter Zijlstra
  2019-02-05  9:51     ` Juri Lelli
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-02-04 12:10 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Thu, Jan 17, 2019 at 09:47:38AM +0100, Juri Lelli wrote:
> No synchronisation mechanism exists between the cpuset subsystem and calls
> to function __sched_setscheduler(). As such, it is possible that new root
> domains are created on the cpuset side while a deadline acceptance test
> is carried out in __sched_setscheduler(), leading to a potential oversell
> of CPU bandwidth.
> 
> Grab callback_lock from core scheduler, so to prevent situations such as
> the one described above from happening.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5263383170e..d928a42b8852 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4224,6 +4224,13 @@ static int __sched_setscheduler(struct task_struct *p,
>  	rq = task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
>  
> +	/*
> +	 * Make sure we don't race with the cpuset subsystem where root
> +	 * domains can be rebuilt or modified while operations like DL
> +	 * admission checks are carried out.
> +	 */
> +	cpuset_read_only_lock();
> +
>  	/*
>  	 * Changing the policy of the stop threads its a very bad idea:
>  	 */
> @@ -4285,6 +4292,7 @@ static int __sched_setscheduler(struct task_struct *p,
>  	/* Re-check policy now with rq lock held: */
>  	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
>  		policy = oldpolicy = -1;
> +		cpuset_read_only_unlock();
>  		task_rq_unlock(rq, p, &rf);
>  		goto recheck;
>  	}
> @@ -4342,6 +4350,7 @@ static int __sched_setscheduler(struct task_struct *p,
>  
>  	/* Avoid rq from going away on us: */
>  	preempt_disable();
> +	cpuset_read_only_unlock();
>  	task_rq_unlock(rq, p, &rf);
>  
>  	if (pi)
> @@ -4354,6 +4363,7 @@ static int __sched_setscheduler(struct task_struct *p,
>  	return 0;
>  
>  unlock:
> +	cpuset_read_only_unlock();
>  	task_rq_unlock(rq, p, &rf);
>  	return retval;
>  }

Why take callback_lock inside rq->lock and not the other way around?
AFAICT there is no pre-existing order so we can pick one here.


* Re: [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting
  2019-02-04  9:02     ` Juri Lelli
@ 2019-02-04 12:18       ` Peter Zijlstra
  2019-02-04 18:45         ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-02-04 12:18 UTC (permalink / raw)
  To: Juri Lelli
  Cc: longman, mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	tj

On Mon, Feb 04, 2019 at 10:02:11AM +0100, Juri Lelli wrote:
> On 18/01/19 17:46, Juri Lelli wrote:
> > On 18/01/19 08:17, Tejun Heo wrote:
> > > On Thu, Jan 17, 2019 at 09:47:34AM +0100, Juri Lelli wrote:
> > > > Hi,
> > > > 
> > > > v6 of a series of patches, originally authored by Mathieu, with the intent
> > > > of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> > > > As originally reported by Steve [1], when hotplug and/or (certain)
> > > > cpuset reconfiguration operations take place, DEADLINE bandwidth
> > > > accounting information is lost since root domains are destroyed and
> > > > recreated.
> > > > 
> > > > Mathieu's approach is based on restoring bandwidth accounting info on
> > > > the newly created root domains by iterating through the (DEADLINE) tasks
> > > > belonging to the configured cpuset(s).
> > > > 
> > > > Apart from some minor refactoring needed to rebase the set on top of
> > > > Waiman Long's cpuset for cgroup series (now mainline), two changes worth
> > > > of notice:
> > > 
> > > Generally looks good to me but can you please ask Waiman to take a
> > > look?
> > 
> > Argh! I should have cc-ed him in the first instance.
> > 
> > Thanks for reviewing.
> > 
> > Waiman, do you see anything wrong with this series? Thanks!
> > 
> > https://lore.kernel.org/lkml/20190117084739.17078-1-juri.lelli@redhat.com/
> 
> Ping?

Basically looks OK to me; although I think I prefer the callback_lock /
rq->lock ordering to be the other way around.

Waiman, you OK with this one?


* Re: [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting
  2019-02-04 12:18       ` Peter Zijlstra
@ 2019-02-04 18:45         ` Waiman Long
  2019-02-05  9:18           ` Juri Lelli
  0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2019-02-04 18:45 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli
  Cc: mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups,
	tj

On 02/04/2019 07:18 AM, Peter Zijlstra wrote:
> On Mon, Feb 04, 2019 at 10:02:11AM +0100, Juri Lelli wrote:
>> On 18/01/19 17:46, Juri Lelli wrote:
>>> On 18/01/19 08:17, Tejun Heo wrote:
>>>> On Thu, Jan 17, 2019 at 09:47:34AM +0100, Juri Lelli wrote:
>>>>> Hi,
>>>>>
>>>>> v6 of a series of patches, originally authored by Mathieu, with the intent
>>>>> of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
>>>>> As originally reported by Steve [1], when hotplug and/or (certain)
>>>>> cpuset reconfiguration operations take place, DEADLINE bandwidth
>>>>> accounting information is lost since root domains are destroyed and
>>>>> recreated.
>>>>>
>>>>> Mathieu's approach is based on restoring bandwidth accounting info on
>>>>> the newly created root domains by iterating through the (DEADLINE) tasks
>>>>> belonging to the configured cpuset(s).
>>>>>
>>>>> Apart from some minor refactoring needed to rebase the set on top of
>>>>> Waiman Long's cpuset for cgroup series (now mainline), two changes worth
>>>>> of notice:
>>>> Generally looks good to me but can you please ask Waiman to take a
>>>> look?
>>> Argh! I should have cc-ed him in the first instance.
>>>
>>> Thanks for reviewing.
>>>
>>> Waiman, do you see anything wrong with this series? Thanks!
>>>
>>> https://lore.kernel.org/lkml/20190117084739.17078-1-juri.lelli@redhat.com/
>> Ping?
> Basically looks OK to me; although I think I prefer the callback_lock /
> rq->lock ordering to be the other way around.
>
> Waiman, you OK with this one?

Sorry for the late reply. I reviewed the patchset and don't see anything
wrong with it. However, my knowledge of the internal operation of the
deadline scheduler is limited.

Cheers,
Longman



* Re: [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw
  2019-02-04 11:55   ` Peter Zijlstra
@ 2019-02-05  9:18     ` Juri Lelli
  0 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2019-02-05  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 04/02/19 12:55, Peter Zijlstra wrote:
> On Thu, Jan 17, 2019 at 09:47:37AM +0100, Juri Lelli wrote:
> > @@ -2366,7 +2366,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
> >  	cpuset_filetype_t type = seq_cft(sf)->private;
> >  	int ret = 0;
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  
> >  	switch (type) {
> >  	case FILE_CPULIST:
> > @@ -2388,7 +2388,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
> >  		ret = -EINVAL;
> >  	}
> >  
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  	return ret;
> >  }
> >  
> could we not convert this to use cpuset_mutex ?

Looks like we can.

I'll do the change.


* Re: [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw
  2019-02-04 12:07     ` Peter Zijlstra
@ 2019-02-05  9:18       ` Juri Lelli
  0 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2019-02-05  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 04/02/19 13:07, Peter Zijlstra wrote:
> On Mon, Feb 04, 2019 at 01:02:36PM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 17, 2019 at 09:47:37AM +0100, Juri Lelli wrote:
> > > @@ -3233,11 +3233,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> > >  {
> > >  	unsigned long flags;
> > >  
> > > -	spin_lock_irqsave(&callback_lock, flags);
> > > +	raw_spin_lock_irqsave(&callback_lock, flags);
> > >  	rcu_read_lock();
> > >  	guarantee_online_cpus(task_cs(tsk), pmask);
> > >  	rcu_read_unlock();
> > > -	spin_unlock_irqrestore(&callback_lock, flags);
> > > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> > >  }
> > >  
> > >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > > @@ -3285,11 +3285,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> > >  	nodemask_t mask;
> > >  	unsigned long flags;
> > >  
> > > -	spin_lock_irqsave(&callback_lock, flags);
> > > +	raw_spin_lock_irqsave(&callback_lock, flags);
> > >  	rcu_read_lock();
> > >  	guarantee_online_mems(task_cs(tsk), &mask);
> > >  	rcu_read_unlock();
> > > -	spin_unlock_irqrestore(&callback_lock, flags);
> > > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> > >  
> > >  	return mask;
> > >  }
> > > @@ -3381,14 +3381,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
> > >  		return true;
> > >  
> > >  	/* Not hardwall and node outside mems_allowed: scan up cpusets */
> > > -	spin_lock_irqsave(&callback_lock, flags);
> > > +	raw_spin_lock_irqsave(&callback_lock, flags);
> > >  
> > >  	rcu_read_lock();
> > >  	cs = nearest_hardwall_ancestor(task_cs(current));
> > >  	allowed = node_isset(node, cs->mems_allowed);
> > >  	rcu_read_unlock();
> > >  
> > > -	spin_unlock_irqrestore(&callback_lock, flags);
> > > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> > >  	return allowed;
> > >  }
> > 
> > These three appear to be a user-controlled O(n) (depth of cgroup tree).
> > Which is basically bad for raw_spinlock_t.
> > 
> > The Changelog should really have mentioned this; and ideally we'd
> > somehow avoid this.
> 
> N/m avoiding it; we have this all over the place, just mention it..

OK.


* Re: [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting
  2019-02-04 18:45         ` Waiman Long
@ 2019-02-05  9:18           ` Juri Lelli
  0 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2019-02-05  9:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, mingo, rostedt, linux-kernel, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, lizefan,
	cgroups, tj

On 04/02/19 13:45, Waiman Long wrote:
> On 02/04/2019 07:18 AM, Peter Zijlstra wrote:
> > On Mon, Feb 04, 2019 at 10:02:11AM +0100, Juri Lelli wrote:
> >> On 18/01/19 17:46, Juri Lelli wrote:
> >>> On 18/01/19 08:17, Tejun Heo wrote:
> >>>> On Thu, Jan 17, 2019 at 09:47:34AM +0100, Juri Lelli wrote:
> >>>>> Hi,
> >>>>>
> >>>>> v6 of a series of patches, originally authored by Mathieu, with the intent
> >>>>> of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> >>>>> As originally reported by Steve [1], when hotplug and/or (certain)
> >>>>> cpuset reconfiguration operations take place, DEADLINE bandwidth
> >>>>> accounting information is lost since root domains are destroyed and
> >>>>> recreated.
> >>>>>
> >>>>> Mathieu's approach is based on restoring bandwidth accounting info on
> >>>>> the newly created root domains by iterating through the (DEADLINE) tasks
> >>>>> belonging to the configured cpuset(s).
> >>>>>
> >>>>> Apart from some minor refactoring needed to rebase the set on top of
> >>>>> Waiman Long's cpuset for cgroup series (now mainline), two changes worth
> >>>>> of notice:
> >>>> Generally looks good to me but can you please ask Waiman to take a
> >>>> look?
> >>> Argh! I should have cc-ed him in the first instance.
> >>>
> >>> Thanks for reviewing.
> >>>
> >>> Waiman, do you see anything wrong with this series? Thanks!
> >>>
> >>> https://lore.kernel.org/lkml/20190117084739.17078-1-juri.lelli@redhat.com/
> >> Ping?
> > Basically looks OK to me; although I think I prefer the callback_lock /
> > rq->lock ordering to be the other way around.
> >
> > Waiman, you OK with this one?
> 
> Sorry for the late reply. I reviewed the patchset and don't see anything
> wrong with it. However, my knowledge of the internal operation of the
> deadline scheduler is limited.

Thanks for reviewing!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-02-04 12:10   ` Peter Zijlstra
@ 2019-02-05  9:51     ` Juri Lelli
  2019-02-05 11:20       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Juri Lelli @ 2019-02-05  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 04/02/19 13:10, Peter Zijlstra wrote:
> On Thu, Jan 17, 2019 at 09:47:38AM +0100, Juri Lelli wrote:
> > No synchronisation mechanism exists between the cpuset subsystem and calls
> > to function __sched_setscheduler(). As such, it is possible that new root
> > domains are created on the cpuset side while a deadline acceptance test
> > is carried out in __sched_setscheduler(), leading to a potential oversell
> > of CPU bandwidth.
> > 
> > Grab callback_lock from the core scheduler so as to prevent situations
> > such as the one described above from happening.
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f5263383170e..d928a42b8852 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4224,6 +4224,13 @@ static int __sched_setscheduler(struct task_struct *p,
> >  	rq = task_rq_lock(p, &rf);
> >  	update_rq_clock(rq);
> >  
> > +	/*
> > +	 * Make sure we don't race with the cpuset subsystem where root
> > +	 * domains can be rebuilt or modified while operations like DL
> > +	 * admission checks are carried out.
> > +	 */
> > +	cpuset_read_only_lock();
> > +
> >  	/*
> >  	 * Changing the policy of the stop threads its a very bad idea:
> >  	 */
> > @@ -4285,6 +4292,7 @@ static int __sched_setscheduler(struct task_struct *p,
> >  	/* Re-check policy now with rq lock held: */
> >  	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
> >  		policy = oldpolicy = -1;
> > +		cpuset_read_only_unlock();
> >  		task_rq_unlock(rq, p, &rf);
> >  		goto recheck;
> >  	}
> > @@ -4342,6 +4350,7 @@ static int __sched_setscheduler(struct task_struct *p,
> >  
> >  	/* Avoid rq from going away on us: */
> >  	preempt_disable();
> > +	cpuset_read_only_unlock();
> >  	task_rq_unlock(rq, p, &rf);
> >  
> >  	if (pi)
> > @@ -4354,6 +4363,7 @@ static int __sched_setscheduler(struct task_struct *p,
> >  	return 0;
> >  
> >  unlock:
> > +	cpuset_read_only_unlock();
> >  	task_rq_unlock(rq, p, &rf);
> >  	return retval;
> >  }
> 
> Why take callback_lock inside rq->lock and not the other way around?
> AFAICT there is no pre-existing order so we can pick one here.

I decided to go for this order because, if we do it the other way around,
grabbing callback_lock would also have to disable irqs, no? And I
didn't want to modify task_rq_lock; or at least this approach seemed
less intrusive code-wise.
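
For readers without the patch at hand: cpuset_read_only_lock()/unlock()
boil down to thin wrappers around callback_lock (the sketch below
reconstructs the idea; see patch 4/5 for the exact code). Taken after
task_rq_lock(), interrupts are already disabled, so the plain raw variants
are enough:

/*
 * Sketch of the wrappers this patch exposes to the scheduler; holding
 * callback_lock gives read-only access to cpusets.  Under task_rq_lock()
 * irqs are already off, hence no _irqsave variant is needed here.
 */
void cpuset_read_only_lock(void)
{
	raw_spin_lock(&callback_lock);
}

void cpuset_read_only_unlock(void)
{
	raw_spin_unlock(&callback_lock);
}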

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-02-05  9:51     ` Juri Lelli
@ 2019-02-05 11:20       ` Peter Zijlstra
  2019-02-05 11:49         ` Juri Lelli
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-02-05 11:20 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Tue, Feb 05, 2019 at 10:51:43AM +0100, Juri Lelli wrote:
> On 04/02/19 13:10, Peter Zijlstra wrote:
> > On Thu, Jan 17, 2019 at 09:47:38AM +0100, Juri Lelli wrote:
> > > No synchronisation mechanism exists between the cpuset subsystem and calls
> > > to function __sched_setscheduler(). As such, it is possible that new root
> > > domains are created on the cpuset side while a deadline acceptance test
> > > is carried out in __sched_setscheduler(), leading to a potential oversell
> > > of CPU bandwidth.
> > > 
> > > Grab callback_lock from the core scheduler so as to prevent situations
> > > such as the one described above from happening.
> > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5263383170e..d928a42b8852 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4224,6 +4224,13 @@ static int __sched_setscheduler(struct task_struct *p,
> > >  	rq = task_rq_lock(p, &rf);
> > >  	update_rq_clock(rq);
> > >  
> > > +	/*
> > > +	 * Make sure we don't race with the cpuset subsystem where root
> > > +	 * domains can be rebuilt or modified while operations like DL
> > > +	 * admission checks are carried out.
> > > +	 */
> > > +	cpuset_read_only_lock();
> > > +
> > >  	/*
> > >  	 * Changing the policy of the stop threads its a very bad idea:
> > >  	 */
> > > @@ -4285,6 +4292,7 @@ static int __sched_setscheduler(struct task_struct *p,
> > >  	/* Re-check policy now with rq lock held: */
> > >  	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
> > >  		policy = oldpolicy = -1;
> > > +		cpuset_read_only_unlock();
> > >  		task_rq_unlock(rq, p, &rf);
> > >  		goto recheck;
> > >  	}
> > > @@ -4342,6 +4350,7 @@ static int __sched_setscheduler(struct task_struct *p,
> > >  
> > >  	/* Avoid rq from going away on us: */
> > >  	preempt_disable();
> > > +	cpuset_read_only_unlock();
> > >  	task_rq_unlock(rq, p, &rf);
> > >  
> > >  	if (pi)
> > > @@ -4354,6 +4363,7 @@ static int __sched_setscheduler(struct task_struct *p,
> > >  	return 0;
> > >  
> > >  unlock:
> > > +	cpuset_read_only_unlock();
> > >  	task_rq_unlock(rq, p, &rf);
> > >  	return retval;
> > >  }
> > 
> > Why take callback_lock inside rq->lock and not the other way around?
> > AFAICT there is no pre-existing order so we can pick one here.
> 
> I dediced to go for this order because if we do the other way around
> grabbing callback_lock should have to also disable irqs, no? And I
> didn't want to modify task_rq_lock; or at least this approach seemed
> less intrusive code-wide.

Ah, but this way around we add the wait-time of callback_lock to
rq_lock, which seems undesirable because rq_lock is a far hotter lock in
general, right?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-02-05 11:20       ` Peter Zijlstra
@ 2019-02-05 11:49         ` Juri Lelli
  0 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2019-02-05 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 05/02/19 12:20, Peter Zijlstra wrote:
> On Tue, Feb 05, 2019 at 10:51:43AM +0100, Juri Lelli wrote:
> > On 04/02/19 13:10, Peter Zijlstra wrote:
> > > On Thu, Jan 17, 2019 at 09:47:38AM +0100, Juri Lelli wrote:
> > > > No synchronisation mechanism exists between the cpuset subsystem and calls
> > > > to function __sched_setscheduler(). As such, it is possible that new root
> > > > domains are created on the cpuset side while a deadline acceptance test
> > > > is carried out in __sched_setscheduler(), leading to a potential oversell
> > > > of CPU bandwidth.
> > > > 
> > > > Grab callback_lock from the core scheduler so as to prevent situations
> > > > such as the one described above from happening.
> > > 
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index f5263383170e..d928a42b8852 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -4224,6 +4224,13 @@ static int __sched_setscheduler(struct task_struct *p,
> > > >  	rq = task_rq_lock(p, &rf);
> > > >  	update_rq_clock(rq);
> > > >  
> > > > +	/*
> > > > +	 * Make sure we don't race with the cpuset subsystem where root
> > > > +	 * domains can be rebuilt or modified while operations like DL
> > > > +	 * admission checks are carried out.
> > > > +	 */
> > > > +	cpuset_read_only_lock();
> > > > +
> > > >  	/*
> > > >  	 * Changing the policy of the stop threads its a very bad idea:
> > > >  	 */
> > > > @@ -4285,6 +4292,7 @@ static int __sched_setscheduler(struct task_struct *p,
> > > >  	/* Re-check policy now with rq lock held: */
> > > >  	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
> > > >  		policy = oldpolicy = -1;
> > > > +		cpuset_read_only_unlock();
> > > >  		task_rq_unlock(rq, p, &rf);
> > > >  		goto recheck;
> > > >  	}
> > > > @@ -4342,6 +4350,7 @@ static int __sched_setscheduler(struct task_struct *p,
> > > >  
> > > >  	/* Avoid rq from going away on us: */
> > > >  	preempt_disable();
> > > > +	cpuset_read_only_unlock();
> > > >  	task_rq_unlock(rq, p, &rf);
> > > >  
> > > >  	if (pi)
> > > > @@ -4354,6 +4363,7 @@ static int __sched_setscheduler(struct task_struct *p,
> > > >  	return 0;
> > > >  
> > > >  unlock:
> > > > +	cpuset_read_only_unlock();
> > > >  	task_rq_unlock(rq, p, &rf);
> > > >  	return retval;
> > > >  }
> > > 
> > > Why take callback_lock inside rq->lock and not the other way around?
> > > AFAICT there is no pre-existing order so we can pick one here.
> > 
> > I decided to go for this order because, if we do it the other way around,
> > grabbing callback_lock would also have to disable irqs, no? And I
> > didn't want to modify task_rq_lock; or at least this approach seemed
> > less intrusive code-wise.
> 
> Ah, but this way around we add the wait-time of callback_lock to
> rq_lock, which seems undesirable because rq_lock is a far hotter lock in
> general, right?

Eh, indeed. OK, I'll work on it.
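
Summarizing the trade-off discussed above with two illustrative orderings
(sketches only, not taken from any patch):

/*
 * Ordering used in v6: callback_lock nests inside rq->lock.  Simple,
 * since task_rq_lock() has already disabled irqs, but any contention
 * on callback_lock is now waited out while holding rq->lock.
 */
static void order_v6(struct task_struct *p, struct rq_flags *rf)
{
	struct rq *rq = task_rq_lock(p, rf);

	cpuset_read_only_lock();
	/* ... DL admission test against a stable root domain ... */
	cpuset_read_only_unlock();
	task_rq_unlock(rq, p, rf);
}

/*
 * Ordering suggested here: callback_lock outside rq->lock.  Keeps the
 * callback_lock wait-time off the much hotter rq->lock, but the outer
 * acquisition must now handle irqs itself (e.g. an irqsave-style
 * variant of the wrapper), which is the intrusiveness mentioned above.
 */
static void order_alt(struct task_struct *p, struct rq_flags *rf)
{
	struct rq *rq;

	cpuset_read_only_lock();	/* would need to disable irqs here */
	rq = task_rq_lock(p, rf);
	/* ... */
	task_rq_unlock(rq, p, rf);
	cpuset_read_only_unlock();
}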

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-02-05 11:49 UTC | newest]

Thread overview: 21+ messages
2019-01-17  8:47 [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2019-01-17  8:47 ` [PATCH v6 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
2019-01-17  8:47 ` [PATCH v6 2/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
2019-01-17  8:47 ` [PATCH v6 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
2019-02-04 11:55   ` Peter Zijlstra
2019-02-05  9:18     ` Juri Lelli
2019-02-04 12:02   ` Peter Zijlstra
2019-02-04 12:07     ` Peter Zijlstra
2019-02-05  9:18       ` Juri Lelli
2019-01-17  8:47 ` [PATCH v6 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
2019-02-04 12:10   ` Peter Zijlstra
2019-02-05  9:51     ` Juri Lelli
2019-02-05 11:20       ` Peter Zijlstra
2019-02-05 11:49         ` Juri Lelli
2019-01-17  8:47 ` [PATCH v6 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
2019-01-18 16:17 ` [PATCH v6 0/5] sched/deadline: fix cpusets bandwidth accounting Tejun Heo
2019-01-18 16:46   ` Juri Lelli
2019-02-04  9:02     ` Juri Lelli
2019-02-04 12:18       ` Peter Zijlstra
2019-02-04 18:45         ` Waiman Long
2019-02-05  9:18           ` Juri Lelli
