* [PATCH v7 0/7]  sched/deadline: fix cpusets bandwidth accounting
@ 2019-04-03  8:46 Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

Hi,

This is v7 of a series of patches, originally authored by Mathieu, with
the intent of fixing a long-standing issue with SCHED_DEADLINE bandwidth
accounting.
As originally reported by Steve [1], when hotplug and/or (certain)
cpuset reconfiguration operations take place, DEADLINE bandwidth
accounting information is lost since root domains are destroyed and
recreated.

Mathieu's approach is based on restoring bandwidth accounting info on
the newly created root domains by iterating through the (DEADLINE) tasks
belonging to the configured cpuset(s).
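
For illustration, here is a minimal, self-contained userspace sketch of
the idea (the struct and function names below are simplified stand-ins,
not the kernel's actual types or API; the real implementation is in
patch 5/7):

#include <stdio.h>

/* Simplified stand-in for a SCHED_DEADLINE task's reserved bandwidth. */
struct dl_task_info {
	const char *name;
	unsigned long long dl_bw;	/* reserved bandwidth, arbitrary units */
};

/* Simplified stand-in for a root domain's bandwidth accounting. */
struct rd_accounting {
	unsigned long long total_bw;
};

/*
 * After a root domain is destroyed and recreated its accounting starts
 * from zero; walk the DEADLINE tasks attached to the cpuset and add
 * their reserved bandwidth back.
 */
static void restore_rd_accounting(struct rd_accounting *rd,
				  const struct dl_task_info *tasks, int nr)
{
	int i;

	rd->total_bw = 0;
	for (i = 0; i < nr; i++)
		rd->total_bw += tasks[i].dl_bw;
}

int main(void)
{
	struct dl_task_info tasks[] = { { "dl-1", 100 }, { "dl-2", 250 } };
	struct rd_accounting rd;

	restore_rd_accounting(&rd, tasks, 2);
	printf("restored total_bw = %llu\n", rd.total_bw);
	return 0;
}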

Apart from the usual rebase on top of cgroup/for-next, changes worth
noting are:

 - commented about potential problems w.r.t. userspace operations that
   grab callback_lock (03/07 - peterz)
 - take cpuset_read_only_lock() before rq and pi locks, so as not to
   introduce a dependency between the former and the latter; on
   !CONFIG_CPUSETS cpuset_read_only_(un)lock() (dis/en)ables irqs and
   preemption (04/07 - peterz)
 - make dl_add_task_root_domain() use non-_irqsave (un)lock functions
 - also use __dl_add() to update root domain total_bw
 - 06/07 new patch; seq_show operations can be guarded by cpuset_mutex
   (peterz)
 - 07/07 new patch; deals with offline migrations (noticed the problem
   while testing)

Set also available at

 https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v7

Thanks,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966

Juri Lelli (5):
  cgroup/cpuset: make callback_lock raw
  sched/core: Prevent race condition between cpuset and
    __sched_setscheduler()
  cpuset: Rebuild root domain deadline accounting information
  cgroup/cpuset: Use cpuset_mutex to protect seq_show operations
  sched/deadline: Fix bandwidth accounting at all levels after offline
    migration

Mathieu Poirier (2):
  sched/topology: Adding function partition_sched_domains_locked()
  sched/core: Streamlining calls to task_rq_unlock()

 include/linux/cgroup.h         |   1 +
 include/linux/cpuset.h         |  14 +++
 include/linux/sched.h          |   5 +
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |  10 ++
 kernel/cgroup/cgroup.c         |   2 +-
 kernel/cgroup/cpuset.c         | 161 +++++++++++++++++++++++++--------
 kernel/sched/core.c            |  49 +++++++---
 kernel/sched/deadline.c        |  63 +++++++++++++
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  30 +++++-
 11 files changed, 286 insertions(+), 60 deletions(-)

-- 
2.17.2



* [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked()
  2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
@ 2019-04-03  8:46 ` Juri Lelli
  2019-04-05 12:04   ` Peter Zijlstra
  2019-04-03  8:46 ` [PATCH v7 2/7] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Introduce function partition_sched_domains_locked() by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.
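
The refactoring follows a common pattern: move the work into a *_locked()
helper and keep the original function as a thin wrapper that takes and
releases the lock. A minimal standalone sketch of the pattern (using
pthreads purely for illustration, not the kernel's sched_domains_mutex):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t domains_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Does the real work; caller must already hold domains_mutex. */
static void rebuild_domains_locked(int ndoms)
{
	printf("rebuilding %d domains\n", ndoms);
}

/* Original entry point, kept as a locking wrapper for existing callers. */
static void rebuild_domains(int ndoms)
{
	pthread_mutex_lock(&domains_mutex);
	rebuild_domains_locked(ndoms);
	pthread_mutex_unlock(&domains_mutex);
}

int main(void)
{
	/* A caller that already holds the mutex can reuse the helper. */
	pthread_mutex_lock(&domains_mutex);
	rebuild_domains_locked(1);
	pthread_mutex_unlock(&domains_mutex);

	/* Callers that don't hold it keep using the wrapper. */
	rebuild_domains(2);
	return 0;
}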

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 17 +++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 57c7ed3fe465..e7db602e898c 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -161,6 +161,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -213,6 +217,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ab7f371a3a17..b63ac11d0831 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2154,16 +2154,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	bool __maybe_unused has_eas = false;
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -2246,6 +2246,15 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
-- 
2.17.2



* [PATCH v7 2/7] sched/core: Streamlining calls to task_rq_unlock()
  2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2019-04-03  8:46 ` Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 3/7] cgroup/cpuset: make callback_lock raw Juri Lelli
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler().  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location needs to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ead464a0f2e5..98e835de1e7b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4274,8 +4274,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4291,8 +4291,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4305,8 +4305,8 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4321,8 +4321,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4341,8 +4341,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4398,6 +4398,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,
-- 
2.17.2



* [PATCH v7 3/7] cgroup/cpuset: make callback_lock raw
  2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 2/7] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2019-04-03  8:46 ` Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 4/7] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

callback_lock grants the holder read-only access to cpusets.  To fix a
synchronization issue between cpusets and the scheduler core, it is now
required to make callback_lock available to core scheduler code.

Convert callback_lock to a raw_spin_lock, so that it is always safe to
acquire it from atomic context.

Unfortunately, callback_lock guards some user-controlled operations
(e.g., cpuset_cpus_allowed(), cpuset_mems_allowed(), etc.), which is
usually bad for a raw_spin_lock. Ideally we'd like to somehow avoid this,
but the behavior is spread all over the place; this is a price we decide
to pay for the time being.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v6->v7: Added comment in changelog about callback_lock potential
problems w.r.t. userspace ops. [peterz]
---
 kernel/cgroup/cpuset.c | 70 +++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4834c4214e9c..ff9bd5abe613 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -333,7 +333,7 @@ static struct cpuset top_cpuset = {
  */
 
 static DEFINE_MUTEX(cpuset_mutex);
-static DEFINE_SPINLOCK(callback_lock);
+static DEFINE_RAW_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
 
@@ -1235,7 +1235,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * Newly added CPUs will be removed from effective_cpus and
 	 * newly deleted ones will be added back to effective_cpus.
 	 */
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (adding) {
 		cpumask_or(parent->subparts_cpus,
 			   parent->subparts_cpus, tmp->addmask);
@@ -1254,7 +1254,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	}
 
 	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	return cmd == partcmd_update;
 }
@@ -1359,7 +1359,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		if (cp->nr_subparts_cpus &&
@@ -1390,7 +1390,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 					= cpumask_weight(cp->subparts_cpus);
 			}
 		}
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -1508,7 +1508,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			return -EINVAL;
 	}
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 
 	/*
@@ -1519,7 +1519,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			       cs->cpus_allowed);
 		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	update_cpumasks_hier(cs, &tmp);
 
@@ -1713,9 +1713,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -1783,9 +1783,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
@@ -1876,9 +1876,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
 		rebuild_sched_domains_locked();
@@ -2381,7 +2381,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	cpuset_filetype_t type = seq_cft(sf)->private;
 	int ret = 0;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -2403,7 +2403,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		ret = -EINVAL;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	return ret;
 }
 
@@ -2713,14 +2713,14 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
 	cpuset_inc();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
 		cs->use_parent_ecpus = true;
 		parent->child_ecpus_count++;
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -2747,12 +2747,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	}
 	rcu_read_unlock();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	return 0;
@@ -2805,7 +2805,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
 	mutex_lock(&cpuset_mutex);
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
 		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -2816,7 +2816,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -2917,12 +2917,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 {
 	bool is_empty;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, new_cpus);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/*
 	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
@@ -2959,10 +2959,10 @@ hotplug_update_tasks(struct cpuset *cs,
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
 		update_tasks_cpumask(cs);
@@ -3117,7 +3117,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		/*
@@ -3137,17 +3137,17 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 			}
 		}
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		update_tasks_nodemask(&top_cpuset);
 	}
 
@@ -3248,11 +3248,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_cpus(task_cs(tsk), pmask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 }
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
@@ -3300,11 +3300,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	nodemask_t mask;
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &mask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
 }
@@ -3396,14 +3396,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 
 	rcu_read_lock();
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 	rcu_read_unlock();
 
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
 
-- 
2.17.2



* [PATCH v7 4/7] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (2 preceding siblings ...)
  2019-04-03  8:46 ` [PATCH v7 3/7] cgroup/cpuset: make callback_lock raw Juri Lelli
@ 2019-04-03  8:46 ` Juri Lelli
  2019-04-05 12:36   ` Peter Zijlstra
  2019-04-03  8:46 ` [PATCH v7 5/7] cpuset: Rebuild root domain deadline accounting information Juri Lelli
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

No synchronisation mechanism exists between the cpuset subsystem and calls
to function __sched_setscheduler(). As such, it is possible that new root
domains are created on the cpuset side while a deadline acceptance test
is carried out in __sched_setscheduler(), leading to a potential oversell
of CPU bandwidth.

Grab callback_lock from the core scheduler, so as to prevent situations
such as the one described above from happening.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v6->v7: take cpuset_read_only_lock() before rq and pi locks, so as not to
        introduce an unwanted dependency between the former and the
        latter (peterz)
---
 include/linux/cpuset.h | 14 ++++++++++++++
 kernel/cgroup/cpuset.c | 27 ++++++++++++++++++++++++++-
 kernel/sched/core.c    | 27 ++++++++++++++++++++++-----
 3 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..34c58c4dd445 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_only_lock(unsigned long *flags);
+extern void cpuset_read_only_unlock(unsigned long *flags);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,18 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_only_lock(unsigned long *flags)
+{
+	local_irq_save(*flags);
+	preempt_disable();
+}
+
+static inline void cpuset_read_only_unlock(unsigned long *flags)
+{
+	local_irq_restore(*flags);
+	preempt_enable();
+}
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ff9bd5abe613..ca5364f037a1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -318,7 +318,8 @@ static struct cpuset top_cpuset = {
  * __alloc_pages().
  *
  * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * access to cpusets. Mind that callback_lock might be grabbed from other
+ * subsystems as well (via cpuset_read_only_lock()).
  *
  * Now, the task_struct fields mems_allowed and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
@@ -3233,6 +3234,30 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_read_only_lock - Grab the callback_lock from cpuset subsystem.
+ *
+ * Description: As described in full detail in the comment above the
+ * cpuset_mutex and callback_lock definitions, holding callback_lock gives
+ * the holder read-only access to cpusets. Even though it might look
+ * counter-intuitive (as callback_lock is a spinlock), in fact a task must
+ * hold both callback_lock _and_ cpuset_mutex to modify cpusets (write access).
+ */
+void cpuset_read_only_lock(unsigned long *flags)
+	__acquires(&callback_lock)
+{
+	raw_spin_lock_irqsave(&callback_lock, *flags);
+}
+
+/**
+ * cpuset_read_only_unlock - Release the callback_lock from cpuset subsystem.
+ */
+void cpuset_read_only_unlock(unsigned long *flags)
+	__releases(&callback_lock)
+{
+	raw_spin_unlock_irqrestore(&callback_lock, *flags);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98e835de1e7b..543eb31aa243 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4260,14 +4260,25 @@ static int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
+	/*
+	 * Make sure we don't race with the cpuset subsystem where root
+	 * domains can be rebuilt or modified while operations like DL
+	 * admission checks are carried out.
+	 */
+	cpuset_read_only_lock(&rf.flags);
+
 	/*
 	 * Make sure no PI-waiters arrive (or leave) while we are
 	 * changing the priority of the task:
-	 *
+	 */
+
+	raw_spin_lock(&p->pi_lock);
+
+	/*
 	 * To be able to change p->policy safely, the appropriate
 	 * runqueue lock must be held.
 	 */
-	rq = task_rq_lock(p, &rf);
+	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
 	/*
@@ -4331,7 +4342,9 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Re-check policy now with rq lock held: */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
-		task_rq_unlock(rq, p, &rf);
+		__task_rq_unlock(rq, &rf);
+		raw_spin_unlock(&p->pi_lock);
+		cpuset_read_only_unlock(&rf.flags);
 		goto recheck;
 	}
 
@@ -4388,7 +4401,9 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	/* Avoid rq from going away on us: */
 	preempt_disable();
-	task_rq_unlock(rq, p, &rf);
+	__task_rq_unlock(rq, &rf);
+	raw_spin_unlock(&p->pi_lock);
+	cpuset_read_only_unlock(&rf.flags);
 
 	if (pi)
 		rt_mutex_adjust_pi(p);
@@ -4400,7 +4415,9 @@ static int __sched_setscheduler(struct task_struct *p,
 	return 0;
 
 unlock:
-	task_rq_unlock(rq, p, &rf);
+	__task_rq_unlock(rq, &rf);
+	raw_spin_unlock(&p->pi_lock);
+	cpuset_read_only_unlock(&rf.flags);
 	return retval;
 }
 
-- 
2.17.2



* [PATCH v7 5/7] cpuset: Rebuild root domain deadline accounting information
  2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (3 preceding siblings ...)
  2019-04-03  8:46 ` [PATCH v7 4/7] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-04-03  8:46 ` Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 6/7] cgroup/cpuset: Use cpuset_mutex to protect seq_show operations Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 7/7] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
  6 siblings, 0 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

When the topology of root domains is modified by cpuset or CPU hotplug
operations, information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recomputing the lost deadline bandwidth
information: it iterates over the deadline tasks held in cpusets and adds
their current bandwidth back to the root domain they are associated with.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v6 -> v7: make dl_add_task_root_domain() use raw_spin_(un)lock() instead
          of the _irqsave variants, as irqs are already disabled by
          task_rq_lock(). Also use __dl_add() to update root domain total_bw.
---
 include/linux/cgroup.h         |  1 +
 include/linux/sched.h          |  5 +++
 include/linux/sched/deadline.h |  8 +++++
 kernel/cgroup/cgroup.c         |  2 +-
 kernel/cgroup/cpuset.c         | 64 +++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 30 ++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 13 ++++++-
 8 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 81f58b4a5418..c73470f073a1 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -144,6 +144,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
+void cgroup_enable_task_cg_lists(void);
 void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1538..a3d4d3b1db21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -282,6 +282,11 @@ struct vtime {
 	u64			gtime;
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3f2b4bde0f9c..55ceb286882f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1839,7 +1839,7 @@ static int cgroup_reconfigure(struct fs_context *fc)
  */
 static bool use_task_css_set_links __read_mostly;
 
-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
 {
 	struct task_struct *p, *g;
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ca5364f037a1..cb5ff6ca5d2c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -45,6 +45,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -949,6 +950,67 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	lockdep_assert_held(&cpuset_mutex);
+	lockdep_assert_cpus_held();
+	lockdep_assert_held(&sched_domains_mutex);
+
+	cgroup_enable_task_cg_lists();
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -986,7 +1048,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6a73e41a2016..c8a654b133da 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2284,6 +2284,36 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+
+	raw_spin_unlock(&dl_b->lock);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efa686eeff26..35d551bc2ab6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,9 +783,6 @@ struct root_domain {
 	struct perf_domain	*pd;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b63ac11d0831..2b475198268a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2188,8 +2188,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
-			    dattrs_equal(dattr_cur, i, dattr_new, j))
+			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.17.2



* [PATCH v7 6/7] cgroup/cpuset: Use cpuset_mutex to protect seq_show operations
  2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (4 preceding siblings ...)
  2019-04-03  8:46 ` [PATCH v7 5/7] cpuset: Rebuild root domain deadline accounting information Juri Lelli
@ 2019-04-03  8:46 ` Juri Lelli
  2019-04-03  8:46 ` [PATCH v7 7/7] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
  6 siblings, 0 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

The atomicity of cpuset_common_seq_show() operations is currently guarded
by callback_lock. Since these operations are initiated by userspace,
holding a raw_spin_lock is not wise.

Convert the function to use cpuset_mutex to fix the problem.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/cgroup/cpuset.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cb5ff6ca5d2c..91ef48bbc08d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2444,7 +2444,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	cpuset_filetype_t type = seq_cft(sf)->private;
 	int ret = 0;
 
-	raw_spin_lock_irq(&callback_lock);
+	mutex_lock(&cpuset_mutex);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -2466,7 +2466,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		ret = -EINVAL;
 	}
 
-	raw_spin_unlock_irq(&callback_lock);
+	mutex_unlock(&cpuset_mutex);
 	return ret;
 }
 
-- 
2.17.2



* [PATCH v7 7/7] sched/deadline: Fix bandwidth accounting at all levels after offline migration
  2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (5 preceding siblings ...)
  2019-04-03  8:46 ` [PATCH v7 6/7] cgroup/cpuset: Use cpuset_mutex to protect seq_show operations Juri Lelli
@ 2019-04-03  8:46 ` Juri Lelli
  6 siblings, 0 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-03  8:46 UTC (permalink / raw)
  To: peterz, mingo, rostedt, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up for this_bw, running_bw and total_bw when the replenishment
timer fires and the task is migrated (dl_task_offline_migration()).

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c8a654b133da..91be79072845 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -530,6 +530,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
 static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
 {
 	struct rq *later_rq = NULL;
+	struct dl_bw *dl_b;
 
 	later_rq = find_lock_later_rq(p, rq);
 	if (!later_rq) {
@@ -558,6 +559,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 		double_lock_balance(rq, later_rq);
 	}
 
+	if (p->dl.dl_non_contending || p->dl.dl_throttled) {
+		/*
+		 * Inactive timer is armed (or callback is running, but
+		 * waiting for us to release rq locks). In any case, when it
+		 * will fire (or continue), it will see running_bw of this
+		 * task migrated to later_rq (and correctly handle it).
+		 */
+		sub_running_bw(&p->dl, &rq->dl);
+		sub_rq_bw(&p->dl, &rq->dl);
+
+		add_rq_bw(&p->dl, &later_rq->dl);
+		add_running_bw(&p->dl, &later_rq->dl);
+	} else {
+		sub_rq_bw(&p->dl, &rq->dl);
+		add_rq_bw(&p->dl, &later_rq->dl);
+	}
+
+	/*
+	 * And we finally need to fixup root_domain(s) bandwidth accounting,
+	 * since p is still hanging out in the old (now moved to default) root
+	 * domain.
+	 */
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
+	dl_b = &later_rq->rd->dl_bw;
+	raw_spin_lock(&dl_b->lock);
+	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
+	raw_spin_unlock(&dl_b->lock);
+
 	set_task_cpu(p, later_rq->cpu);
 	double_unlock_balance(later_rq, rq);
 
-- 
2.17.2



* Re: [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked()
  2019-04-03  8:46 ` [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2019-04-05 12:04   ` Peter Zijlstra
  2019-04-05 12:52     ` Juri Lelli
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-04-05 12:04 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Wed, Apr 03, 2019 at 10:46:44AM +0200, Juri Lelli wrote:
> +/*
> + * Call with hotplug lock held

Is that spelled like:

	lockdep_assert_cpus_held();

?

> + */
> +void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> +			     struct sched_domain_attr *dattr_new)
> +{
> +	mutex_lock(&sched_domains_mutex);
> +	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
>  	mutex_unlock(&sched_domains_mutex);
>  }
> -- 
> 2.17.2
> 


* Re: [PATCH v7 4/7] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-04-03  8:46 ` [PATCH v7 4/7] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2019-04-05 12:36   ` Peter Zijlstra
  2019-04-05 12:53     ` Juri Lelli
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-04-05 12:36 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Wed, Apr 03, 2019 at 10:46:47AM +0200, Juri Lelli wrote:
> +static inline void cpuset_read_only_lock(unsigned long *flags)
> +{
> +	local_irq_save(*flags);
> +	preempt_disable();
> +}
> +
> +static inline void cpuset_read_only_unlock(unsigned long *flags)
> +{
> +	local_irq_restore(*flags);
> +	preempt_enable();
> +}

You can ditch the preempt stuff. IRQs disabled already very much implies
!preemptible.


* Re: [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked()
  2019-04-05 12:04   ` Peter Zijlstra
@ 2019-04-05 12:52     ` Juri Lelli
  0 siblings, 0 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-05 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

Hi,

On 05/04/19 14:04, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 10:46:44AM +0200, Juri Lelli wrote:
> > +/*
> > + * Call with hotplug lock held
> 
> Is that spelled like:
> 
> 	lockdep_assert_cpus_held();
> 
> ?

Indeed, but I had that in previous versions and we removed it in v5
because it can lead to false positives [1,2]. IIRC, the problem has to do
with hotplug and the fact that rebuilding from hotplug runs the
callbacks on the newly onlined CPU, not in the task that acquired the lock.

Best,

- Juri

1 - https://lore.kernel.org/lkml/20180903142801.20046-1-juri.lelli@redhat.com/
2 - https://lore.kernel.org/lkml/20180613121711.5018-2-juri.lelli@redhat.com/


* Re: [PATCH v7 4/7] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2019-04-05 12:36   ` Peter Zijlstra
@ 2019-04-05 12:53     ` Juri Lelli
  0 siblings, 0 replies; 12+ messages in thread
From: Juri Lelli @ 2019-04-05 12:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tj, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

Hi,

On 05/04/19 14:36, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 10:46:47AM +0200, Juri Lelli wrote:
> > +static inline void cpuset_read_only_lock(unsigned long *flags)
> > +{
> > +	local_irq_save(*flags);
> > +	preempt_disable();
> > +}
> > +
> > +static inline void cpuset_read_only_unlock(unsigned long *flags)
> > +{
> > +	local_irq_restore(*flags);
> > +	preempt_enable();
> > +}
> 
> You can ditch the preempt stuff. IRQs disabled already very much implies
> !preemptible.

OK.

Thanks,

- Juri


Thread overview: 12+ messages
2019-04-03  8:46 [PATCH v7 0/7] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2019-04-03  8:46 ` [PATCH v7 1/7] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
2019-04-05 12:04   ` Peter Zijlstra
2019-04-05 12:52     ` Juri Lelli
2019-04-03  8:46 ` [PATCH v7 2/7] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
2019-04-03  8:46 ` [PATCH v7 3/7] cgroup/cpuset: make callback_lock raw Juri Lelli
2019-04-03  8:46 ` [PATCH v7 4/7] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
2019-04-05 12:36   ` Peter Zijlstra
2019-04-05 12:53     ` Juri Lelli
2019-04-03  8:46 ` [PATCH v7 5/7] cpuset: Rebuild root domain deadline accounting information Juri Lelli
2019-04-03  8:46 ` [PATCH v7 6/7] cgroup/cpuset: Use cpuset_mutex to protect seq_show operations Juri Lelli
2019-04-03  8:46 ` [PATCH v7 7/7] sched/deadline: Fix bandwidth accounting at all levels after offline migration Juri Lelli
