* [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting
@ 2018-09-03 14:27 Juri Lelli
  2018-09-03 14:27 ` [PATCH v5 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-03 14:27 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

Hi,

v5 of a series of patches, originally authored by Mathieu, with the intent
of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
As originally reported by Steve [1], when hotplug and/or (certain)
cpuset reconfiguration operations take place, DEADLINE bandwidth
accounting information is lost since root domains are destroyed and
recreated.
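
For reference, the accounting in question is kept in each root domain's
dl_bw (roughly):

	struct dl_bw {
		raw_spinlock_t	lock;
		u64		bw;		/* max allowed bandwidth */
		u64		total_bw;	/* currently allocated bandwidth */
	};

and total_bw simply goes back to zero whenever the root domain holding
it is freed and re-allocated.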

Mathieu's approach is based on restoring bandwidth accounting info on
the newly created root domains by iterating through the (DEADLINE) tasks
belonging to the configured cpuset(s).
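
In a nutshell, 5/5 walks each non-empty cpuset after the new root
domains have been built and re-adds the bandwidth of every DEADLINE
task it finds:

	static void update_tasks_root_domain(struct cpuset *cs)
	{
		struct css_task_iter it;
		struct task_struct *task;

		css_task_iter_start(&cs->css, 0, &it);

		while ((task = css_task_iter_next(&it)))
			/* rd->dl_bw.total_bw += p->dl.dl_bw, under task_rq_lock() */
			dl_add_task_root_domain(task);

		css_task_iter_end(&it);
	}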

Main problem of v4 was caused by the trylocking of cpuset_mutex. As
noticed by Steve [2], if multiple tasks are created at the same time
only the first gets to grab the mutex, the others get -EBUSY and need to
retry. Not really nice. So, in v5 I'm proposing to use callback_lock
instead of cpuset_mutex, which AFAIU should be enough to grant safe
read-only access to cpusets.
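
On the scheduler side (4/5) this amounts to wrapping the DL admission
path of __sched_setscheduler() with the new helpers, along the lines of:

	rq = task_rq_lock(p, &rf);
	update_rq_clock(rq);

	/* don't race with cpuset rebuilding/modifying root domains */
	cpuset_read_only_lock();	/* raw_spin_lock(&callback_lock) */

	/* ... policy and bandwidth checks ... */

	cpuset_read_only_unlock();
	task_rq_unlock(rq, p, &rf);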

01/05 has been dropped because it wasn't really adding much and was
only causing false positives.

05/05 is still too DEADLINE-specific I guess, but let's first agree
on the foundation patches.

Set also available at

 https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v5

Thanks,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966
[2] https://lore.kernel.org/lkml/20180614161142.69f186a6@gandalf.local.home/

Juri Lelli (1):
  cgroup/cpuset: make callback_lock raw

Mathieu Poirier (4):
  sched/topology: Adding function partition_sched_domains_locked()
  sched/core: Streamlining calls to task_rq_unlock()
  sched/core: Prevent race condition between cpuset and
    __sched_setscheduler()
  cpuset: Rebuild root domain deadline accounting information

 include/linux/cpuset.h         |   6 ++
 include/linux/sched.h          |   5 ++
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |  10 +++
 kernel/cgroup/cpuset.c         | 147 +++++++++++++++++++++++++--------
 kernel/sched/core.c            |  34 +++++---
 kernel/sched/deadline.c        |  31 +++++++
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  30 +++++--
 9 files changed, 222 insertions(+), 52 deletions(-)

-- 
2.17.1



* [PATCH v5 1/5] sched/topology: Adding function partition_sched_domains_locked()
  2018-09-03 14:27 [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
@ 2018-09-03 14:27 ` Juri Lelli
  2018-09-03 14:27 ` [PATCH v5 2/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-03 14:27 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Introduce partition_sched_domains_locked() by taking the mutex
locking code out of the original function.  That way the work done by
partition_sched_domains_locked() can be reused without dropping the
mutex lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/topology.c        | 17 +++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..57997caf61b6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -162,6 +162,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -206,6 +210,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 56a0fed30c0a..fb7ae691cb82 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1850,15 +1850,15 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -1923,6 +1923,15 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
 
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
-- 
2.17.1



* [PATCH v5 2/5] sched/core: Streamlining calls to task_rq_unlock()
  2018-09-03 14:27 [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2018-09-03 14:27 ` [PATCH v5 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
@ 2018-09-03 14:27 ` Juri Lelli
  2018-09-03 14:27 ` [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-03 14:27 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler().  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location needs to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/sched/core.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index deafa9fe602b..22f5622cba69 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4232,8 +4232,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
 	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
 	}
 
 	/*
@@ -4249,8 +4249,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
 	}
 change:
 
@@ -4263,8 +4263,8 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
 		}
 #endif
 #ifdef CONFIG_SMP
@@ -4279,8 +4279,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
 			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
 			}
 		}
 #endif
@@ -4299,8 +4299,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * is available.
 	 */
 	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
 	}
 
 	p->sched_reset_on_fork = reset_on_fork;
@@ -4356,6 +4356,10 @@ static int __sched_setscheduler(struct task_struct *p,
 	preempt_enable();
 
 	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	return retval;
 }
 
 static int _sched_setscheduler(struct task_struct *p, int policy,
-- 
2.17.1



* [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-09-03 14:27 [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2018-09-03 14:27 ` [PATCH v5 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
  2018-09-03 14:27 ` [PATCH v5 2/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
@ 2018-09-03 14:27 ` Juri Lelli
  2018-09-25 14:34   ` Juri Lelli
  2018-09-03 14:28 ` [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2018-09-03 14:27 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

callback_lock grants the holder read-only access to cpusets.  For fixing
a synchronization issue between cpusets and scheduler core, it is now
required to make callback_lock available to core scheduler code.

Convert callback_lock to raw_spin_lock, so that it will always be safe
to acquire it from atomic context.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/cgroup/cpuset.c | 66 +++++++++++++++++++++---------------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10cb7222..5b43f482fa0f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -288,7 +288,7 @@ static struct cpuset top_cpuset = {
  */
 
 static DEFINE_MUTEX(cpuset_mutex);
-static DEFINE_SPINLOCK(callback_lock);
+static DEFINE_RAW_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
 
@@ -922,9 +922,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cpumask_copy(cp->effective_cpus, new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -989,9 +989,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->cpus_allowed as a temp variable */
 	update_cpumasks_hier(cs, trialcs->cpus_allowed);
@@ -1175,9 +1175,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -1245,9 +1245,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
@@ -1338,9 +1338,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
 		rebuild_sched_domains_locked();
@@ -1755,7 +1755,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	cpuset_filetype_t type = seq_cft(sf)->private;
 	int ret = 0;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -1774,7 +1774,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		ret = -EINVAL;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	return ret;
 }
 
@@ -1989,12 +1989,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
 	cpuset_inc();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -2021,12 +2021,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	}
 	rcu_read_unlock();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	return 0;
@@ -2065,7 +2065,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
 	mutex_lock(&cpuset_mutex);
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
 		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -2076,7 +2076,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -2174,12 +2174,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 {
 	bool is_empty;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, new_cpus);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/*
 	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
@@ -2216,10 +2216,10 @@ hotplug_update_tasks(struct cpuset *cs,
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
 		update_tasks_cpumask(cs);
@@ -2312,21 +2312,21 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		update_tasks_nodemask(&top_cpuset);
 	}
 
@@ -2425,11 +2425,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_cpus(task_cs(tsk), pmask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 }
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
@@ -2477,11 +2477,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	nodemask_t mask;
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &mask);
 	rcu_read_unlock();
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
 }
@@ -2573,14 +2573,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 
 	rcu_read_lock();
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 	rcu_read_unlock();
 
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
 
-- 
2.17.1



* [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-09-03 14:27 [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (2 preceding siblings ...)
  2018-09-03 14:27 ` [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
@ 2018-09-03 14:28 ` Juri Lelli
  2018-10-03 19:42   ` Steven Rostedt
  2018-09-03 14:28 ` [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
  2018-09-25  8:14 ` [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  5 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2018-09-03 14:28 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups, Juri Lelli

From: Mathieu Poirier <mathieu.poirier@linaro.org>

No synchronisation mechanism exists between the cpuset subsystem and calls
to function __sched_setscheduler(). As such, it is possible that new root
domains are created on the cpuset side while a deadline acceptance test
is carried out in __sched_setscheduler(), leading to a potential oversell
of CPU bandwidth.

Grab callback_lock from the core scheduler, so as to prevent situations
such as the one described above from happening.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v4->v5: grab callback_lock instead of cpuset_mutex, as callback_lock is
enough to get read-only access to cpusets [1] and it can be easily
converted to be a raw_spinlock (done in previous - new - patch).

[1] https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L275
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
 kernel/sched/core.c    | 10 ++++++++++
 3 files changed, 34 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..8e5a8dd0622b 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_only_lock(void);
+extern void cpuset_read_only_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,10 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_only_lock(void) { }
+
+static inline void cpuset_read_only_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5b43f482fa0f..8dc26005bb1e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2410,6 +2410,24 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_read_only_lock - Grab the callback_lock from another subsysytem
+ *
+ * Description: Gives the holder read-only access to cpusets.
+ */
+void cpuset_read_only_lock(void)
+{
+	raw_spin_lock(&callback_lock);
+}
+
+/**
+ * cpuset_read_only_unlock - Release the callback_lock from another subsysytem
+ */
+void cpuset_read_only_unlock(void)
+{
+	raw_spin_unlock(&callback_lock);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 22f5622cba69..ac11ee599968 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4228,6 +4228,13 @@ static int __sched_setscheduler(struct task_struct *p,
 	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
+	/*
+	 * Make sure we don't race with the cpuset subsystem where root
+	 * domains can be rebuilt or modified while operations like DL
+	 * admission checks are carried out.
+	 */
+	cpuset_read_only_lock();
+
 	/*
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
@@ -4289,6 +4296,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Re-check policy now with rq lock held: */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
+		cpuset_read_only_unlock();
 		task_rq_unlock(rq, p, &rf);
 		goto recheck;
 	}
@@ -4346,6 +4354,7 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	/* Avoid rq from going away on us: */
 	preempt_disable();
+	cpuset_read_only_unlock();
 	task_rq_unlock(rq, p, &rf);
 
 	if (pi)
@@ -4358,6 +4367,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	return 0;
 
 unlock:
+	cpuset_read_only_unlock();
 	task_rq_unlock(rq, p, &rf);
 	return retval;
 }
-- 
2.17.1



* [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information
  2018-09-03 14:27 [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (3 preceding siblings ...)
  2018-09-03 14:28 ` [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2018-09-03 14:28 ` Juri Lelli
  2018-09-25 12:32   ` Peter Zijlstra
  2018-09-25 12:53   ` Peter Zijlstra
  2018-09-25  8:14 ` [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  5 siblings, 2 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-03 14:28 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

From: Mathieu Poirier <mathieu.poirier@linaro.org>

When the topology of root domains is modified by CPUset or CPUhotplug
operations, information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information, cycling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 include/linux/sched.h          |  5 +++
 include/linux/sched/deadline.h |  8 +++++
 kernel/cgroup/cpuset.c         | 63 +++++++++++++++++++++++++++++++++-
 kernel/sched/deadline.c        | 31 +++++++++++++++++
 kernel/sched/sched.h           |  3 --
 kernel/sched/topology.c        | 13 ++++++-
 6 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e0f4f56c9310..2bf3edc8658d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -279,6 +279,11 @@ struct vtime {
 	u64			gtime;
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8dc26005bb1e..e5d782c5b191 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -44,6 +44,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -813,6 +814,66 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	return ndoms;
 }
 
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+/*
+ * Called with cpuset_mutex held (rebuild_sched_domains())
+ * Called with hotplug lock held (rebuild_sched_domains_locked())
+ * Called with sched_domains_mutex held (partition_and_rebuild_domains())
+ */
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -845,7 +906,7 @@ static void rebuild_sched_domains_locked(void)
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 997ea7b839fa..5c5938acf89a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2285,6 +2285,37 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+void dl_add_task_root_domain(struct task_struct *p)
+{
+	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
+	struct dl_bw *dl_b;
+
+	rq = task_rq_lock(p, &rf);
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_b = &rq->rd->dl_bw;
+	raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+	dl_b->total_bw += p->dl.dl_bw;
+
+	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+	rd->dl_bw.total_bw = 0;
+	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8cae63c4..84215d464dd1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -750,9 +750,6 @@ struct root_domain {
 	unsigned long		max_cpu_capacity;
 };
 
-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index fb7ae691cb82..08128bdf3944 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1883,8 +1883,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j])
-			    && dattrs_equal(dattr_cur, i, dattr_new, j))
+			    && dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared.  It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
-- 
2.17.1



* Re: [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting
  2018-09-03 14:27 [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
                   ` (4 preceding siblings ...)
  2018-09-03 14:28 ` [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
@ 2018-09-25  8:14 ` Juri Lelli
  5 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-25  8:14 UTC (permalink / raw)
  To: peterz, mingo, rostedt
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, lizefan, cgroups

Hi,

On 03/09/18 16:27, Juri Lelli wrote:
> Hi,
> 
> v5 of a series of patches, originally authored by Mathieu, with the intent
> of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> As originally reported by Steve [1], when hotplug and/or (certain)
> cpuset reconfiguration operations take place, DEADLINE bandwidth
> accounting information is lost since root domains are destroyed and
> recreated.
> 
> Mathieu's approach is based on restoring bandwidth accounting info on
> the newly created root domains by iterating through the (DEADLINE) tasks
> belonging to the configured cpuset(s).
> 
> Main problem of v4 was caused by the trylocking of cpuset_mutex. As
> noticed by Steve [2], if multiple tasks are created at the same time
> only the first gets to grab the mutex, the others get -EBUSY and need to
> retry. Not really nice. So, in v5 I'm proposing to use callback_lock
> instead of cpuset_mutex, which AFAIU should be enough to grant safe
> read-only access to cpusets.
> 
> 01/05 has been dropped because it wasn't really adding much and was
> only causing false positives.
> 
> 05/05 is still too DEADLINE-specific I guess, but let's first agree
> on the foundation patches.
> 
> Set also available at
> 
>  https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v5

Gentle ping about this.

Best,

- Juri


* Re: [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information
  2018-09-03 14:28 ` [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
@ 2018-09-25 12:32   ` Peter Zijlstra
  2018-09-25 13:07     ` Juri Lelli
  2018-09-25 12:53   ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-25 12:32 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Mon, Sep 03, 2018 at 04:28:01PM +0200, Juri Lelli wrote:
> +/*
> + * Called with cpuset_mutex held (rebuild_sched_domains())
> + * Called with hotplug lock held (rebuild_sched_domains_locked())
> + * Called with sched_domains_mutex held (partition_and_rebuild_domains())

Isn't that what we have lockdep_assert_held() for?

> + */
> +static void rebuild_root_domains(void)
> +{
> +	struct cpuset *cs = NULL;
> +	struct cgroup_subsys_state *pos_css;
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * Clear default root domain DL accounting, it will be computed again
> +	 * if a task belongs to it.
> +	 */
> +	dl_clear_root_domain(&def_root_domain);
> +
> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> +
> +		if (cpumask_empty(cs->effective_cpus)) {
> +			pos_css = css_rightmost_descendant(pos_css);
> +			continue;
> +		}
> +
> +		css_get(&cs->css);
> +
> +		rcu_read_unlock();

That looks really dodgy, but I suppose the comment near
css_next_descendant_pre() spells out that this is in fact OK.

> +		update_tasks_root_domain(cs);
> +
> +		rcu_read_lock();
> +		css_put(&cs->css);
> +	}
> +	rcu_read_unlock();
> +}


* Re: [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information
  2018-09-03 14:28 ` [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
  2018-09-25 12:32   ` Peter Zijlstra
@ 2018-09-25 12:53   ` Peter Zijlstra
  2018-09-25 13:08     ` Juri Lelli
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-25 12:53 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Mon, Sep 03, 2018 at 04:28:01PM +0200, Juri Lelli wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index fb7ae691cb82..08128bdf3944 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1883,8 +1883,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
>  	for (i = 0; i < ndoms_cur; i++) {
>  		for (j = 0; j < n && !new_topology; j++) {
>  			if (cpumask_equal(doms_cur[i], doms_new[j])
> -			    && dattrs_equal(dattr_cur, i, dattr_new, j))
> +			    && dattrs_equal(dattr_cur, i, dattr_new, j)) {

While there, please also fix that wrongly placed operator.

> +				struct root_domain *rd;
> +
> +				/*
> +				 * This domain won't be destroyed and as such
> +				 * its dl_bw->total_bw needs to be cleared.  It
> +				 * will be recomputed in function
> +				 * update_tasks_root_domain().
> +				 */
> +				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
> +				dl_clear_root_domain(rd);
>  				goto match1;
> +			}
>  		}
>  		/* No match - a current sched domain not in new doms_new[] */
>  		detach_destroy_domains(doms_cur[i]);
> -- 
> 2.17.1
> 


* Re: [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information
  2018-09-25 12:32   ` Peter Zijlstra
@ 2018-09-25 13:07     ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-25 13:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 25/09/18 14:32, Peter Zijlstra wrote:
> On Mon, Sep 03, 2018 at 04:28:01PM +0200, Juri Lelli wrote:
> > +/*
> > + * Called with cpuset_mutex held (rebuild_sched_domains())
> > + * Called with hotplug lock held (rebuild_sched_domains_locked())
> > + * Called with sched_domains_mutex held (partition_and_rebuild_domains())
> 
> Isn't that what we have lockdep_assert_held() for?

Indeed. I can put three of them inside the function, even though we have
a single path to here atm. Guess it makes sense to protect against any
future changes.
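
Something like this at the top of rebuild_root_domains() (untested):

	lockdep_assert_held(&cpuset_mutex);
	lockdep_assert_cpus_held();
	lockdep_assert_held(&sched_domains_mutex);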

> > + */
> > +static void rebuild_root_domains(void)
> > +{
> > +	struct cpuset *cs = NULL;
> > +	struct cgroup_subsys_state *pos_css;
> > +
> > +	rcu_read_lock();
> > +
> > +	/*
> > +	 * Clear default root domain DL accounting, it will be computed again
> > +	 * if a task belongs to it.
> > +	 */
> > +	dl_clear_root_domain(&def_root_domain);
> > +
> > +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> > +
> > +		if (cpumask_empty(cs->effective_cpus)) {
> > +			pos_css = css_rightmost_descendant(pos_css);
> > +			continue;
> > +		}
> > +
> > +		css_get(&cs->css);
> > +
> > +		rcu_read_unlock();
> 
> That looks really dodgy, but I suppose the comment near
> css_next_descendant_pre() spells out that this is in fact OK.

Plus update_cpumasks_hier() seems to do something similar. Maybe I
should switch to use css_tryget_online() as well?
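
I.e., something along these lines (untested):

	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {

		if (cpumask_empty(cs->effective_cpus)) {
			pos_css = css_rightmost_descendant(pos_css);
			continue;
		}

		if (!css_tryget_online(&cs->css))
			continue;
		rcu_read_unlock();

		update_tasks_root_domain(cs);

		rcu_read_lock();
		css_put(&cs->css);
	}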

Thanks,

- Juri


* Re: [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information
  2018-09-25 12:53   ` Peter Zijlstra
@ 2018-09-25 13:08     ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-25 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 25/09/18 14:53, Peter Zijlstra wrote:
> On Mon, Sep 03, 2018 at 04:28:01PM +0200, Juri Lelli wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index fb7ae691cb82..08128bdf3944 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1883,8 +1883,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
> >  	for (i = 0; i < ndoms_cur; i++) {
> >  		for (j = 0; j < n && !new_topology; j++) {
> >  			if (cpumask_equal(doms_cur[i], doms_new[j])
> > -			    && dattrs_equal(dattr_cur, i, dattr_new, j))
> > +			    && dattrs_equal(dattr_cur, i, dattr_new, j)) {
> 
> While there, please also fix that wrongly placed operator.

Sure.
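
I.e., moving the operator to the end of the first line:

	if (cpumask_equal(doms_cur[i], doms_new[j]) &&
	    dattrs_equal(dattr_cur, i, dattr_new, j)) {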

Thanks,

- Juri


* Re: [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-09-03 14:27 ` [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
@ 2018-09-25 14:34   ` Juri Lelli
  2018-11-07  9:59     ` Juri Lelli
  2018-11-07 15:53     ` Tejun Heo
  0 siblings, 2 replies; 22+ messages in thread
From: Juri Lelli @ 2018-09-25 14:34 UTC (permalink / raw)
  To: lizefan, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, cgroups, peterz, mingo, rostedt

Hi Li Zefan and Tejun Heo,

It would be great if you could please have a look at the proposed change
below (and the rest of the set of course :-).

Another bit that I'd be more comfortable after hearing your word on it
is this one (discussed over 5/5):

https://lore.kernel.org/lkml/20180925130750.GA25664@localhost.localdomain/

Best,

- Juri

On 03/09/18 16:27, Juri Lelli wrote:
> callback_lock grants the holder read-only access to cpusets.  For fixing
> a synchronization issue between cpusets and scheduler core, it is now
> required to make callback_lock available to core scheduler code.
> 
> Convert callback_lock to raw_spin_lock, so that it will always be safe
> to acquire it from atomic context.
> 
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 66 +++++++++++++++++++++---------------------
>  1 file changed, 33 insertions(+), 33 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 266f10cb7222..5b43f482fa0f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -288,7 +288,7 @@ static struct cpuset top_cpuset = {
>   */
>  
>  static DEFINE_MUTEX(cpuset_mutex);
> -static DEFINE_SPINLOCK(callback_lock);
> +static DEFINE_RAW_SPINLOCK(callback_lock);
>  
>  static struct workqueue_struct *cpuset_migrate_mm_wq;
>  
> @@ -922,9 +922,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
>  			continue;
>  		rcu_read_unlock();
>  
> -		spin_lock_irq(&callback_lock);
> +		raw_spin_lock_irq(&callback_lock);
>  		cpumask_copy(cp->effective_cpus, new_cpus);
> -		spin_unlock_irq(&callback_lock);
> +		raw_spin_unlock_irq(&callback_lock);
>  
>  		WARN_ON(!is_in_v2_mode() &&
>  			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
> @@ -989,9 +989,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>  	if (retval < 0)
>  		return retval;
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  
>  	/* use trialcs->cpus_allowed as a temp variable */
>  	update_cpumasks_hier(cs, trialcs->cpus_allowed);
> @@ -1175,9 +1175,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
>  			continue;
>  		rcu_read_unlock();
>  
> -		spin_lock_irq(&callback_lock);
> +		raw_spin_lock_irq(&callback_lock);
>  		cp->effective_mems = *new_mems;
> -		spin_unlock_irq(&callback_lock);
> +		raw_spin_unlock_irq(&callback_lock);
>  
>  		WARN_ON(!is_in_v2_mode() &&
>  			!nodes_equal(cp->mems_allowed, cp->effective_mems));
> @@ -1245,9 +1245,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
>  	if (retval < 0)
>  		goto done;
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  	cs->mems_allowed = trialcs->mems_allowed;
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  
>  	/* use trialcs->mems_allowed as a temp variable */
>  	update_nodemasks_hier(cs, &trialcs->mems_allowed);
> @@ -1338,9 +1338,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
>  	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
>  			|| (is_spread_page(cs) != is_spread_page(trialcs)));
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  	cs->flags = trialcs->flags;
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  
>  	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
>  		rebuild_sched_domains_locked();
> @@ -1755,7 +1755,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
>  	cpuset_filetype_t type = seq_cft(sf)->private;
>  	int ret = 0;
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  
>  	switch (type) {
>  	case FILE_CPULIST:
> @@ -1774,7 +1774,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
>  		ret = -EINVAL;
>  	}
>  
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  	return ret;
>  }
>  
> @@ -1989,12 +1989,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
>  
>  	cpuset_inc();
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  	if (is_in_v2_mode()) {
>  		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
>  		cs->effective_mems = parent->effective_mems;
>  	}
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  
>  	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
>  		goto out_unlock;
> @@ -2021,12 +2021,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
>  	}
>  	rcu_read_unlock();
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  	cs->mems_allowed = parent->mems_allowed;
>  	cs->effective_mems = parent->mems_allowed;
>  	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
>  	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  out_unlock:
>  	mutex_unlock(&cpuset_mutex);
>  	return 0;
> @@ -2065,7 +2065,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
>  static void cpuset_bind(struct cgroup_subsys_state *root_css)
>  {
>  	mutex_lock(&cpuset_mutex);
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  
>  	if (is_in_v2_mode()) {
>  		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
> @@ -2076,7 +2076,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
>  		top_cpuset.mems_allowed = top_cpuset.effective_mems;
>  	}
>  
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  	mutex_unlock(&cpuset_mutex);
>  }
>  
> @@ -2174,12 +2174,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
>  {
>  	bool is_empty;
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  	cpumask_copy(cs->cpus_allowed, new_cpus);
>  	cpumask_copy(cs->effective_cpus, new_cpus);
>  	cs->mems_allowed = *new_mems;
>  	cs->effective_mems = *new_mems;
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  
>  	/*
>  	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
> @@ -2216,10 +2216,10 @@ hotplug_update_tasks(struct cpuset *cs,
>  	if (nodes_empty(*new_mems))
>  		*new_mems = parent_cs(cs)->effective_mems;
>  
> -	spin_lock_irq(&callback_lock);
> +	raw_spin_lock_irq(&callback_lock);
>  	cpumask_copy(cs->effective_cpus, new_cpus);
>  	cs->effective_mems = *new_mems;
> -	spin_unlock_irq(&callback_lock);
> +	raw_spin_unlock_irq(&callback_lock);
>  
>  	if (cpus_updated)
>  		update_tasks_cpumask(cs);
> @@ -2312,21 +2312,21 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
>  
>  	/* synchronize cpus_allowed to cpu_active_mask */
>  	if (cpus_updated) {
> -		spin_lock_irq(&callback_lock);
> +		raw_spin_lock_irq(&callback_lock);
>  		if (!on_dfl)
>  			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
>  		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
> -		spin_unlock_irq(&callback_lock);
> +		raw_spin_unlock_irq(&callback_lock);
>  		/* we don't mess with cpumasks of tasks in top_cpuset */
>  	}
>  
>  	/* synchronize mems_allowed to N_MEMORY */
>  	if (mems_updated) {
> -		spin_lock_irq(&callback_lock);
> +		raw_spin_lock_irq(&callback_lock);
>  		if (!on_dfl)
>  			top_cpuset.mems_allowed = new_mems;
>  		top_cpuset.effective_mems = new_mems;
> -		spin_unlock_irq(&callback_lock);
> +		raw_spin_unlock_irq(&callback_lock);
>  		update_tasks_nodemask(&top_cpuset);
>  	}
>  
> @@ -2425,11 +2425,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>  {
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&callback_lock, flags);
> +	raw_spin_lock_irqsave(&callback_lock, flags);
>  	rcu_read_lock();
>  	guarantee_online_cpus(task_cs(tsk), pmask);
>  	rcu_read_unlock();
> -	spin_unlock_irqrestore(&callback_lock, flags);
> +	raw_spin_unlock_irqrestore(&callback_lock, flags);
>  }
>  
>  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> @@ -2477,11 +2477,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
>  	nodemask_t mask;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&callback_lock, flags);
> +	raw_spin_lock_irqsave(&callback_lock, flags);
>  	rcu_read_lock();
>  	guarantee_online_mems(task_cs(tsk), &mask);
>  	rcu_read_unlock();
> -	spin_unlock_irqrestore(&callback_lock, flags);
> +	raw_spin_unlock_irqrestore(&callback_lock, flags);
>  
>  	return mask;
>  }
> @@ -2573,14 +2573,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
>  		return true;
>  
>  	/* Not hardwall and node outside mems_allowed: scan up cpusets */
> -	spin_lock_irqsave(&callback_lock, flags);
> +	raw_spin_lock_irqsave(&callback_lock, flags);
>  
>  	rcu_read_lock();
>  	cs = nearest_hardwall_ancestor(task_cs(current));
>  	allowed = node_isset(node, cs->mems_allowed);
>  	rcu_read_unlock();
>  
> -	spin_unlock_irqrestore(&callback_lock, flags);
> +	raw_spin_unlock_irqrestore(&callback_lock, flags);
>  	return allowed;
>  }
>  
> -- 
> 2.17.1
> 


* Re: [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-09-03 14:28 ` [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
@ 2018-10-03 19:42   ` Steven Rostedt
  2018-10-04  9:04     ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2018-10-03 19:42 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On Mon,  3 Sep 2018 16:28:00 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:


> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 5b43f482fa0f..8dc26005bb1e 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2410,6 +2410,24 @@ void __init cpuset_init_smp(void)
>  	BUG_ON(!cpuset_migrate_mm_wq);
>  }
>  
> +/**
> + * cpuset_read_only_lock - Grab the callback_lock from another subsysytem
> + *
> + * Description: Gives the holder read-only access to cpusets.
> + */
> +void cpuset_read_only_lock(void)
> +{
> +	raw_spin_lock(&callback_lock);

This was confusing to figure out why grabbing a spinlock gives read
only access. So I read the long comment above the definition of
callback_lock. A couple of notes.

1) The above description needs to go into more detail as to why
grabbing a spinlock is "read only".

2) The comment above the callback_lock needs to incorporate this, as
reading that comment alone will not give anyone an idea that this
exists.

Other than that, I don't see any issue with this patch.

-- Steve


> +}
> +
> +/**
> + * cpuset_read_only_unlock - Release the callback_lock from another subsysytem
> + */
> +void cpuset_read_only_unlock(void)
> +{
> +	raw_spin_unlock(&callback_lock);
> +}
> +
>  /**
>   * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
>   * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 22f5622cba69..ac11ee599968 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4228,6 +4228,13 @@ static int __sched_setscheduler(struct task_struct *p,
>  	rq = task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
>  
> +	/*
> +	 * Make sure we don't race with the cpuset subsystem where root
> +	 * domains can be rebuilt or modified while operations like DL
> +	 * admission checks are carried out.
> +	 */
> +	cpuset_read_only_lock();
> +
>  	/*
>  	 * Changing the policy of the stop threads its a very bad idea:
>  	 */
> @@ -4289,6 +4296,7 @@ static int __sched_setscheduler(struct task_struct *p,
>  	/* Re-check policy now with rq lock held: */
>  	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
>  		policy = oldpolicy = -1;
> +		cpuset_read_only_unlock();
>  		task_rq_unlock(rq, p, &rf);
>  		goto recheck;
>  	}
> @@ -4346,6 +4354,7 @@ static int __sched_setscheduler(struct task_struct *p,
>  
>  	/* Avoid rq from going away on us: */
>  	preempt_disable();
> +	cpuset_read_only_unlock();
>  	task_rq_unlock(rq, p, &rf);
>  
>  	if (pi)
> @@ -4358,6 +4367,7 @@ static int __sched_setscheduler(struct task_struct *p,
>  	return 0;
>  
>  unlock:
> +	cpuset_read_only_unlock();
>  	task_rq_unlock(rq, p, &rf);
>  	return retval;
>  }



* Re: [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-10-03 19:42   ` Steven Rostedt
@ 2018-10-04  9:04     ` Juri Lelli
  2018-11-08 15:49       ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2018-10-04  9:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 03/10/18 15:42, Steven Rostedt wrote:
> On Mon,  3 Sep 2018 16:28:00 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> 
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 5b43f482fa0f..8dc26005bb1e 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -2410,6 +2410,24 @@ void __init cpuset_init_smp(void)
> >  	BUG_ON(!cpuset_migrate_mm_wq);
> >  }
> >  
> > +/**
> > + * cpuset_read_only_lock - Grab the callback_lock from another subsysytem
> > + *
> > + * Description: Gives the holder read-only access to cpusets.
> > + */
> > +void cpuset_read_only_lock(void)
> > +{
> > +	raw_spin_lock(&callback_lock);
> 
> This was confusing to figure out why grabbing a spinlock gives read
> only access. So I read the long comment above the definition of
> callback_lock. A couple of notes.
> 
> 1) The above description needs to go into more detail as to why
> grabbing a spinlock is "read only".
> 
> 2) The comment above the callback_lock needs to incorporate this, as
> reading that comment alone will not give anyone an idea that this
> exists.

Right, does the updated version below look any better?

Thanks for reviewing!

Best,

- Juri

--->8---
From d704536ba80a01116007024ec055efcc4a9ee558 Mon Sep 17 00:00:00 2001
From: Mathieu Poirier <mathieu.poirier@linaro.org>
Date: Thu, 23 Aug 2018 14:52:13 +0200
Subject: [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and
 __sched_setscheduler()

No synchronisation mechanism exists between the cpuset subsystem and calls
to function __sched_setscheduler(). As such, it is possible that new root
domains are created on the cpuset side while a deadline acceptance test
is carried out in __sched_setscheduler(), leading to a potential oversell
of CPU bandwidth.

Grab callback_lock from the core scheduler, so as to prevent situations
such as the one described above from happening.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

---

v4->v5: grab callback_lock instead of cpuset_mutex, as callback_lock is
enough to get read-only access to cpusets [1] and it can be easily
converted to be a raw_spinlock (done in previous - new - patch).

[1] https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L275
---
 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 25 ++++++++++++++++++++++++-
 kernel/sched/core.c    | 10 ++++++++++
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 934633a05d20..8e5a8dd0622b 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_only_lock(void);
+extern void cpuset_read_only_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,10 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_only_lock(void) { }
+
+static inline void cpuset_read_only_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5b43f482fa0f..bff72b920624 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -273,7 +273,8 @@ static struct cpuset top_cpuset = {
  * __alloc_pages().
  *
  * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * access to cpusets. Mind that callback_lock might be grabbed from other
+ * subsystems as well (via cpuset_read_only_lock()).
  *
  * Now, the task_struct fields mems_allowed and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
@@ -2410,6 +2411,28 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_read_only_lock - Grab the callback_lock from cpuset subsystem.
+ *
+ * Description: As described in full detail in the comment above the
+ * cpuset_mutex and callback_lock definitions, holding callback_lock gives
+ * the holder read-only access to cpusets. Even though it might look
+ * counter-intuitive (as callback_lock is a spinlock), a task must in fact
+ * hold both callback_lock _and_ cpuset_mutex to modify cpusets (write access).
+ */
+void cpuset_read_only_lock(void)
+{
+	raw_spin_lock(&callback_lock);
+}
+
+/**
+ * cpuset_read_only_unlock - Release the callback_lock from cpuset subsystem.
+ */
+void cpuset_read_only_unlock(void)
+{
+	raw_spin_unlock(&callback_lock);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 22f5622cba69..ac11ee599968 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4228,6 +4228,13 @@ static int __sched_setscheduler(struct task_struct *p,
 	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
+	/*
+	 * Make sure we don't race with the cpuset subsystem where root
+	 * domains can be rebuilt or modified while operations like DL
+	 * admission checks are carried out.
+	 */
+	cpuset_read_only_lock();
+
 	/*
 	 * Changing the policy of the stop threads its a very bad idea:
 	 */
@@ -4289,6 +4296,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	/* Re-check policy now with rq lock held: */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
+		cpuset_read_only_unlock();
 		task_rq_unlock(rq, p, &rf);
 		goto recheck;
 	}
@@ -4346,6 +4354,7 @@ static int __sched_setscheduler(struct task_struct *p,
 
 	/* Avoid rq from going away on us: */
 	preempt_disable();
+	cpuset_read_only_unlock();
 	task_rq_unlock(rq, p, &rf);
 
 	if (pi)
@@ -4358,6 +4367,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	return 0;
 
 unlock:
+	cpuset_read_only_unlock();
 	task_rq_unlock(rq, p, &rf);
 	return retval;
 }
-- 
2.17.1
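
To make the locking convention described in the patch concrete, here is a
minimal sketch of the idea (illustrative only: the example_* names are
hypothetical stand-ins for cpuset_mutex, callback_lock and the DL admission
path, and are not part of the patch):

#include <linux/mutex.h>
#include <linux/spinlock.h>

static DEFINE_MUTEX(example_cpuset_mutex);          /* stands in for cpuset_mutex */
static DEFINE_RAW_SPINLOCK(example_callback_lock);  /* stands in for callback_lock */

/* Writer side: cpuset state may only change with BOTH locks held. */
static void example_cpuset_write(void)
{
	mutex_lock(&example_cpuset_mutex);
	raw_spin_lock_irq(&example_callback_lock);
	/* ... update effective_cpus, rebuild root domains, ... */
	raw_spin_unlock_irq(&example_callback_lock);
	mutex_unlock(&example_cpuset_mutex);
}

/*
 * Reader side: holding only callback_lock already guarantees a stable view,
 * because every writer must also take it for the actual update. In the patch
 * above, __sched_setscheduler() takes it (via cpuset_read_only_lock()) around
 * the DL admission test.
 */
static void example_dl_admission(void)
{
	raw_spin_lock(&example_callback_lock);
	/* ... check DL bandwidth against the (now stable) root domain ... */
	raw_spin_unlock(&example_callback_lock);
}

"Read-only" therefore means "no writer can be in example_cpuset_write()
concurrently", not that the lock itself distinguishes readers from writers.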


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-09-25 14:34   ` Juri Lelli
@ 2018-11-07  9:59     ` Juri Lelli
  2018-11-07 15:53     ` Tejun Heo
  1 sibling, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-11-07  9:59 UTC (permalink / raw)
  To: lizefan, tj
  Cc: linux-kernel, luca.abeni, claudio, tommaso.cucinotta, bristot,
	mathieu.poirier, cgroups, peterz, mingo, rostedt

Hi,

Ping.

Thanks,

- Juri

On 25/09/18 16:34, Juri Lelli wrote:
> Hi Li Zefan and Tejun Heo,
> 
> It would be great if you could please have a look at the proposed change
> below (and the rest of the set of course :-).
> 
> Another bit that I'd be more comfortable after hearing your word on it
> is this one (discussed over 5/5):
> 
> https://lore.kernel.org/lkml/20180925130750.GA25664@localhost.localdomain/
> 
> Best,
> 
> - Juri
> 
> On 03/09/18 16:27, Juri Lelli wrote:
> > callback_lock grants the holder read-only access to cpusets.  For fixing
> > a synchronization issue between cpusets and scheduler core, it is now
> > required to make callback_lock available to core scheduler code.
> > 
> > Convert callback_lock to raw_spin_lock, so that it will be always safe
> > to acquire it from atomic context.
> > 
> > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > ---
> >  kernel/cgroup/cpuset.c | 66 +++++++++++++++++++++---------------------
> >  1 file changed, 33 insertions(+), 33 deletions(-)
> > 
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 266f10cb7222..5b43f482fa0f 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -288,7 +288,7 @@ static struct cpuset top_cpuset = {
> >   */
> >  
> >  static DEFINE_MUTEX(cpuset_mutex);
> > -static DEFINE_SPINLOCK(callback_lock);
> > +static DEFINE_RAW_SPINLOCK(callback_lock);
> >  
> >  static struct workqueue_struct *cpuset_migrate_mm_wq;
> >  
> > @@ -922,9 +922,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
> >  			continue;
> >  		rcu_read_unlock();
> >  
> > -		spin_lock_irq(&callback_lock);
> > +		raw_spin_lock_irq(&callback_lock);
> >  		cpumask_copy(cp->effective_cpus, new_cpus);
> > -		spin_unlock_irq(&callback_lock);
> > +		raw_spin_unlock_irq(&callback_lock);
> >  
> >  		WARN_ON(!is_in_v2_mode() &&
> >  			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
> > @@ -989,9 +989,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
> >  	if (retval < 0)
> >  		return retval;
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  
> >  	/* use trialcs->cpus_allowed as a temp variable */
> >  	update_cpumasks_hier(cs, trialcs->cpus_allowed);
> > @@ -1175,9 +1175,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
> >  			continue;
> >  		rcu_read_unlock();
> >  
> > -		spin_lock_irq(&callback_lock);
> > +		raw_spin_lock_irq(&callback_lock);
> >  		cp->effective_mems = *new_mems;
> > -		spin_unlock_irq(&callback_lock);
> > +		raw_spin_unlock_irq(&callback_lock);
> >  
> >  		WARN_ON(!is_in_v2_mode() &&
> >  			!nodes_equal(cp->mems_allowed, cp->effective_mems));
> > @@ -1245,9 +1245,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
> >  	if (retval < 0)
> >  		goto done;
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  	cs->mems_allowed = trialcs->mems_allowed;
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  
> >  	/* use trialcs->mems_allowed as a temp variable */
> >  	update_nodemasks_hier(cs, &trialcs->mems_allowed);
> > @@ -1338,9 +1338,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
> >  	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
> >  			|| (is_spread_page(cs) != is_spread_page(trialcs)));
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  	cs->flags = trialcs->flags;
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  
> >  	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
> >  		rebuild_sched_domains_locked();
> > @@ -1755,7 +1755,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
> >  	cpuset_filetype_t type = seq_cft(sf)->private;
> >  	int ret = 0;
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  
> >  	switch (type) {
> >  	case FILE_CPULIST:
> > @@ -1774,7 +1774,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
> >  		ret = -EINVAL;
> >  	}
> >  
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  	return ret;
> >  }
> >  
> > @@ -1989,12 +1989,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
> >  
> >  	cpuset_inc();
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  	if (is_in_v2_mode()) {
> >  		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
> >  		cs->effective_mems = parent->effective_mems;
> >  	}
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  
> >  	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
> >  		goto out_unlock;
> > @@ -2021,12 +2021,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
> >  	}
> >  	rcu_read_unlock();
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  	cs->mems_allowed = parent->mems_allowed;
> >  	cs->effective_mems = parent->mems_allowed;
> >  	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
> >  	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  out_unlock:
> >  	mutex_unlock(&cpuset_mutex);
> >  	return 0;
> > @@ -2065,7 +2065,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
> >  static void cpuset_bind(struct cgroup_subsys_state *root_css)
> >  {
> >  	mutex_lock(&cpuset_mutex);
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  
> >  	if (is_in_v2_mode()) {
> >  		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
> > @@ -2076,7 +2076,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
> >  		top_cpuset.mems_allowed = top_cpuset.effective_mems;
> >  	}
> >  
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  	mutex_unlock(&cpuset_mutex);
> >  }
> >  
> > @@ -2174,12 +2174,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
> >  {
> >  	bool is_empty;
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  	cpumask_copy(cs->cpus_allowed, new_cpus);
> >  	cpumask_copy(cs->effective_cpus, new_cpus);
> >  	cs->mems_allowed = *new_mems;
> >  	cs->effective_mems = *new_mems;
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  
> >  	/*
> >  	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
> > @@ -2216,10 +2216,10 @@ hotplug_update_tasks(struct cpuset *cs,
> >  	if (nodes_empty(*new_mems))
> >  		*new_mems = parent_cs(cs)->effective_mems;
> >  
> > -	spin_lock_irq(&callback_lock);
> > +	raw_spin_lock_irq(&callback_lock);
> >  	cpumask_copy(cs->effective_cpus, new_cpus);
> >  	cs->effective_mems = *new_mems;
> > -	spin_unlock_irq(&callback_lock);
> > +	raw_spin_unlock_irq(&callback_lock);
> >  
> >  	if (cpus_updated)
> >  		update_tasks_cpumask(cs);
> > @@ -2312,21 +2312,21 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
> >  
> >  	/* synchronize cpus_allowed to cpu_active_mask */
> >  	if (cpus_updated) {
> > -		spin_lock_irq(&callback_lock);
> > +		raw_spin_lock_irq(&callback_lock);
> >  		if (!on_dfl)
> >  			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
> >  		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
> > -		spin_unlock_irq(&callback_lock);
> > +		raw_spin_unlock_irq(&callback_lock);
> >  		/* we don't mess with cpumasks of tasks in top_cpuset */
> >  	}
> >  
> >  	/* synchronize mems_allowed to N_MEMORY */
> >  	if (mems_updated) {
> > -		spin_lock_irq(&callback_lock);
> > +		raw_spin_lock_irq(&callback_lock);
> >  		if (!on_dfl)
> >  			top_cpuset.mems_allowed = new_mems;
> >  		top_cpuset.effective_mems = new_mems;
> > -		spin_unlock_irq(&callback_lock);
> > +		raw_spin_unlock_irq(&callback_lock);
> >  		update_tasks_nodemask(&top_cpuset);
> >  	}
> >  
> > @@ -2425,11 +2425,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> >  {
> >  	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&callback_lock, flags);
> > +	raw_spin_lock_irqsave(&callback_lock, flags);
> >  	rcu_read_lock();
> >  	guarantee_online_cpus(task_cs(tsk), pmask);
> >  	rcu_read_unlock();
> > -	spin_unlock_irqrestore(&callback_lock, flags);
> > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> >  }
> >  
> >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > @@ -2477,11 +2477,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> >  	nodemask_t mask;
> >  	unsigned long flags;
> >  
> > -	spin_lock_irqsave(&callback_lock, flags);
> > +	raw_spin_lock_irqsave(&callback_lock, flags);
> >  	rcu_read_lock();
> >  	guarantee_online_mems(task_cs(tsk), &mask);
> >  	rcu_read_unlock();
> > -	spin_unlock_irqrestore(&callback_lock, flags);
> > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> >  
> >  	return mask;
> >  }
> > @@ -2573,14 +2573,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
> >  		return true;
> >  
> >  	/* Not hardwall and node outside mems_allowed: scan up cpusets */
> > -	spin_lock_irqsave(&callback_lock, flags);
> > +	raw_spin_lock_irqsave(&callback_lock, flags);
> >  
> >  	rcu_read_lock();
> >  	cs = nearest_hardwall_ancestor(task_cs(current));
> >  	allowed = node_isset(node, cs->mems_allowed);
> >  	rcu_read_unlock();
> >  
> > -	spin_unlock_irqrestore(&callback_lock, flags);
> > +	raw_spin_unlock_irqrestore(&callback_lock, flags);
> >  	return allowed;
> >  }
> >  
> > -- 
> > 2.17.1
> > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-09-25 14:34   ` Juri Lelli
  2018-11-07  9:59     ` Juri Lelli
@ 2018-11-07 15:53     ` Tejun Heo
  2018-11-07 16:38       ` Juri Lelli
  1 sibling, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2018-11-07 15:53 UTC (permalink / raw)
  To: Juri Lelli
  Cc: lizefan, linux-kernel, luca.abeni, claudio, tommaso.cucinotta,
	bristot, mathieu.poirier, cgroups, peterz, mingo, rostedt

Hello,

On Tue, Sep 25, 2018 at 04:34:16PM +0200, Juri Lelli wrote:
> It would be great if you could please have a look at the proposed change
> below (and the rest of the set of course :-).

Yeah, looks good to me.  Please feel free to add

 Acked-by: Tejun Heo <tj@kernel.org>

> Another bit that I'd be more comfortable after hearing your word on it
> is this one (discussed over 5/5):
> 
> https://lore.kernel.org/lkml/20180925130750.GA25664@localhost.localdomain/

Can you please loop Waiman Long <longman@redhat.com> into discussion?
He's working on cgroup2 cpuset support which might collide.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-11-07 15:53     ` Tejun Heo
@ 2018-11-07 16:38       ` Juri Lelli
  2018-11-08 11:22         ` Juri Lelli
  2018-11-08 19:11         ` Waiman Long
  0 siblings, 2 replies; 22+ messages in thread
From: Juri Lelli @ 2018-11-07 16:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, linux-kernel, luca.abeni, claudio, tommaso.cucinotta,
	bristot, mathieu.poirier, cgroups, peterz, mingo, rostedt,
	longman

Hi,

On 07/11/18 07:53, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 25, 2018 at 04:34:16PM +0200, Juri Lelli wrote:
> > It would be great if you could please have a look at the proposed change
> > below (and the rest of the set of course :-).
> 
> Yeah, looks good to me.  Please feel free to add
> 
>  Acked-by: Tejun Heo <tj@kernel.org>

Thanks!

> > Another bit that I'd be more comfortable after hearing your word on it
> > is this one (discussed over 5/5):
> > 
> > https://lore.kernel.org/lkml/20180925130750.GA25664@localhost.localdomain/
> 
> Can you please loop Waiman Long <longman@redhat.com> into discussion?
> He's working on cgroup2 cpuset support which might collide.

Sure, I originally worked on this on top of his series, but didn't try
it with the latest version. Hopefully the two series don't generate too
many collisions. I'll try to find some time soon to check again.

In the meantime, Waiman Long, how do you feel about it? :-)

Thread starts at (if you missed it)

https://lore.kernel.org/lkml/20180903142801.20046-1-juri.lelli@redhat.com/

Best,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-11-07 16:38       ` Juri Lelli
@ 2018-11-08 11:22         ` Juri Lelli
  2018-11-08 19:11         ` Waiman Long
  1 sibling, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-11-08 11:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, linux-kernel, luca.abeni, claudio, tommaso.cucinotta,
	bristot, mathieu.poirier, cgroups, peterz, mingo, rostedt,
	longman

On 07/11/18 17:38, Juri Lelli wrote:
> Hi,
> 
> On 07/11/18 07:53, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Sep 25, 2018 at 04:34:16PM +0200, Juri Lelli wrote:
> > > It would be great if you could please have a look at the proposed change
> > > below (and the rest of the set of course :-).
> > 
> > Yeah, looks good to me.  Please feel free to add
> > 
> >  Acked-by: Tejun Heo <tj@kernel.org>
> 
> Thanks!
> 
> > > Another bit that I'd be more comfortable after hearing your word on it
> > > is this one (discussed over 5/5):
> > > 
> > > https://lore.kernel.org/lkml/20180925130750.GA25664@localhost.localdomain/
> > 
> > Can you please loop Waiman Long <longman@redhat.com> into discussion?
> > He's working on cgroup2 cpuset support which might collide.
> 
> Sure, I originally worked on this on top of his series, but didn't try
> it with the latest version. Hopefully the two series don't generate too
> many collisions. I'll try to find some time soon to check again.

So, conflicts weren't too bad (on top of v14).

I guess I'll wait for Waiman Long's patches to land before rebasing
and testing again.

Best,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-10-04  9:04     ` Juri Lelli
@ 2018-11-08 15:49       ` Waiman Long
  2018-11-08 16:23         ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2018-11-08 15:49 UTC (permalink / raw)
  To: Juri Lelli, Steven Rostedt
  Cc: peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

On 10/04/2018 05:04 AM, Juri Lelli wrote:
> On 03/10/18 15:42, Steven Rostedt wrote:
>> On Mon,  3 Sep 2018 16:28:00 +0200
>> Juri Lelli <juri.lelli@redhat.com> wrote:
>>
>>
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> index 5b43f482fa0f..8dc26005bb1e 100644
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -2410,6 +2410,24 @@ void __init cpuset_init_smp(void)
>>>  	BUG_ON(!cpuset_migrate_mm_wq);
>>>  }
>>>  
>>> +/**
>>> + * cpuset_read_only_lock - Grab the callback_lock from another subsystem
>>> + *
>>> + * Description: Gives the holder read-only access to cpusets.
>>> + */
>>> +void cpuset_read_only_lock(void)
>>> +{
>>> +	raw_spin_lock(&callback_lock);
>> It was confusing to figure out why grabbing a spinlock gives read-only
>> access. So I read the long comment above the definition of
>> callback_lock. A couple of notes.
>>
>> 1) The above description needs to go into more detail as to why
>> grabbing a spinlock is "read only".
>>
>> 2) The comment above the callback_lock needs to incorporate this, as
>> reading that comment alone will not give anyone an idea that this
>> exists.
> Right, does the updated version below look any better?
>
> Thanks for reviewing!
>
> Best,
>
> - Juri
>
> --->8---
> From d704536ba80a01116007024ec055efcc4a9ee558 Mon Sep 17 00:00:00 2001
> From: Mathieu Poirier <mathieu.poirier@linaro.org>
> Date: Thu, 23 Aug 2018 14:52:13 +0200
> Subject: [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and
>  __sched_setscheduler()
>
> No synchronisation mechanism exists between the cpuset subsystem and calls
> to function __sched_setscheduler(). As such, it is possible that new root
> domains are created on the cpuset side while a deadline acceptance test
> is carried out in __sched_setscheduler(), leading to a potential oversell
> of CPU bandwidth.
>
> Grab callback_lock from the core scheduler so as to prevent situations such
> as the one described above from happening.
>
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
>
> ---
>
> v4->v5: grab callback_lock instead of cpuset_mutex, as callback_lock is
> enough to get read-only access to cpusets [1] and it can easily be
> converted to a raw_spinlock (done in the preceding, newly added patch).
>
> [1] https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L275
> ---
>  include/linux/cpuset.h |  6 ++++++
>  kernel/cgroup/cpuset.c | 25 ++++++++++++++++++++++++-
>  kernel/sched/core.c    | 10 ++++++++++
>  3 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 934633a05d20..8e5a8dd0622b 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
>  extern void cpuset_force_rebuild(void);
>  extern void cpuset_update_active_cpus(void);
>  extern void cpuset_wait_for_hotplug(void);
> +extern void cpuset_read_only_lock(void);
> +extern void cpuset_read_only_unlock(void);
>  extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
>  extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
>  extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
> @@ -176,6 +178,10 @@ static inline void cpuset_update_active_cpus(void)
>  
>  static inline void cpuset_wait_for_hotplug(void) { }
>  
> +static inline void cpuset_read_only_lock(void) { }
> +
> +static inline void cpuset_read_only_unlock(void) { }
> +
>  static inline void cpuset_cpus_allowed(struct task_struct *p,
>  				       struct cpumask *mask)
>  {
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 5b43f482fa0f..bff72b920624 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -273,7 +273,8 @@ static struct cpuset top_cpuset = {
>   * __alloc_pages().
>   *
>   * If a task is only holding callback_lock, then it has read-only
> - * access to cpusets.
> + * access to cpusets. Mind that callback_lock might be grabbed from other
> + * subsystems as well (via cpuset_read_only_lock()).
>   *
>   * Now, the task_struct fields mems_allowed and mempolicy may be changed
>   * by other task, we use alloc_lock in the task_struct fields to protect
> @@ -2410,6 +2411,28 @@ void __init cpuset_init_smp(void)
>  	BUG_ON(!cpuset_migrate_mm_wq);
>  }
>  
> +/**
> + * cpuset_read_only_lock - Grab the callback_lock from cpuset subsystem.
> + *
> + * Description: As described in full detail in the comment above the
> + * cpuset_mutex and callback_lock definitions, holding callback_lock gives
> + * the holder read-only access to cpusets. Even though it might look
> + * counter-intuitive (as callback_lock is a spinlock), a task must in fact
> + * hold both callback_lock _and_ cpuset_mutex to modify cpusets (write access).
> + */
> +void cpuset_read_only_lock(void)
> +{
> +	raw_spin_lock(&callback_lock);
> +}
> +
> +/**
> + * cpuset_read_only_unlock - Release the callback_lock from cpuset subsystem.
> + */
> +void cpuset_read_only_unlock(void)
> +{
> +	raw_spin_unlock(&callback_lock);
> +}
> +

Maybe you can drop the "_only" part to be consistent with the rwlock
APIs (read_lock/write_lock).

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
  2018-11-08 15:49       ` Waiman Long
@ 2018-11-08 16:23         ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-11-08 16:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Steven Rostedt, peterz, mingo, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, lizefan, cgroups

Hi,

On 08/11/18 10:49, Waiman Long wrote:
> On 10/04/2018 05:04 AM, Juri Lelli wrote:

[...]

> > +/**
> > + * cpuset_read_only_lock - Grab the callback_lock from cpuset subsystem.
> > + *
> > + * Description: As described in full detail in the comment above the
> > + * cpuset_mutex and callback_lock definitions, holding callback_lock gives
> > + * the holder read-only access to cpusets. Even though it might look
> > + * counter-intuitive (as callback_lock is a spinlock), a task must in fact
> > + * hold both callback_lock _and_ cpuset_mutex to modify cpusets (write access).
> > + */
> > +void cpuset_read_only_lock(void)
> > +{
> > +	raw_spin_lock(&callback_lock);
> > +}
> > +
> > +/**
> > + * cpuset_read_only_unlock - Release the callback_lock from cpuset subsystem.
> > + */
> > +void cpuset_read_only_unlock(void)
> > +{
> > +	raw_spin_unlock(&callback_lock);
> > +}
> > +
> 
> Maybe you can drop the "_only" part to be consistent with the rwlock
> APIs (read_lock/write_lock).

I called it this way because it seemed more descriptive of the kind of
guarantee the holder gets when using these. I can change it of course;
what do others think?

Thanks for reviewing!

Best,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-11-07 16:38       ` Juri Lelli
  2018-11-08 11:22         ` Juri Lelli
@ 2018-11-08 19:11         ` Waiman Long
  2018-11-09 10:34           ` Juri Lelli
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2018-11-08 19:11 UTC (permalink / raw)
  To: Juri Lelli, Tejun Heo
  Cc: lizefan, linux-kernel, luca.abeni, claudio, tommaso.cucinotta,
	bristot, mathieu.poirier, cgroups, peterz, mingo, rostedt

On 11/07/2018 11:38 AM, Juri Lelli wrote:
> Hi,
>
> On 07/11/18 07:53, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Sep 25, 2018 at 04:34:16PM +0200, Juri Lelli wrote:
>>> It would be great if you could please have a look at the proposed change
>>> below (and the rest of the set of course :-).
>> Yeah, looks good to me.  Please feel free to add
>>
>>  Acked-by: Tejun Heo <tj@kernel.org>
> Thanks!
>
>>> Another bit that I'd be more comfortable after hearing your word on it
>>> is this one (discussed over 5/5):
>>>
>>> https://lore.kernel.org/lkml/20180925130750.GA25664@localhost.localdomain/
>> Can you please loop Waiman Long <longman@redhat.com> into discussion?
>> He's working on cgroup2 cpuset support which might collide.
> Sure, I originally worked on this on top of his series, but didn't try
> it with the latest version. Hopefully the two series don't generate too
> many collisions. I'll try to find some time soon to check again.
>
> In the meantime, Waiman Long, how do you feel about it? :-)
>
> Thread starts at (if you missed it)
>
> https://lore.kernel.org/lkml/20180903142801.20046-1-juri.lelli@redhat.com/
>
> Best,
>
> - Juri

Your patches look good to me. There will be some minor conflicts, I
think, but nothing big.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw
  2018-11-08 19:11         ` Waiman Long
@ 2018-11-09 10:34           ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-11-09 10:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, lizefan, linux-kernel, luca.abeni, claudio,
	tommaso.cucinotta, bristot, mathieu.poirier, cgroups, peterz,
	mingo, rostedt

On 08/11/18 14:11, Waiman Long wrote:
> On 11/07/2018 11:38 AM, Juri Lelli wrote:
> > Hi,
> >
> > On 07/11/18 07:53, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Tue, Sep 25, 2018 at 04:34:16PM +0200, Juri Lelli wrote:
> >>> It would be great if you could please have a look at the proposed change
> >>> below (and the rest of the set of course :-).
> >> Yeah, looks good to me.  Please feel free to add
> >>
> >>  Acked-by: Tejun Heo <tj@kernel.org>
> > Thanks!
> >
> >>> Another bit that I'd be more comfortable after hearing your word on it
> >>> is this one (discussed over 5/5):
> >>>
> >>> https://lore.kernel.org/lkml/20180925130750.GA25664@localhost.localdomain/
> >> Can you please loop Waiman Long <longman@redhat.com> into discussion?
> >> He's working on cgroup2 cpuset support which might collide.
> > Sure, I originally worked on this on top of his series, but didn't try
> > it with the latest version. Hopefully the two series don't generate too
> > many collisions. I'll try to find some time soon to check again.
> >
> > In the meantime, Waiman Long, how do you feel about it? :-)
> >
> > Thread starts at (if you missed it)
> >
> > https://lore.kernel.org/lkml/20180903142801.20046-1-juri.lelli@redhat.com/
> >
> > Best,
> >
> > - Juri
> 
> Your patches look good to me. There will be some minor conflicts, I
> think, but nothing big.

Thanks a lot for reviewing them.

Going to test and respin soon.

Best,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-11-09 10:34 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-03 14:27 [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2018-09-03 14:27 ` [PATCH v5 1/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
2018-09-03 14:27 ` [PATCH v5 2/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
2018-09-03 14:27 ` [PATCH v5 3/5] cgroup/cpuset: make callback_lock raw Juri Lelli
2018-09-25 14:34   ` Juri Lelli
2018-11-07  9:59     ` Juri Lelli
2018-11-07 15:53     ` Tejun Heo
2018-11-07 16:38       ` Juri Lelli
2018-11-08 11:22         ` Juri Lelli
2018-11-08 19:11         ` Waiman Long
2018-11-09 10:34           ` Juri Lelli
2018-09-03 14:28 ` [PATCH v5 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
2018-10-03 19:42   ` Steven Rostedt
2018-10-04  9:04     ` Juri Lelli
2018-11-08 15:49       ` Waiman Long
2018-11-08 16:23         ` Juri Lelli
2018-09-03 14:28 ` [PATCH v5 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
2018-09-25 12:32   ` Peter Zijlstra
2018-09-25 13:07     ` Juri Lelli
2018-09-25 12:53   ` Peter Zijlstra
2018-09-25 13:08     ` Juri Lelli
2018-09-25  8:14 ` [PATCH v5 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).