* [PATCH v6 0/5] sched: Persistent user requested affinity
@ 2022-08-26  1:01 Waiman Long
  2022-08-26  1:01 ` [PATCH v6 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-26  1:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

v6:
 - Drop the cpuset patch and make user_cpus_ptr masking applicable to
   all callers of set_cpus_allowed_ptr().
 - Make user_cpus_ptr and cpus_mask update in a single lock critical
   section to avoid race problems.

v5:
 - Update patch 3 to handle race with concurrent sched_setaffinity()
   by rechecking a previously cleared user_cpus_ptr afterward.

v4:
 - Update patch 1 so that sched_setaffinity() is the only function that
   updates user_cpus_ptr, making the logic simpler and easier to
   understand. restrict_cpus_allowed_ptr() and
   relax_compatible_cpus_allowed_ptr() will just use it if present.

The user_cpus_ptr field was added by commit b90ca8badbd1 ("sched:
Introduce task_struct::user_cpus_ptr to track requested affinity"),
which uses it narrowly to keep a task's cpu affinity intact on systems
with an asymmetric cpu setup.

This patchset extends user_cpus_ptr to store the cpu affinity requested
by the user via the sched_setaffinity() API. With that information
available, cpuset and other callers of set_cpus_allowed_ptr(), such as
hotplug, can keep the cpu affinity as close to what the user wants as
possible within the cpu list constraint of the current cpuset.
Otherwise a change to the cpuset hierarchy or a hotplug event may reset
the cpumask of the tasks in the affected cpusets to the default cpuset
value even if those tasks had their cpu affinity explicitly set by the
user beforehand.

It also means that once sched_setaffinity() is called, user_cpus_ptr
will remain allocated until the task exits.
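
For illustration only, here is a minimal userspace model of the intended
masking rule (this is not kernel code from the series; the helper name
and the sample masks are made up): the user requested mask is
intersected with whatever constraint a caller of set_cpus_allowed_ptr()
imposes, and the constraint alone is used when the intersection would
be empty.

/*
 * Userspace model of the masking rule described above; not kernel code.
 * "requested" stands in for p->user_cpus_ptr, "constraint" for the mask
 * passed to set_cpus_allowed_ptr() (e.g. by cpuset or hotplug).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static void effective_mask(cpu_set_t *requested, cpu_set_t *constraint,
			   cpu_set_t *out)
{
	CPU_AND(out, requested, constraint);
	if (!CPU_COUNT(out))	/* no overlap: fall back to the constraint */
		*out = *constraint;
}

int main(void)
{
	cpu_set_t req, cons, eff;

	CPU_ZERO(&req);  CPU_SET(2, &req);  CPU_SET(3, &req);  /* user wants 2-3 */
	CPU_ZERO(&cons); CPU_SET(0, &cons); CPU_SET(2, &cons); /* cpuset allows 0,2 */

	effective_mask(&req, &cons, &eff);
	printf("CPU 2 allowed: %d, CPU 0 allowed: %d\n",
	       CPU_ISSET(2, &eff), CPU_ISSET(0, &eff));	/* prints 1, 0 */
	return 0;
}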

Waiman Long (5):
  sched: Add __releases annotations to affine_move_task()
  sched: Use user_cpus_ptr for saving user provided cpumask in
    sched_setaffinity()
  sched: Enforce user requested affinity
  sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race
  sched: Fix sched_setaffinity() and fork/clone() race

 kernel/sched/core.c  | 174 +++++++++++++++++++++++++++----------------
 kernel/sched/sched.h |  11 ++-
 2 files changed, 121 insertions(+), 64 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v6 1/5] sched: Add __releases annotations to affine_move_task()
  2022-08-26  1:01 [PATCH v6 0/5] sched: Persistent user requested affinity Waiman Long
@ 2022-08-26  1:01 ` Waiman Long
  2022-08-26  1:01 ` [PATCH v6 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-26  1:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

affine_move_task() assumes that task_rq_lock() has been called and does
an implicit task_rq_unlock() before returning. Add the appropriate
__releases annotations to make this clear.

A typo in a comment is also fixed.
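
For readers unfamiliar with the annotation, here is a minimal
self-contained sketch of the pattern (the no-op fallback macro and the
unlock_and_report() function are invented for illustration; the
kernel's real definitions live in include/linux/compiler_types.h and
only take effect under sparse):

/*
 * Illustration of the __releases pattern: a function that is entered
 * with a lock held and returns with that lock dropped. The macro below
 * is a no-op for a normal compiler; sparse supplies a context-tracking
 * definition instead.
 */
#include <pthread.h>
#include <stdio.h>

#ifndef __releases
#define __releases(x)	/* checked by sparse only */
#endif

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* Caller must hold @lock; it is released before returning. */
static int unlock_and_report(void)
	__releases(lock)
{
	int val = counter;

	pthread_mutex_unlock(&lock);
	return val;
}

int main(void)
{
	pthread_mutex_lock(&lock);
	counter = 42;
	printf("%d\n", unlock_and_report());
	return 0;
}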

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..b351e6d173b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2696,6 +2696,8 @@ void release_user_cpus_ptr(struct task_struct *p)
  */
 static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flags *rf,
 			    int dest_cpu, unsigned int flags)
+	__releases(rq->lock)
+	__releases(p->pi_lock)
 {
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
 	bool stop_pending, complete = false;
@@ -3005,7 +3007,7 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 
 /*
  * Restrict the CPU affinity of task @p so that it is a subset of
- * task_cpu_possible_mask() and point @p->user_cpu_ptr to a copy of the
+ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
  * old affinity mask. If the resulting mask is empty, we warn and walk
  * up the cpuset hierarchy until we find a suitable mask.
  */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2022-08-26  1:01 [PATCH v6 0/5] sched: Persistent user requested affinity Waiman Long
  2022-08-26  1:01 ` [PATCH v6 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
@ 2022-08-26  1:01 ` Waiman Long
  2022-08-31  9:12   ` Peter Zijlstra
  2022-08-26  1:01 ` [PATCH v6 3/5] sched: Enforce user requested affinity Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2022-08-26  1:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

The user_cpus_ptr field was added by commit b90ca8badbd1 ("sched:
Introduce task_struct::user_cpus_ptr to track requested affinity"). It
is currently used only by the arm64 arch because of its possible
asymmetric CPU setup. This patch extends its usage to all arches to
save the user provided cpumask whenever sched_setaffinity() is called.
With this patch applied, user_cpus_ptr, once allocated after a
successful call to sched_setaffinity(), will only be freed when the
task exits.

Since user_cpus_ptr is supposed to hold the "requested affinity", there
is no point in saving the current cpu affinity in
restrict_cpus_allowed_ptr() if sched_setaffinity() has never been called.
Modify the logic so that user_cpus_ptr is set only in sched_setaffinity(),
while restrict_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
use it, if set, without modifying it.

This changes the behavior of arm64 systems with asymmetric CPUs in some
corner cases. For instance, if sched_setaffinity() has never been called
and a cpuset change happens before relax_compatible_cpus_allowed_ptr()
is called, the subsequent call will follow what the cpuset allows rather
than what the previous cpu affinity setting allowed.

As a call to sched_setaffinity() will no longer clear user_cpus_ptr
but set it instead, the SCA_USER flag is no longer necessary and can
be removed.
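
As a hedged userspace illustration of the requested-versus-effective
distinction this patch is about (the mask saved in user_cpus_ptr is
internal to the kernel and is not directly readable through a syscall,
so this only shows the two views a task has of its own affinity):

/*
 * Userspace sketch: request an affinity mask and read back the
 * effective one. With this series applied, the requested mask is what
 * the kernel keeps in user_cpus_ptr and tries to honour when a cpuset
 * or hotplug driven restriction is later applied or lifted.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
	cpu_set_t requested, effective;

	CPU_ZERO(&requested);
	CPU_SET(0, &requested);		/* ask to run on CPU 0 only */

	if (sched_setaffinity(0, sizeof(requested), &requested)) {
		fprintf(stderr, "sched_setaffinity: %s\n", strerror(errno));
		return 1;
	}
	if (sched_getaffinity(0, sizeof(effective), &effective)) {
		fprintf(stderr, "sched_getaffinity: %s\n", strerror(errno));
		return 1;
	}

	printf("CPUs in effective mask: %d\n", CPU_COUNT(&effective));
	return 0;
}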

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c  | 87 +++++++++++++++++++-------------------------
 kernel/sched/sched.h |  8 +++-
 2 files changed, 45 insertions(+), 50 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b351e6d173b7..ac2b103d69dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2850,7 +2850,6 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
 	const struct cpumask *cpu_valid_mask = cpu_active_mask;
 	bool kthread = p->flags & PF_KTHREAD;
-	struct cpumask *user_mask = NULL;
 	unsigned int dest_cpu;
 	int ret = 0;
 
@@ -2909,14 +2908,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, new_mask, flags);
 
-	if (flags & SCA_USER)
-		user_mask = clear_user_cpus_ptr(p);
-
-	ret = affine_move_task(rq, p, rf, dest_cpu, flags);
-
-	kfree(user_mask);
-
-	return ret;
+	return affine_move_task(rq, p, rf, dest_cpu, flags);
 
 out:
 	task_rq_unlock(rq, p, rf);
@@ -2951,8 +2943,10 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
  * Change a given task's CPU affinity to the intersection of its current
- * affinity mask and @subset_mask, writing the resulting mask to @new_mask
- * and pointing @p->user_cpus_ptr to a copy of the old mask.
+ * affinity mask and @subset_mask, writing the resulting mask to @new_mask.
+ * If user_cpus_ptr is defined, use it as the basis for restricting CPU
+ * affinity or use cpu_online_mask instead.
+ *
  * If the resulting mask is empty, leave the affinity unchanged and return
  * -EINVAL.
  */
@@ -2960,17 +2954,10 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 				     struct cpumask *new_mask,
 				     const struct cpumask *subset_mask)
 {
-	struct cpumask *user_mask = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
 	int err;
 
-	if (!p->user_cpus_ptr) {
-		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
-		if (!user_mask)
-			return -ENOMEM;
-	}
-
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -2983,25 +2970,15 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 		goto err_unlock;
 	}
 
-	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
+	if (!cpumask_and(new_mask, task_user_cpus(p), subset_mask)) {
 		err = -EINVAL;
 		goto err_unlock;
 	}
 
-	/*
-	 * We're about to butcher the task affinity, so keep track of what
-	 * the user asked for in case we're able to restore it later on.
-	 */
-	if (user_mask) {
-		cpumask_copy(user_mask, p->cpus_ptr);
-		p->user_cpus_ptr = user_mask;
-	}
-
 	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
 
 err_unlock:
 	task_rq_unlock(rq, p, &rf);
-	kfree(user_mask);
 	return err;
 }
 
@@ -3051,34 +3028,22 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
 }
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, bool save_mask);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
- * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
- * @p->user_cpus_ptr.
+ * call to force_compatible_cpus_allowed_ptr().
  *
  * It is the caller's responsibility to serialise this with any calls to
  * force_compatible_cpus_allowed_ptr(@p).
  */
 void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 {
-	struct cpumask *user_mask = p->user_cpus_ptr;
-	unsigned long flags;
-
 	/*
-	 * Try to restore the old affinity mask. If this fails, then
-	 * we free the mask explicitly to avoid it being inherited across
-	 * a subsequent fork().
+	 * Try to restore the old affinity mask with __sched_setaffinity().
+	 * Cpuset masking will be done there too.
 	 */
-	if (!user_mask || !__sched_setaffinity(p, user_mask))
-		return;
-
-	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	user_mask = clear_user_cpus_ptr(p);
-	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-
-	kfree(user_mask);
+	__sched_setaffinity(p, task_user_cpus(p), false);
 }
 
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
@@ -8081,10 +8046,11 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
 #endif
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, bool save_mask)
 {
 	int retval;
 	cpumask_var_t cpus_allowed, new_mask;
+	struct cpumask *user_mask = NULL;
 
 	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
 		return -ENOMEM;
@@ -8100,8 +8066,22 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	retval = dl_task_check_affinity(p, new_mask);
 	if (retval)
 		goto out_free_new_mask;
+
+	/*
+	 * Save the user requested mask internally now and then update
+	 * user_cpus_ptr later after making sure this call will be
+	 * successful, i.e. retval == 0.
+	 */
+	if (save_mask) {
+		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
+		if (!user_mask) {
+			retval = -ENOMEM;
+			goto out_free_new_mask;
+		}
+		cpumask_copy(user_mask, mask);
+	}
 again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
+	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
 	if (retval)
 		goto out_free_new_mask;
 
@@ -8115,7 +8095,16 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 		goto again;
 	}
 
+	if (save_mask) {
+		unsigned long flags;
+
+		/* Use pi_lock to synchronize changes to user_cpus_ptr */
+		raw_spin_lock_irqsave(&p->pi_lock, flags);
+		swap(p->user_cpus_ptr, user_mask);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	}
 out_free_new_mask:
+	kfree(user_mask);
 	free_cpumask_var(new_mask);
 out_free_cpus_allowed:
 	free_cpumask_var(cpus_allowed);
@@ -8158,7 +8147,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_put_task;
 
-	retval = __sched_setaffinity(p, in_mask);
+	retval = __sched_setaffinity(p, in_mask, true);
 out_put_task:
 	put_task_struct(p);
 	return retval;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..a49c17e1c7ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1881,6 +1881,13 @@ static inline void dirty_sched_domain_sysctl(int cpu)
 #endif
 
 extern int sched_update_scaling(void);
+
+static inline const struct cpumask *task_user_cpus(struct task_struct *p)
+{
+	if (!p->user_cpus_ptr)
+		return cpu_possible_mask; /* &init_task.cpus_mask */
+	return p->user_cpus_ptr;
+}
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
@@ -2283,7 +2290,6 @@ extern struct task_struct *pick_next_task_idle(struct rq *rq);
 #define SCA_CHECK		0x01
 #define SCA_MIGRATE_DISABLE	0x02
 #define SCA_MIGRATE_ENABLE	0x04
-#define SCA_USER		0x08
 
 #ifdef CONFIG_SMP
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 3/5] sched: Enforce user requested affinity
  2022-08-26  1:01 [PATCH v6 0/5] sched: Persistent user requested affinity Waiman Long
  2022-08-26  1:01 ` [PATCH v6 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
  2022-08-26  1:01 ` [PATCH v6 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
@ 2022-08-26  1:01 ` Waiman Long
  2022-08-31  9:14   ` Peter Zijlstra
  2022-08-31  9:18   ` Peter Zijlstra
  2022-08-26  1:01 ` [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race Waiman Long
  2022-08-26  1:01 ` [PATCH v6 5/5] sched: Fix sched_setaffinity() and fork/clone() race Waiman Long
  4 siblings, 2 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-26  1:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

It was found that the affinity requested by the user via
sched_setaffinity() can easily be overwritten by other kernel
subsystems, with no easy way to reset it back to what the user
requested. For example, any change to the current cpuset hierarchy may
reset the cpumask of the tasks in the affected cpusets to the default
cpuset value even if those tasks have pre-existing user requested
affinity. That is especially easy to trigger under a cgroup v2
environment, where writing "+cpuset" to the root cgroup's
cgroup.subtree_control file resets the cpu affinity of all the
processes in the system.

That is problematic in a nohz_full environment, where the tasks running
on the nohz_full CPUs usually have their cpu affinity explicitly set
and will behave incorrectly if that affinity changes.

Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
and using it to restrict the given cpumask, unless there is no overlap,
in which case it falls back to the given one.

All callers of set_cpus_allowed_ptr() will be affected by this change.
A scratch cpumask is added to the per-cpu runqueue structure for doing
the additional masking when user_cpus_ptr is set. The scratch cpumask
is allocated during cpu activation. A fallback atomic memory allocation
in __set_cpus_allowed_ptr() is also added in case set_cpus_allowed_ptr()
is called before the scratch cpumask has been allocated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c  | 36 +++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  3 +++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ac2b103d69dc..1c2f548e5369 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2928,11 +2928,40 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 static int __set_cpus_allowed_ptr(struct task_struct *p,
 				  const struct cpumask *new_mask, u32 flags)
 {
+	struct cpumask *alloc_mask = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
+	int ret;
 
 	rq = task_rq_lock(p, &rf);
-	return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
+	if (p->user_cpus_ptr) {
+
+		/*
+		 * A scratch cpumask is allocated on the percpu runqueues
+		 * to enable additional masking with user_cpus_ptr. This
+		 * cpumask, once allocated, will not be freed.
+		 */
+		if (unlikely(!rq->scratch_mask)) {
+			alloc_mask = kmalloc(cpumask_size(), GFP_ATOMIC);
+			if (!rq->scratch_mask && alloc_mask) {
+				rq->scratch_mask = alloc_mask;
+				alloc_mask = NULL;
+			}
+		}
+		/*
+		 * Ignore user_cpus_ptr if atomic memory allocation fails
+		 * or it doesn't intersect new_mask.
+		 */
+		if (rq->scratch_mask &&
+		    cpumask_and(rq->scratch_mask, new_mask, p->user_cpus_ptr))
+			new_mask = rq->scratch_mask;
+	}
+
+
+	ret = __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
+	if (unlikely(alloc_mask))
+		kfree(alloc_mask);
+	return ret;
 }
 
 int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
@@ -9352,6 +9381,11 @@ int sched_cpu_activate(unsigned int cpu)
 		sched_update_numa(cpu, true);
 		sched_domains_numa_masks_set(cpu);
 		cpuset_cpu_active();
+		/*
+		 * Preallocated scratch cpumask
+		 */
+		if (!rq->scratch_mask)
+			rq->scratch_mask = kmalloc(cpumask_size(), GFP_KERNEL);
 	}
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a49c17e1c7ea..66a6bfddd716 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1159,6 +1159,9 @@ struct rq {
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
 #endif
+
+	/* Scratch cpumask to be temporarily used under rq_lock */
+	struct cpumask		*scratch_mask;
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race
  2022-08-26  1:01 [PATCH v6 0/5] sched: Persistent user requested affinity Waiman Long
                   ` (2 preceding siblings ...)
  2022-08-26  1:01 ` [PATCH v6 3/5] sched: Enforce user requested affinity Waiman Long
@ 2022-08-26  1:01 ` Waiman Long
  2022-08-31  9:26   ` Peter Zijlstra
  2022-08-31  9:47   ` Peter Zijlstra
  2022-08-26  1:01 ` [PATCH v6 5/5] sched: Fix sched_setaffinity() and fork/clone() race Waiman Long
  4 siblings, 2 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-26  1:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

Racing is possible between set_cpus_allowed_ptr() and sched_setaffinity()
or between multiple sched_setaffinity() calls from different CPUs. To
resolve these race conditions, we need to update both user_cpus_ptr
and cpus_mask in a single lock critical section instead of separate
ones. This requires moving the user_cpus_ptr update to
affine_move_task() before doing task_rq_unlock().

A new argument puser_mask is added to affine_move_task(),
__set_cpus_allowed_ptr_locked() and __set_cpus_allowed_ptr() to do that.

Ideally, user_cpus_ptr should only be updated if sched_setaffinity()
as a whole is successful. This patch, however, updates user_cpus_ptr
as soon as the first call to __set_cpus_allowed_ptr() succeeds. If
sched_setaffinity() then races with a cpuset update, a subsequent call
to __set_cpus_allowed_ptr() may fail while user_cpus_ptr has already
been updated in this corner case.
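
A hedged userspace analogue of the pattern (a pthread mutex stands in
for pi_lock/rq lock and every name below is invented): the new mask
pointer is published inside the same critical section that updates the
rest of the affinity state, and the displaced buffer is freed only
after the lock is dropped.

/*
 * Userspace analogue of updating user_cpus_ptr and cpus_mask in a
 * single critical section: swap the pointer while holding the lock,
 * free the old buffer after dropping it. All names are illustrative.
 */
#include <pthread.h>
#include <stdlib.h>

struct task_like {
	pthread_mutex_t lock;		/* stands in for pi_lock/rq lock */
	unsigned long cpus_mask;	/* stands in for p->cpus_mask */
	unsigned long *user_mask;	/* stands in for p->user_cpus_ptr */
};

/* Update both fields atomically with respect to other lock holders. */
static int set_affinity(struct task_like *t, unsigned long new_mask)
{
	unsigned long *new_user = malloc(sizeof(*new_user));
	unsigned long *old_user;

	if (!new_user)
		return -1;
	*new_user = new_mask;

	pthread_mutex_lock(&t->lock);
	t->cpus_mask = new_mask;	/* effective mask ... */
	old_user = t->user_mask;	/* ... and requested mask change together */
	t->user_mask = new_user;
	pthread_mutex_unlock(&t->lock);

	free(old_user);			/* never free while holding the lock */
	return 0;
}

int main(void)
{
	struct task_like t = { .lock = PTHREAD_MUTEX_INITIALIZER };

	set_affinity(&t, 0x3);		/* e.g. CPUs 0-1 */
	free(t.user_mask);
	return 0;
}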

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 66 ++++++++++++++++++++++++++++-----------------
 1 file changed, 42 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1c2f548e5369..6cd1177fbcea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2199,7 +2199,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 
 static int __set_cpus_allowed_ptr(struct task_struct *p,
 				  const struct cpumask *new_mask,
-				  u32 flags);
+				  u32 flags, struct cpumask **puser_mask);
 
 static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
 {
@@ -2249,7 +2249,7 @@ void migrate_enable(void)
 	 */
 	preempt_disable();
 	if (p->cpus_ptr != &p->cpus_mask)
-		__set_cpus_allowed_ptr(p, &p->cpus_mask, SCA_MIGRATE_ENABLE);
+		__set_cpus_allowed_ptr(p, &p->cpus_mask, SCA_MIGRATE_ENABLE, NULL);
 	/*
 	 * Mustn't clear migration_disabled() until cpus_ptr points back at the
 	 * regular cpus_mask, otherwise things that race (eg.
@@ -2618,6 +2618,15 @@ void release_user_cpus_ptr(struct task_struct *p)
 	kfree(clear_user_cpus_ptr(p));
 }
 
+static inline void swap_user_cpus_ptr(struct task_struct *p,
+				      struct cpumask **puser_mask)
+{
+	if (!puser_mask)
+		return;
+
+	swap(p->user_cpus_ptr, *puser_mask);
+}
+
 /*
  * This function is wildly self concurrent; here be dragons.
  *
@@ -2693,9 +2702,12 @@ void release_user_cpus_ptr(struct task_struct *p)
  * Note that the above is safe vs a concurrent migrate_enable(), as any
  * pending affinity completion is preceded by an uninstallation of
  * p->migration_pending done with p->pi_lock held.
+ *
+ * The puser_mask pointer, if defined, will cause its swap with the current
+ * user_cpus_ptr value if operation succeeds.
  */
 static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flags *rf,
-			    int dest_cpu, unsigned int flags)
+			    int dest_cpu, unsigned int flags, struct cpumask **puser_mask)
 	__releases(rq->lock)
 	__releases(p->pi_lock)
 {
@@ -2722,6 +2734,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 			complete = true;
 		}
 
+		swap_user_cpus_ptr(p, puser_mask);
 		task_rq_unlock(rq, p, rf);
 
 		if (push_task) {
@@ -2793,6 +2806,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		if (flags & SCA_MIGRATE_ENABLE)
 			p->migration_flags &= ~MDF_PUSH;
 
+		swap_user_cpus_ptr(p, puser_mask);
 		task_rq_unlock(rq, p, rf);
 
 		if (!stop_pending) {
@@ -2813,6 +2827,8 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 				complete = true;
 			}
 		}
+
+		swap_user_cpus_ptr(p, puser_mask);
 		task_rq_unlock(rq, p, rf);
 
 		if (complete)
@@ -2843,7 +2859,8 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 					 const struct cpumask *new_mask,
 					 u32 flags,
 					 struct rq *rq,
-					 struct rq_flags *rf)
+					 struct rq_flags *rf,
+					 struct cpumask **puser_mask)
 	__releases(rq->lock)
 	__releases(p->pi_lock)
 {
@@ -2908,7 +2925,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, new_mask, flags);
 
-	return affine_move_task(rq, p, rf, dest_cpu, flags);
+	return affine_move_task(rq, p, rf, dest_cpu, flags, puser_mask);
 
 out:
 	task_rq_unlock(rq, p, rf);
@@ -2926,7 +2943,8 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
  * call is not atomic; no spinlocks may be held.
  */
 static int __set_cpus_allowed_ptr(struct task_struct *p,
-				  const struct cpumask *new_mask, u32 flags)
+				  const struct cpumask *new_mask, u32 flags,
+				  struct cpumask **puser_mask)
 {
 	struct cpumask *alloc_mask = NULL;
 	struct rq_flags rf;
@@ -2934,8 +2952,11 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	int ret;
 
 	rq = task_rq_lock(p, &rf);
-	if (p->user_cpus_ptr) {
 
+	/*
+	 * user_cpus_ptr masking is skipped if puser_mask is defined.
+	 */
+	if (p->user_cpus_ptr && !puser_mask) {
 		/*
 		 * A scratch cpumask is allocated on the percpu runqueues
 		 * to enable additional masking with user_cpus_ptr. This
@@ -2958,7 +2979,8 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	}
 
 
-	ret = __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
+	ret = __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf,
+					    puser_mask);
 	if (unlikely(alloc_mask))
 		kfree(alloc_mask);
 	return ret;
@@ -2966,7 +2988,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 
 int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 {
-	return __set_cpus_allowed_ptr(p, new_mask, 0);
+	return __set_cpus_allowed_ptr(p, new_mask, 0, NULL);
 }
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
@@ -3004,7 +3026,7 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 		goto err_unlock;
 	}
 
-	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
+	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf, NULL);
 
 err_unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -3551,7 +3573,7 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
 					 const struct cpumask *new_mask,
-					 u32 flags)
+					 u32 flags, struct cpumask **puser_mask)
 {
 	return set_cpus_allowed_ptr(p, new_mask);
 }
@@ -8109,29 +8131,25 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask, bool save
 		}
 		cpumask_copy(user_mask, mask);
 	}
-again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+
+	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK,
+					user_mask ? &user_mask : NULL);
 	if (retval)
 		goto out_free_new_mask;
 
-	cpuset_cpus_allowed(p, cpus_allowed);
-	if (!cpumask_subset(new_mask, cpus_allowed)) {
+	for (;;) {
+		cpuset_cpus_allowed(p, cpus_allowed);
+		if (cpumask_subset(new_mask, cpus_allowed))
+			break;
+
 		/*
 		 * We must have raced with a concurrent cpuset update.
 		 * Just reset the cpumask to the cpuset's cpus_allowed.
 		 */
 		cpumask_copy(new_mask, cpus_allowed);
-		goto again;
+		retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK, NULL);
 	}
 
-	if (save_mask) {
-		unsigned long flags;
-
-		/* Use pi_lock to synchronize changes to user_cpus_ptr */
-		raw_spin_lock_irqsave(&p->pi_lock, flags);
-		swap(p->user_cpus_ptr, user_mask);
-		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-	}
 out_free_new_mask:
 	kfree(user_mask);
 	free_cpumask_var(new_mask);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 5/5] sched: Fix sched_setaffinity() and fork/clone() race
  2022-08-26  1:01 [PATCH v6 0/5] sched: Persistent user requested affinity Waiman Long
                   ` (3 preceding siblings ...)
  2022-08-26  1:01 ` [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race Waiman Long
@ 2022-08-26  1:01 ` Waiman Long
  4 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-26  1:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

sched_setaffinity() can also race with a concurrent fork/clone() syscall
calling dup_user_cpus_ptr(), which may lead to a use-after-free problem.
Fix that by protecting the cpumask copy with the pi_lock of the source
task.
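
A hedged userspace analogue of the fix (all names invented, mirroring
the sketch in the previous patch description): the duplicating side
copies the buffer under the source's lock, so a concurrent affinity
update that swaps and frees the old buffer cannot pull it out from
under the copy. The non-NULL check outside the lock relies on the same
invariant as the kernel code: once set, the pointer is only replaced,
never cleared, until exit.

/*
 * Userspace analogue of dup_user_cpus_ptr(): copy under the source's
 * lock so a concurrent swap-and-free of the old buffer cannot race
 * with the copy. struct task_like is a trimmed copy of the earlier
 * sketch; all names are illustrative.
 */
#include <pthread.h>
#include <stdlib.h>

struct task_like {
	pthread_mutex_t lock;		/* stands in for src->pi_lock */
	unsigned long *user_mask;	/* stands in for user_cpus_ptr */
};

static int dup_user_mask(struct task_like *dst, struct task_like *src)
{
	unsigned long *copy;

	if (!src->user_mask)		/* never set: nothing to inherit */
		return 0;

	copy = malloc(sizeof(*copy));
	if (!copy)
		return -1;

	pthread_mutex_lock(&src->lock);	/* keep the source buffer stable */
	*copy = *src->user_mask;
	pthread_mutex_unlock(&src->lock);

	dst->user_mask = copy;
	return 0;
}

int main(void)
{
	struct task_like parent = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct task_like child  = { .lock = PTHREAD_MUTEX_INITIALIZER };
	unsigned long mask = 0x3;

	parent.user_mask = &mask;	/* pretend an affinity request ran */
	if (dup_user_mask(&child, &parent))
		return 1;
	free(child.user_mask);
	return 0;
}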

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6cd1177fbcea..71a3fdcb6b6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2593,6 +2593,8 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
 		      int node)
 {
+	unsigned long flags;
+
 	if (!src->user_cpus_ptr)
 		return 0;
 
@@ -2600,7 +2602,10 @@ int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
 	if (!dst->user_cpus_ptr)
 		return -ENOMEM;
 
+	/* Use pi_lock to protect content of user_cpus_ptr */
+	raw_spin_lock_irqsave(&src->pi_lock, flags);
 	cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+	raw_spin_unlock_irqrestore(&src->pi_lock, flags);
 	return 0;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2022-08-26  1:01 ` [PATCH v6 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
@ 2022-08-31  9:12   ` Peter Zijlstra
  2022-08-31 20:46     ` Waiman Long
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2022-08-31  9:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Thu, Aug 25, 2022 at 09:01:16PM -0400, Waiman Long wrote:


>  void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>  {
> -	struct cpumask *user_mask = p->user_cpus_ptr;
> -	unsigned long flags;
> -
>  	/*
> -	 * Try to restore the old affinity mask. If this fails, then
> -	 * we free the mask explicitly to avoid it being inherited across
> -	 * a subsequent fork().
> +	 * Try to restore the old affinity mask with __sched_setaffinity().
> +	 * Cpuset masking will be done there too.
>  	 */
> -	if (!user_mask || !__sched_setaffinity(p, user_mask))
> -		return;
> -
> -	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	user_mask = clear_user_cpus_ptr(p);
> -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> -
> -	kfree(user_mask);
> +	__sched_setaffinity(p, task_user_cpus(p), false);
>  }

We have an issue with __sched_setaffinity() failing here. I'm not sure
ignoring the failure is the right thing -- but I'm also not entirely
sure what is.

>  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> @@ -8081,10 +8046,11 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
>  #endif
>  
>  static int
> -__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
> +__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, bool save_mask)
>  {
>  	int retval;
>  	cpumask_var_t cpus_allowed, new_mask;
> +	struct cpumask *user_mask = NULL;
>  
>  	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
>  		return -ENOMEM;
> @@ -8100,8 +8066,22 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
>  	retval = dl_task_check_affinity(p, new_mask);
>  	if (retval)
>  		goto out_free_new_mask;
> +
> +	/*
> +	 * Save the user requested mask internally now and then update
> +	 * user_cpus_ptr later after making sure this call will be
> +	 * successful, i.e. retval == 0.
> +	 */
> +	if (save_mask) {
> +		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
> +		if (!user_mask) {
> +			retval = -ENOMEM;
> +			goto out_free_new_mask;
> +		}
> +		cpumask_copy(user_mask, mask);
> +	}
>  again:
> -	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
> +	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
>  	if (retval)
>  		goto out_free_new_mask;
>  
> @@ -8115,7 +8095,16 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
>  		goto again;
>  	}
>  
> +	if (save_mask) {
> +		unsigned long flags;
> +
> +		/* Use pi_lock to synchronize changes to user_cpus_ptr */
> +		raw_spin_lock_irqsave(&p->pi_lock, flags);
> +		swap(p->user_cpus_ptr, user_mask);
> +		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +	}
>  out_free_new_mask:
> +	kfree(user_mask);
>  	free_cpumask_var(new_mask);
>  out_free_cpus_allowed:
>  	free_cpumask_var(cpus_allowed);

I'm confused as to why it's put in this function and not in the one
caller that actually sets the new @save_mask true, here:

> @@ -8158,7 +8147,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
>  	if (retval)
>  		goto out_put_task;
>  
> -	retval = __sched_setaffinity(p, in_mask);
> +	retval = __sched_setaffinity(p, in_mask, true);
>  out_put_task:
>  	put_task_struct(p);
>  	return retval;



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 3/5] sched: Enforce user requested affinity
  2022-08-26  1:01 ` [PATCH v6 3/5] sched: Enforce user requested affinity Waiman Long
@ 2022-08-31  9:14   ` Peter Zijlstra
  2022-08-31  9:18   ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2022-08-31  9:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Thu, Aug 25, 2022 at 09:01:17PM -0400, Waiman Long wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ac2b103d69dc..1c2f548e5369 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2928,11 +2928,40 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
>  static int __set_cpus_allowed_ptr(struct task_struct *p,
>  				  const struct cpumask *new_mask, u32 flags)
>  {
> +	struct cpumask *alloc_mask = NULL;
>  	struct rq_flags rf;
>  	struct rq *rq;
> +	int ret;
>  
>  	rq = task_rq_lock(p, &rf);
> -	return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
> +	if (p->user_cpus_ptr) {
> +
> +		/*
> +		 * A scratch cpumask is allocated on the percpu runqueues
> +		 * to enable additional masking with user_cpus_ptr. This
> +		 * cpumask, once allocated, will not be freed.
> +		 */
> +		if (unlikely(!rq->scratch_mask)) {
> +			alloc_mask = kmalloc(cpumask_size(), GFP_ATOMIC);

This -- absolutely not. You can't have allocations under a
raw_spinlock_t.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 3/5] sched: Enforce user requested affinity
  2022-08-26  1:01 ` [PATCH v6 3/5] sched: Enforce user requested affinity Waiman Long
  2022-08-31  9:14   ` Peter Zijlstra
@ 2022-08-31  9:18   ` Peter Zijlstra
  2022-08-31  9:21     ` Peter Zijlstra
  2022-08-31 20:48     ` Waiman Long
  1 sibling, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2022-08-31  9:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Thu, Aug 25, 2022 at 09:01:17PM -0400, Waiman Long wrote:

> @@ -9352,6 +9381,11 @@ int sched_cpu_activate(unsigned int cpu)
>  		sched_update_numa(cpu, true);
>  		sched_domains_numa_masks_set(cpu);
>  		cpuset_cpu_active();
> +		/*
> +		 * Preallocated scratch cpumask
> +		 */
> +		if (!rq->scratch_mask)
> +			rq->scratch_mask = kmalloc(cpumask_size(), GFP_KERNEL);
>  	}

this is too late; I think you'll have to add a sched_cpu_prepare() and
simply fail the cpu-up when the allocation fails.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 3/5] sched: Enforce user requested affinity
  2022-08-31  9:18   ` Peter Zijlstra
@ 2022-08-31  9:21     ` Peter Zijlstra
  2022-08-31 21:00       ` Waiman Long
  2022-08-31 20:48     ` Waiman Long
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2022-08-31  9:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Wed, Aug 31, 2022 at 11:18:22AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 25, 2022 at 09:01:17PM -0400, Waiman Long wrote:
> 
> > @@ -9352,6 +9381,11 @@ int sched_cpu_activate(unsigned int cpu)
> >  		sched_update_numa(cpu, true);
> >  		sched_domains_numa_masks_set(cpu);
> >  		cpuset_cpu_active();
> > +		/*
> > +		 * Preallocated scratch cpumask
> > +		 */
> > +		if (!rq->scratch_mask)
> > +			rq->scratch_mask = kmalloc(cpumask_size(), GFP_KERNEL);
> >  	}
> 
> this is too late; I think you'll have to add a sched_cpu_prepare() and
> simply fail the cpu-up when the allocation fails.

Alternatively, waste some more memory and add yet another per-cpu
cpumask.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race
  2022-08-26  1:01 ` [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race Waiman Long
@ 2022-08-31  9:26   ` Peter Zijlstra
  2022-08-31 20:53     ` Waiman Long
  2022-08-31  9:47   ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2022-08-31  9:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Thu, Aug 25, 2022 at 09:01:18PM -0400, Waiman Long wrote:
> Racing is possible between set_cpus_allowed_ptr() and sched_setaffinity()
> or between multiple sched_setaffinity() calls from different CPUs. To
> resolve these race conditions, we need to update both user_cpus_ptr
> and cpus_mask in a single lock critical section instead of separated
> ones. This requires moving the user_cpus_ptr update to
> affine_move_task() before doing task_rq_unlock().
> 
> A new argument puser_mask is added to affine_move_task(),
> __set_cpus_allowed_ptr_locked() and __set_cpus_allowed_ptr() to do that.
> 
> Ideally, user_cpus_ptr should only be updated if the sched_setaffinity()
> is successful. However, this patch will update user_cpus_ptr when the
> first call to __set_cpus_allowed_ptr() is successful. However, if there
> is racing between sched_setaffinity() and cpuset update, the subsequent
> calls to __set_cpus_allowed_ptr() may fail but the user_cpus_ptr will
> still be updated in this corner case.

Urgh, this is a bit weird, to have a fix for a patch in the same series.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race
  2022-08-26  1:01 ` [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race Waiman Long
  2022-08-31  9:26   ` Peter Zijlstra
@ 2022-08-31  9:47   ` Peter Zijlstra
  2022-08-31 20:56     ` Waiman Long
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2022-08-31  9:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Thu, Aug 25, 2022 at 09:01:18PM -0400, Waiman Long wrote:
> @@ -2722,6 +2734,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>  			complete = true;
>  		}
>  
> +		swap_user_cpus_ptr(p, puser_mask);
>  		task_rq_unlock(rq, p, rf);
>  
>  		if (push_task) {
> @@ -2793,6 +2806,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>  		if (flags & SCA_MIGRATE_ENABLE)
>  			p->migration_flags &= ~MDF_PUSH;
>  
> +		swap_user_cpus_ptr(p, puser_mask);
>  		task_rq_unlock(rq, p, rf);
>  
>  		if (!stop_pending) {
> @@ -2813,6 +2827,8 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>  				complete = true;
>  			}
>  		}
> +
> +		swap_user_cpus_ptr(p, puser_mask);
>  		task_rq_unlock(rq, p, rf);
>  
>  		if (complete)

I'm not at all sure about those.

Would it not be much simpler to keep the update of cpus_mask and
cpus_user_mask together, always ensuring that cpus_user_mask is a strict
superset of cpus_mask ? That is, set_cpus_allowed_common() seems like
the right place to me.

I'm thinking this also means blowing away user_mask when we do a full
reset of the cpus_mask when we do an affinity break.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2022-08-31  9:12   ` Peter Zijlstra
@ 2022-08-31 20:46     ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-31 20:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan


On 8/31/22 05:12, Peter Zijlstra wrote:
> On Thu, Aug 25, 2022 at 09:01:16PM -0400, Waiman Long wrote:
>
>
>>   void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>>   {
>> -	struct cpumask *user_mask = p->user_cpus_ptr;
>> -	unsigned long flags;
>> -
>>   	/*
>> -	 * Try to restore the old affinity mask. If this fails, then
>> -	 * we free the mask explicitly to avoid it being inherited across
>> -	 * a subsequent fork().
>> +	 * Try to restore the old affinity mask with __sched_setaffinity().
>> +	 * Cpuset masking will be done there too.
>>   	 */
>> -	if (!user_mask || !__sched_setaffinity(p, user_mask))
>> -		return;
>> -
>> -	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> -	user_mask = clear_user_cpus_ptr(p);
>> -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>> -
>> -	kfree(user_mask);
>> +	__sched_setaffinity(p, task_user_cpus(p), false);
>>   }
> We have an issue with __sched_setaffinity() failing here. I'm not sure
> ignoring the failure is the right thing -- but I'm also not entirely
> sure what is.
I am not sure what we can do in case __sched_setaffinity() fails. Maybe 
we can print a warning when this happens. What do you think?
>>   void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> @@ -8081,10 +8046,11 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
>>   #endif
>>   
>>   static int
>> -__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
>> +__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, bool save_mask)
>>   {
>>   	int retval;
>>   	cpumask_var_t cpus_allowed, new_mask;
>> +	struct cpumask *user_mask = NULL;
>>   
>>   	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
>>   		return -ENOMEM;
>> @@ -8100,8 +8066,22 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
>>   	retval = dl_task_check_affinity(p, new_mask);
>>   	if (retval)
>>   		goto out_free_new_mask;
>> +
>> +	/*
>> +	 * Save the user requested mask internally now and then update
>> +	 * user_cpus_ptr later after making sure this call will be
>> +	 * successful, i.e. retval == 0.
>> +	 */
>> +	if (save_mask) {
>> +		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
>> +		if (!user_mask) {
>> +			retval = -ENOMEM;
>> +			goto out_free_new_mask;
>> +		}
>> +		cpumask_copy(user_mask, mask);
>> +	}
>>   again:
>> -	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
>> +	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
>>   	if (retval)
>>   		goto out_free_new_mask;
>>   
>> @@ -8115,7 +8095,16 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
>>   		goto again;
>>   	}
>>   
>> +	if (save_mask) {
>> +		unsigned long flags;
>> +
>> +		/* Use pi_lock to synchronize changes to user_cpus_ptr */
>> +		raw_spin_lock_irqsave(&p->pi_lock, flags);
>> +		swap(p->user_cpus_ptr, user_mask);
>> +		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>> +	}
>>   out_free_new_mask:
>> +	kfree(user_mask);
>>   	free_cpumask_var(new_mask);
>>   out_free_cpus_allowed:
>>   	free_cpumask_var(cpus_allowed);
> I'm confused as to why it's put in this function and not in the one
> caller that actually sets the new @save_mask true, here:

Looking at this patch alone, we can certainly put the mask saving in
sched_setaffinity(). In later patches, however, I have to make the
user_cpus_ptr update happen in the same lock critical section as the
cpus_mask update. That is the reason why it is done this way here. I
can certainly make your suggested change in this patch and then move
the saving back inside the lock critical section in a later patch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 3/5] sched: Enforce user requested affinity
  2022-08-31  9:18   ` Peter Zijlstra
  2022-08-31  9:21     ` Peter Zijlstra
@ 2022-08-31 20:48     ` Waiman Long
  1 sibling, 0 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-31 20:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan


On 8/31/22 05:18, Peter Zijlstra wrote:
> On Thu, Aug 25, 2022 at 09:01:17PM -0400, Waiman Long wrote:
>
>> @@ -9352,6 +9381,11 @@ int sched_cpu_activate(unsigned int cpu)
>>   		sched_update_numa(cpu, true);
>>   		sched_domains_numa_masks_set(cpu);
>>   		cpuset_cpu_active();
>> +		/*
>> +		 * Preallocated scratch cpumask
>> +		 */
>> +		if (!rq->scratch_mask)
>> +			rq->scratch_mask = kmalloc(cpumask_size(), GFP_KERNEL);
>>   	}
> this is too late; I think you'll have to add a sched_cpu_prepare() and
> simply fail the cpu-up when the allocation fails.
>
That is the reason why I have a fallback allocation in 
__set_cpus_allowed_ptr(). Thanks for the suggestion; I will look into
doing that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race
  2022-08-31  9:26   ` Peter Zijlstra
@ 2022-08-31 20:53     ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-31 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan


On 8/31/22 05:26, Peter Zijlstra wrote:
> On Thu, Aug 25, 2022 at 09:01:18PM -0400, Waiman Long wrote:
>> Racing is possible between set_cpus_allowed_ptr() and sched_setaffinity()
>> or between multiple sched_setaffinity() calls from different CPUs. To
>> resolve these race conditions, we need to update both user_cpus_ptr
>> and cpus_mask in a single lock critical section instead of separated
>> ones. This requires moving the user_cpus_ptr update to
>> affine_move_task() before doing task_rq_unlock().
>>
>> A new argument puser_mask is added to affine_move_task(),
>> __set_cpus_allowed_ptr_locked() and __set_cpus_allowed_ptr() to do that.
>>
>> Ideally, user_cpus_ptr should only be updated if the sched_setaffinity()
>> is successful. However, this patch will update user_cpus_ptr when the
>> first call to __set_cpus_allowed_ptr() is successful. However, if there
>> is racing between sched_setaffinity() and cpuset update, the subsequent
>> calls to __set_cpus_allowed_ptr() may fail but the user_cpus_ptr will
>> still be updated in this corner case.
> Urgh, this is a bit weird, to have a fix for a patch in the same series.

This is just to make each patch simpler and easier to read.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race
  2022-08-31  9:47   ` Peter Zijlstra
@ 2022-08-31 20:56     ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-31 20:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On 8/31/22 05:47, Peter Zijlstra wrote:
> On Thu, Aug 25, 2022 at 09:01:18PM -0400, Waiman Long wrote:
>> @@ -2722,6 +2734,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>>   			complete = true;
>>   		}
>>   
>> +		swap_user_cpus_ptr(p, puser_mask);
>>   		task_rq_unlock(rq, p, rf);
>>   
>>   		if (push_task) {
>> @@ -2793,6 +2806,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>>   		if (flags & SCA_MIGRATE_ENABLE)
>>   			p->migration_flags &= ~MDF_PUSH;
>>   
>> +		swap_user_cpus_ptr(p, puser_mask);
>>   		task_rq_unlock(rq, p, rf);
>>   
>>   		if (!stop_pending) {
>> @@ -2813,6 +2827,8 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>>   				complete = true;
>>   			}
>>   		}
>> +
>> +		swap_user_cpus_ptr(p, puser_mask);
>>   		task_rq_unlock(rq, p, rf);
>>   
>>   		if (complete)
> I'm not at all sure about those.
>
> Would it not be much simpler to keep the update of cpus_mask and
> cpus_user_mask together, always ensuring that cpus_user_mask is a strict
> superset of cpus_mask ? That is, set_cpus_allowed_common() seems like
> the right place to me.
>
> I'm thinking this also means blowing away user_mask when we do a full
> reset of the cpus_mask when we do an affinity break.

Thanks for the suggestion. I will need to think about the best way to
pass those additional parameters in there.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 3/5] sched: Enforce user requested affinity
  2022-08-31  9:21     ` Peter Zijlstra
@ 2022-08-31 21:00       ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2022-08-31 21:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan


On 8/31/22 05:21, Peter Zijlstra wrote:
> On Wed, Aug 31, 2022 at 11:18:22AM +0200, Peter Zijlstra wrote:
>> On Thu, Aug 25, 2022 at 09:01:17PM -0400, Waiman Long wrote:
>>
>>> @@ -9352,6 +9381,11 @@ int sched_cpu_activate(unsigned int cpu)
>>>   		sched_update_numa(cpu, true);
>>>   		sched_domains_numa_masks_set(cpu);
>>>   		cpuset_cpu_active();
>>> +		/*
>>> +		 * Preallocated scratch cpumask
>>> +		 */
>>> +		if (!rq->scratch_mask)
>>> +			rq->scratch_mask = kmalloc(cpumask_size(), GFP_KERNEL);
>>>   	}
>> this is too late; I think you'll have to add a sched_cpu_prepare() and
>> simply fail the cpu-up when the allocation fails.
> Alternatively, waste some more memory and add yet another per-cpu
> cpumask.

A percpu cpumask is probably a better idea. However, I don't need that 
as early as the other cpumasks like __cpu_possible_mask. Maybe I can do 
a percpu memory allocation early in the pre-smp boot.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 17+ messages in thread

Thread overview: 17+ messages
2022-08-26  1:01 [PATCH v6 0/5] sched: Persistent user requested affinity Waiman Long
2022-08-26  1:01 ` [PATCH v6 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
2022-08-26  1:01 ` [PATCH v6 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
2022-08-31  9:12   ` Peter Zijlstra
2022-08-31 20:46     ` Waiman Long
2022-08-26  1:01 ` [PATCH v6 3/5] sched: Enforce user requested affinity Waiman Long
2022-08-31  9:14   ` Peter Zijlstra
2022-08-31  9:18   ` Peter Zijlstra
2022-08-31  9:21     ` Peter Zijlstra
2022-08-31 21:00       ` Waiman Long
2022-08-31 20:48     ` Waiman Long
2022-08-26  1:01 ` [PATCH v6 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race Waiman Long
2022-08-31  9:26   ` Peter Zijlstra
2022-08-31 20:53     ` Waiman Long
2022-08-31  9:47   ` Peter Zijlstra
2022-08-31 20:56     ` Waiman Long
2022-08-26  1:01 ` [PATCH v6 5/5] sched: Fix sched_setaffinity() and fork/clone() race Waiman Long
