* [PATCH v10 0/5] sched: Persistent user requested affinity
@ 2022-09-22 18:00 Waiman Long
  2022-09-22 18:00 ` [PATCH v10 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Waiman Long @ 2022-09-22 18:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

v10:
 - Squash patches 4, 5 & 6 together into a single patch as suggested
   by PeterZ.

v9:
 - Fix non-SMP config build errors in patch 4 reported by kernel test
   robot.
 - Disable user_cpus_ptr masking also when any of the SCA_MIGRATE_*
   flags are set.

v8:
 - Add patches "sched: Introduce affinity_context structure" and
   "sched: Always clear user_cpus_ptr in do_set_cpus_allowed()" as
   suggested by PeterZ.
 - Better handle cpuset and sched_setaffinity() race in patch 5.

v7:
 - Make changes recommended by Peter such as making scratch_mask
   allocation earlier and using percpu variable to pass information.

The user_cpus_ptr field was added by commit b90ca8badbd1 ("sched:
Introduce task_struct::user_cpus_ptr to track requested affinity"),
which uses it narrowly to keep CPU affinity intact on asymmetric CPU
setups.

This patchset extends user_cpus_ptr to store the user requested CPU
affinity set via the sched_setaffinity() API. With that information
available, cpuset and other callers of set_cpus_allowed_ptr(), such as
hotplug, can keep CPU affinity as close to what the user wants as
possible within the CPU list constraint of the current cpuset. Without
it, a change to the cpuset hierarchy or a hotplug event may reset the
cpumask of the tasks in the affected cpusets to the default cpuset
value even if those tasks had their CPU affinity explicitly set by the
user beforehand.

It also means that once sched_setaffinity() is called successfully,
user_cpus_ptr will remain allocated until the task exits except in some
rare circumstances.
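
As an illustration only (not part of this series), the user-facing API
whose result the series makes persistent is sched_setaffinity(2). A
minimal user-space sketch that requests an affinity mask and reads back
the effective one could look like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);		/* request CPUs 2 and 3 */
	CPU_SET(3, &set);

	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return EXIT_FAILURE;
	}

	/* The effective mask may be further clipped by the current cpuset. */
	if (sched_getaffinity(0, sizeof(set), &set)) {
		perror("sched_getaffinity");
		return EXIT_FAILURE;
	}

	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			printf("allowed: CPU %d\n", cpu);

	return EXIT_SUCCESS;
}

Before this series, a later cpuset update or hotplug event could
permanently replace the requested CPUs 2-3 with the cpuset default; with
the requested mask saved in user_cpus_ptr, the kernel can keep the
effective affinity as close to the request as the current cpuset allows.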

Waiman Long (5):
  sched: Add __releases annotations to affine_move_task()
  sched: Use user_cpus_ptr for saving user provided cpumask in
    sched_setaffinity()
  sched: Enforce user requested affinity
  sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other
    races
  sched: Always clear user_cpus_ptr in do_set_cpus_allowed()

 kernel/sched/core.c     | 231 +++++++++++++++++++++++++---------------
 kernel/sched/deadline.c |   7 +-
 kernel/sched/sched.h    |  22 +++-
 3 files changed, 169 insertions(+), 91 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v10 1/5] sched: Add __releases annotations to affine_move_task()
  2022-09-22 18:00 [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
@ 2022-09-22 18:00 ` Waiman Long
  2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Waiman Long
  2022-09-22 18:00 ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Waiman Long @ 2022-09-22 18:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

affine_move_task() assumes task_rq_lock() has been called and it does
an implicit task_rq_unlock() before returning. Add the appropriate
__releases annotations to make this clear.

A typo in a comment is also fixed.
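
For background (not part of the patch), __acquires()/__releases() are
Sparse context annotations that compile away outside of a sparse run. A
hypothetical helper that is entered with the rq lock held and drops it
before returning would be annotated like this (function name is
illustrative only):

/* Hypothetical example: document a lock hand-off for sparse checking. */
static void finish_and_unlock(struct rq *rq, struct rq_flags *rf)
	__releases(rq->lock)
{
	/* ... work done while the lock is still held ... */
	rq_unlock(rq, rf);
}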

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..b351e6d173b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2696,6 +2696,8 @@ void release_user_cpus_ptr(struct task_struct *p)
  */
 static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flags *rf,
 			    int dest_cpu, unsigned int flags)
+	__releases(rq->lock)
+	__releases(p->pi_lock)
 {
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
 	bool stop_pending, complete = false;
@@ -3005,7 +3007,7 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 
 /*
  * Restrict the CPU affinity of task @p so that it is a subset of
- * task_cpu_possible_mask() and point @p->user_cpu_ptr to a copy of the
+ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
  * old affinity mask. If the resulting mask is empty, we warn and walk
  * up the cpuset hierarchy until we find a suitable mask.
  */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2022-09-22 18:00 [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
  2022-09-22 18:00 ` [PATCH v10 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
@ 2022-09-22 18:00 ` Waiman Long
  2022-10-28  6:42   ` [tip: sched/core] sched: Always preserve the user requested cpumask tip-bot2 for Waiman Long
  2023-01-17 16:08   ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Will Deacon
  2022-09-22 18:00 ` [PATCH v10 3/5] sched: Enforce user requested affinity Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Waiman Long @ 2022-09-22 18:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

The user_cpus_ptr field was added by commit b90ca8badbd1 ("sched:
Introduce task_struct::user_cpus_ptr to track requested affinity"). It
is currently used only by the arm64 arch due to possible asymmetric
CPU setups. This patch extends its usage to save the user provided
cpumask when sched_setaffinity() is called, for all arches. With this
patch applied, user_cpus_ptr, once allocated after a successful call to
sched_setaffinity(), will only be freed when the task exits.

Since user_cpus_ptr is supposed to be used for "requested affinity",
there is no point in saving the current CPU affinity in
restrict_cpus_allowed_ptr() if sched_setaffinity() has never been
called. Modify the logic to set user_cpus_ptr only in
sched_setaffinity() and to use it in restrict_cpus_allowed_ptr() and
relax_compatible_cpus_allowed_ptr() if set, without modifying it.

This results in some changes in behavior for arm64 systems with
asymmetric CPUs in some corner cases. For instance, if
sched_setaffinity() has never been called and there is a cpuset change
before relax_compatible_cpus_allowed_ptr() is called, the latter will
now follow what the cpuset allows rather than what the previous CPU
affinity setting allowed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c  | 82 ++++++++++++++++++++------------------------
 kernel/sched/sched.h |  7 ++++
 2 files changed, 44 insertions(+), 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b351e6d173b7..c7c0425974c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2850,7 +2850,6 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
 	const struct cpumask *cpu_valid_mask = cpu_active_mask;
 	bool kthread = p->flags & PF_KTHREAD;
-	struct cpumask *user_mask = NULL;
 	unsigned int dest_cpu;
 	int ret = 0;
 
@@ -2909,14 +2908,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, new_mask, flags);
 
-	if (flags & SCA_USER)
-		user_mask = clear_user_cpus_ptr(p);
-
-	ret = affine_move_task(rq, p, rf, dest_cpu, flags);
-
-	kfree(user_mask);
-
-	return ret;
+	return affine_move_task(rq, p, rf, dest_cpu, flags);
 
 out:
 	task_rq_unlock(rq, p, rf);
@@ -2951,8 +2943,10 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
  * Change a given task's CPU affinity to the intersection of its current
- * affinity mask and @subset_mask, writing the resulting mask to @new_mask
- * and pointing @p->user_cpus_ptr to a copy of the old mask.
+ * affinity mask and @subset_mask, writing the resulting mask to @new_mask.
+ * If user_cpus_ptr is defined, use it as the basis for restricting CPU
+ * affinity or use cpu_online_mask instead.
+ *
  * If the resulting mask is empty, leave the affinity unchanged and return
  * -EINVAL.
  */
@@ -2960,17 +2954,10 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 				     struct cpumask *new_mask,
 				     const struct cpumask *subset_mask)
 {
-	struct cpumask *user_mask = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
 	int err;
 
-	if (!p->user_cpus_ptr) {
-		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
-		if (!user_mask)
-			return -ENOMEM;
-	}
-
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -2983,25 +2970,15 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 		goto err_unlock;
 	}
 
-	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
+	if (!cpumask_and(new_mask, task_user_cpus(p), subset_mask)) {
 		err = -EINVAL;
 		goto err_unlock;
 	}
 
-	/*
-	 * We're about to butcher the task affinity, so keep track of what
-	 * the user asked for in case we're able to restore it later on.
-	 */
-	if (user_mask) {
-		cpumask_copy(user_mask, p->cpus_ptr);
-		p->user_cpus_ptr = user_mask;
-	}
-
 	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
 
 err_unlock:
 	task_rq_unlock(rq, p, &rf);
-	kfree(user_mask);
 	return err;
 }
 
@@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
- * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
- * @p->user_cpus_ptr.
+ * call to force_compatible_cpus_allowed_ptr().
  *
  * It is the caller's responsibility to serialise this with any calls to
  * force_compatible_cpus_allowed_ptr(@p).
  */
 void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 {
-	struct cpumask *user_mask = p->user_cpus_ptr;
-	unsigned long flags;
+	int ret;
 
 	/*
-	 * Try to restore the old affinity mask. If this fails, then
-	 * we free the mask explicitly to avoid it being inherited across
-	 * a subsequent fork().
+	 * Try to restore the old affinity mask with __sched_setaffinity().
+	 * Cpuset masking will be done there too.
 	 */
-	if (!user_mask || !__sched_setaffinity(p, user_mask))
-		return;
-
-	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	user_mask = clear_user_cpus_ptr(p);
-	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-
-	kfree(user_mask);
+	ret = __sched_setaffinity(p, task_user_cpus(p));
+	WARN_ON_ONCE(ret);
 }
 
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
@@ -8101,7 +8069,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	if (retval)
 		goto out_free_new_mask;
 again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
+	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
 	if (retval)
 		goto out_free_new_mask;
 
@@ -8124,6 +8092,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
+	struct cpumask *user_mask;
 	struct task_struct *p;
 	int retval;
 
@@ -8158,7 +8127,30 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_put_task;
 
+	user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
+	if (!user_mask) {
+		retval = -ENOMEM;
+		goto out_put_task;
+	}
+	cpumask_copy(user_mask, in_mask);
+
 	retval = __sched_setaffinity(p, in_mask);
+
+	/*
+	 * Save in_mask into user_cpus_ptr after a successful
+	 * __sched_setaffinity() call. pi_lock is used to synchronize
+	 * changes to user_cpus_ptr.
+	 */
+	if (!retval) {
+		unsigned long flags;
+
+		/* Use pi_lock to synchronize changes to user_cpus_ptr */
+		raw_spin_lock_irqsave(&p->pi_lock, flags);
+		swap(p->user_cpus_ptr, user_mask);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	}
+	kfree(user_mask);
+
 out_put_task:
 	put_task_struct(p);
 	return retval;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..ac235bc8ef08 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1881,6 +1881,13 @@ static inline void dirty_sched_domain_sysctl(int cpu)
 #endif
 
 extern int sched_update_scaling(void);
+
+static inline const struct cpumask *task_user_cpus(struct task_struct *p)
+{
+	if (!p->user_cpus_ptr)
+		return cpu_possible_mask; /* &init_task.cpus_mask */
+	return p->user_cpus_ptr;
+}
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v10 3/5] sched: Enforce user requested affinity
  2022-09-22 18:00 [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
  2022-09-22 18:00 ` [PATCH v10 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
  2022-09-22 18:00 ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
@ 2022-09-22 18:00 ` Waiman Long
  2022-10-07 10:01   ` Peter Zijlstra
  2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Waiman Long
  2022-09-22 18:00 ` [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races Waiman Long
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Waiman Long @ 2022-09-22 18:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

It was found that the user requested affinity via sched_setaffinity()
can be easily overwritten by other kernel subsystems without an easy
way to reset it back to what the user requested. For example, any
change to the current cpuset hierarchy may reset the cpumask of the
tasks in the affected cpusets to the default cpuset value even if those
tasks have pre-existing user requested affinity. That is especially
easy to trigger under a cgroup v2 environment where writing "+cpuset"
to the root cgroup's cgroup.subtree_control file will reset the CPU
affinity of all the processes in the system.

That is problematic in a nohz_full environment where the tasks running
on the nohz_full CPUs usually have their CPU affinity explicitly set
and will behave incorrectly if that affinity changes.

Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
and using it to restrict the given cpumask unless there is no overlap.
In that case, it will fall back to the given one. The SCA_USER flag is
reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr
masking should be skipped. In addition, masking should also be skipped
if any of the SCA_MIGRATE_* flags is set.

All callers of set_cpus_allowed_ptr() will be affected by this change.
A scratch cpumask is added to the per-CPU runqueues structure for doing
additional masking when user_cpus_ptr is set.
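
The intersection-or-fallback rule can be illustrated stand-alone with
plain bitmasks (a sketch only, not kernel code; in the patch,
cpumask_and() likewise returns false when the intersection is empty):

#include <stdbool.h>
#include <stdio.h>

/* Sketch of the masking policy: restrict the requested mask by the
 * user requested one unless the two do not overlap at all. */
static unsigned long effective_mask(unsigned long new_mask,
				    unsigned long user_mask,
				    bool has_user_mask)
{
	if (has_user_mask && (new_mask & user_mask))
		return new_mask & user_mask;
	return new_mask;	/* no overlap or no user mask: use new_mask */
}

int main(void)
{
	/* cpuset allows CPUs 0-3, user previously asked for CPUs 2-5 */
	printf("%#lx\n", effective_mask(0x0fUL, 0x3cUL, true));	/* 0xc */
	/* cpuset allows CPUs 0-1, user asked for CPUs 4-5: no overlap */
	printf("%#lx\n", effective_mask(0x03UL, 0x30UL, true));	/* 0x3 */
	return 0;
}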

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c  | 22 +++++++++++++++++-----
 kernel/sched/sched.h |  3 +++
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7c0425974c2..ab8e591dbaf5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2932,6 +2932,15 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	struct rq *rq;
 
 	rq = task_rq_lock(p, &rf);
+	/*
+	 * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
+	 * flags are set.
+	 */
+	if (p->user_cpus_ptr &&
+	    !(flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
+	    cpumask_and(rq->scratch_mask, new_mask, p->user_cpus_ptr))
+		new_mask = rq->scratch_mask;
+
 	return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
 }
 
@@ -3028,7 +3037,7 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
 }
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
@@ -3045,7 +3054,7 @@ void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 	 * Try to restore the old affinity mask with __sched_setaffinity().
 	 * Cpuset masking will be done there too.
 	 */
-	ret = __sched_setaffinity(p, task_user_cpus(p));
+	ret = __sched_setaffinity(p, task_user_cpus(p), 0);
 	WARN_ON_ONCE(ret);
 }
 
@@ -8049,7 +8058,7 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
 #endif
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags)
 {
 	int retval;
 	cpumask_var_t cpus_allowed, new_mask;
@@ -8069,7 +8078,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	if (retval)
 		goto out_free_new_mask;
 again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | flags);
 	if (retval)
 		goto out_free_new_mask;
 
@@ -8134,7 +8143,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	}
 	cpumask_copy(user_mask, in_mask);
 
-	retval = __sched_setaffinity(p, in_mask);
+	retval = __sched_setaffinity(p, in_mask, SCA_USER);
 
 	/*
 	 * Save in_mask into user_cpus_ptr after a successful
@@ -9647,6 +9656,9 @@ void __init sched_init(void)
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 		per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(runqueues.scratch_mask, i) =
+			(cpumask_var_t)kzalloc_node(cpumask_size(),
+						    GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac235bc8ef08..482b702d65ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1159,6 +1159,9 @@ struct rq {
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
 #endif
+
+	/* Scratch cpumask to be temporarily used under rq_lock */
+	cpumask_var_t		scratch_mask;
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races
  2022-09-22 18:00 [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
                   ` (2 preceding siblings ...)
  2022-09-22 18:00 ` [PATCH v10 3/5] sched: Enforce user requested affinity Waiman Long
@ 2022-09-22 18:00 ` Waiman Long
  2022-10-07 12:47   ` Peter Zijlstra
  2022-10-28  6:42   ` [tip: sched/core] sched: Introduce affinity_context tip-bot2 for Waiman Long
  2022-09-22 18:00 ` [PATCH v10 5/5] sched: Always clear user_cpus_ptr in do_set_cpus_allowed() Waiman Long
  2022-10-06 21:31 ` [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
  5 siblings, 2 replies; 25+ messages in thread
From: Waiman Long @ 2022-09-22 18:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

Races are possible between set_cpus_allowed_ptr() and sched_setaffinity()
or between multiple sched_setaffinity() calls from different CPUs. To
resolve these race conditions, we need to update both user_cpus_ptr and
cpus_mask in a single lock critical section instead of separate ones.
This requires moving the user_cpus_ptr update into
set_cpus_allowed_common() by putting the user_mask into a new
affinity_context structure and using it to pass information around the
various functions.

This patch also changes the handling of the race between a
sched_setaffinity() call and a change of the cpumask of the current
cpuset. In case the new mask conflicts with the newly updated cpuset,
the cpus_mask will be reset to the cpuset cpumask and an error value of
-EINVAL will be returned. If a previous user_cpus_ptr value exists, it
will be swapped back in and the new_mask will be further restricted to
what is allowed in the cpumask pointed to by the old user_cpus_ptr.

The potential race between sched_setaffinity() and a fork/clone()
syscall calling dup_user_cpus_ptr() is also handled.
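
As a rough user-space analogy (a sketch only; names are illustrative
and a pthread mutex stands in for the pi_lock/rq lock), grouping the
new mask, the saved user mask and the flags into one context lets both
fields be updated inside a single critical section, so concurrent
updaters cannot interleave between the two writes:

#include <pthread.h>
#include <string.h>

#define MASK_BYTES	16
#define FLAG_USER	0x1

/* Illustrative stand-ins for the kernel structures (not kernel code). */
struct affinity_ctx {
	const unsigned char *new_mask;	/* mask to make effective */
	unsigned char *user_mask;	/* saved user mask to swap in, or NULL */
	unsigned int flags;
};

struct fake_task {
	pthread_mutex_t lock;		/* plays the role of pi_lock/rq lock */
	unsigned char cpus_mask[MASK_BYTES];
	unsigned char *user_cpus_ptr;
};

/*
 * Both cpus_mask and user_cpus_ptr are updated under the same lock,
 * mirroring how the patch folds the user_cpus_ptr swap into
 * set_cpus_allowed_common(). The old user mask is handed back through
 * ctx->user_mask for the caller to free.
 */
static void apply_affinity(struct fake_task *t, struct affinity_ctx *ctx)
{
	pthread_mutex_lock(&t->lock);
	memcpy(t->cpus_mask, ctx->new_mask, MASK_BYTES);
	if (ctx->flags & FLAG_USER) {
		unsigned char *old = t->user_cpus_ptr;

		t->user_cpus_ptr = ctx->user_mask;
		ctx->user_mask = old;
	}
	pthread_mutex_unlock(&t->lock);
}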

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c     | 169 ++++++++++++++++++++++++++--------------
 kernel/sched/deadline.c |   7 +-
 kernel/sched/sched.h    |  12 ++-
 3 files changed, 122 insertions(+), 66 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab8e591dbaf5..ce626cad4105 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2195,14 +2195,18 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 #ifdef CONFIG_SMP
 
 static void
-__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32 flags);
+__do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx);
 
 static int __set_cpus_allowed_ptr(struct task_struct *p,
-				  const struct cpumask *new_mask,
-				  u32 flags);
+				  struct affinity_context *ctx);
 
 static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
 {
+	struct affinity_context ac = {
+		.new_mask  = cpumask_of(rq->cpu),
+		.flags     = SCA_MIGRATE_DISABLE,
+	};
+
 	if (likely(!p->migration_disabled))
 		return;
 
@@ -2212,7 +2216,7 @@ static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
 	/*
 	 * Violates locking rules! see comment in __do_set_cpus_allowed().
 	 */
-	__do_set_cpus_allowed(p, cpumask_of(rq->cpu), SCA_MIGRATE_DISABLE);
+	__do_set_cpus_allowed(p, &ac);
 }
 
 void migrate_disable(void)
@@ -2234,6 +2238,10 @@ EXPORT_SYMBOL_GPL(migrate_disable);
 void migrate_enable(void)
 {
 	struct task_struct *p = current;
+	struct affinity_context ac = {
+		.new_mask  = &p->cpus_mask,
+		.flags     = SCA_MIGRATE_ENABLE,
+	};
 
 	if (p->migration_disabled > 1) {
 		p->migration_disabled--;
@@ -2249,7 +2257,7 @@ void migrate_enable(void)
 	 */
 	preempt_disable();
 	if (p->cpus_ptr != &p->cpus_mask)
-		__set_cpus_allowed_ptr(p, &p->cpus_mask, SCA_MIGRATE_ENABLE);
+		__set_cpus_allowed_ptr(p, &ac);
 	/*
 	 * Mustn't clear migration_disabled() until cpus_ptr points back at the
 	 * regular cpus_mask, otherwise things that race (eg.
@@ -2529,19 +2537,25 @@ int push_cpu_stop(void *arg)
  * sched_class::set_cpus_allowed must do the below, but is not required to
  * actually call this function.
  */
-void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask, u32 flags)
+void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx)
 {
-	if (flags & (SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) {
-		p->cpus_ptr = new_mask;
+	if (ctx->flags & (SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) {
+		p->cpus_ptr = ctx->new_mask;
 		return;
 	}
 
-	cpumask_copy(&p->cpus_mask, new_mask);
-	p->nr_cpus_allowed = cpumask_weight(new_mask);
+	cpumask_copy(&p->cpus_mask, ctx->new_mask);
+	p->nr_cpus_allowed = cpumask_weight(ctx->new_mask);
+
+	/*
+	 * Swap in a new user_cpus_ptr if SCA_USER flag set
+	 */
+	if (ctx->flags & SCA_USER)
+		swap(p->user_cpus_ptr, ctx->user_mask);
 }
 
 static void
-__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32 flags)
+__do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 {
 	struct rq *rq = task_rq(p);
 	bool queued, running;
@@ -2558,7 +2572,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 	 *
 	 * XXX do further audits, this smells like something putrid.
 	 */
-	if (flags & SCA_MIGRATE_DISABLE)
+	if (ctx->flags & SCA_MIGRATE_DISABLE)
 		SCHED_WARN_ON(!p->on_cpu);
 	else
 		lockdep_assert_held(&p->pi_lock);
@@ -2577,7 +2591,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 	if (running)
 		put_prev_task(rq, p);
 
-	p->sched_class->set_cpus_allowed(p, new_mask, flags);
+	p->sched_class->set_cpus_allowed(p, ctx);
 
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
@@ -2587,12 +2601,19 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 
 void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
-	__do_set_cpus_allowed(p, new_mask, 0);
+	struct affinity_context ac = {
+		.new_mask  = new_mask,
+		.flags     = 0,
+	};
+
+	__do_set_cpus_allowed(p, &ac);
 }
 
 int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
 		      int node)
 {
+	unsigned long flags;
+
 	if (!src->user_cpus_ptr)
 		return 0;
 
@@ -2600,7 +2621,10 @@ int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
 	if (!dst->user_cpus_ptr)
 		return -ENOMEM;
 
+	/* Use pi_lock to protect content of user_cpus_ptr */
+	raw_spin_lock_irqsave(&src->pi_lock, flags);
 	cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+	raw_spin_unlock_irqrestore(&src->pi_lock, flags);
 	return 0;
 }
 
@@ -2840,8 +2864,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
  * Called with both p->pi_lock and rq->lock held; drops both before returning.
  */
 static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
-					 const struct cpumask *new_mask,
-					 u32 flags,
+					 struct affinity_context *ctx,
 					 struct rq *rq,
 					 struct rq_flags *rf)
 	__releases(rq->lock)
@@ -2869,7 +2892,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 		cpu_valid_mask = cpu_online_mask;
 	}
 
-	if (!kthread && !cpumask_subset(new_mask, cpu_allowed_mask)) {
+	if (!kthread && !cpumask_subset(ctx->new_mask, cpu_allowed_mask)) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -2878,18 +2901,18 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	 * Must re-check here, to close a race against __kthread_bind(),
 	 * sched_setaffinity() is not guaranteed to observe the flag.
 	 */
-	if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
+	if ((ctx->flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	if (!(flags & SCA_MIGRATE_ENABLE)) {
-		if (cpumask_equal(&p->cpus_mask, new_mask))
+	if (!(ctx->flags & SCA_MIGRATE_ENABLE)) {
+		if (cpumask_equal(&p->cpus_mask, ctx->new_mask))
 			goto out;
 
 		if (WARN_ON_ONCE(p == current &&
 				 is_migration_disabled(p) &&
-				 !cpumask_test_cpu(task_cpu(p), new_mask))) {
+				 !cpumask_test_cpu(task_cpu(p), ctx->new_mask))) {
 			ret = -EBUSY;
 			goto out;
 		}
@@ -2900,15 +2923,15 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	 * for groups of tasks (ie. cpuset), so that load balancing is not
 	 * immediately required to distribute the tasks within their new mask.
 	 */
-	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, new_mask);
+	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
 	if (dest_cpu >= nr_cpu_ids) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	__do_set_cpus_allowed(p, new_mask, flags);
+	__do_set_cpus_allowed(p, ctx);
 
-	return affine_move_task(rq, p, rf, dest_cpu, flags);
+	return affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
 
 out:
 	task_rq_unlock(rq, p, rf);
@@ -2926,7 +2949,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
  * call is not atomic; no spinlocks may be held.
  */
 static int __set_cpus_allowed_ptr(struct task_struct *p,
-				  const struct cpumask *new_mask, u32 flags)
+				  struct affinity_context *ctx)
 {
 	struct rq_flags rf;
 	struct rq *rq;
@@ -2937,16 +2960,21 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	 * flags are set.
 	 */
 	if (p->user_cpus_ptr &&
-	    !(flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
-	    cpumask_and(rq->scratch_mask, new_mask, p->user_cpus_ptr))
-		new_mask = rq->scratch_mask;
+	    !(ctx->flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
+	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
+		ctx->new_mask = rq->scratch_mask;
 
-	return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
+	return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);
 }
 
 int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 {
-	return __set_cpus_allowed_ptr(p, new_mask, 0);
+	struct affinity_context ac = {
+		.new_mask  = new_mask,
+		.flags     = 0,
+	};
+
+	return __set_cpus_allowed_ptr(p, &ac);
 }
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
@@ -2963,6 +2991,10 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 				     struct cpumask *new_mask,
 				     const struct cpumask *subset_mask)
 {
+	struct affinity_context ac = {
+		.new_mask  = new_mask,
+		.flags     = 0,
+	};
 	struct rq_flags rf;
 	struct rq *rq;
 	int err;
@@ -2984,7 +3016,7 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 		goto err_unlock;
 	}
 
-	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
+	return __set_cpus_allowed_ptr_locked(p, &ac, rq, &rf);
 
 err_unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -3037,7 +3069,7 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
 }
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags);
+__sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
@@ -3048,13 +3080,17 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags
  */
 void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 {
+	struct affinity_context ac = {
+		.new_mask  = task_user_cpus(p),
+		.flags     = 0,
+	};
 	int ret;
 
 	/*
 	 * Try to restore the old affinity mask with __sched_setaffinity().
 	 * Cpuset masking will be done there too.
 	 */
-	ret = __sched_setaffinity(p, task_user_cpus(p), 0);
+	ret = __sched_setaffinity(p, &ac);
 	WARN_ON_ONCE(ret);
 }
 
@@ -3533,10 +3569,9 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
 #else /* CONFIG_SMP */
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
-					 const struct cpumask *new_mask,
-					 u32 flags)
+					 struct affinity_context *ctx)
 {
-	return set_cpus_allowed_ptr(p, new_mask);
+	return set_cpus_allowed_ptr(p, ctx->new_mask);
 }
 
 static inline void migrate_disable_switch(struct rq *rq, struct task_struct *p) { }
@@ -8058,7 +8093,7 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
 #endif
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags)
+__sched_setaffinity(struct task_struct *p, struct affinity_context *ctx)
 {
 	int retval;
 	cpumask_var_t cpus_allowed, new_mask;
@@ -8072,13 +8107,16 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags
 	}
 
 	cpuset_cpus_allowed(p, cpus_allowed);
-	cpumask_and(new_mask, mask, cpus_allowed);
+	cpumask_and(new_mask, ctx->new_mask, cpus_allowed);
+
+	ctx->new_mask = new_mask;
+	ctx->flags |= SCA_CHECK;
 
 	retval = dl_task_check_affinity(p, new_mask);
 	if (retval)
 		goto out_free_new_mask;
-again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | flags);
+
+	retval = __set_cpus_allowed_ptr(p, ctx);
 	if (retval)
 		goto out_free_new_mask;
 
@@ -8089,7 +8127,24 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags
 		 * Just reset the cpumask to the cpuset's cpus_allowed.
 		 */
 		cpumask_copy(new_mask, cpus_allowed);
-		goto again;
+
+		/*
+		 * If SCA_USER is set, a 2nd call to __set_cpus_allowed_ptr()
+		 * will restore the previous user_cpus_ptr value.
+		 *
+		 * In the unlikely event a previous user_cpus_ptr exists,
+		 * we need to further restrict the mask to what is allowed
+		 * by that old user_cpus_ptr.
+		 */
+		if (unlikely((ctx->flags & SCA_USER) && ctx->user_mask)) {
+			bool empty = !cpumask_and(new_mask, new_mask,
+						  ctx->user_mask);
+
+			if (WARN_ON_ONCE(empty))
+				cpumask_copy(new_mask, cpus_allowed);
+		}
+		__set_cpus_allowed_ptr(p, ctx);
+		retval = -EINVAL;
 	}
 
 out_free_new_mask:
@@ -8101,6 +8156,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags
 
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
+	struct affinity_context ac;
 	struct cpumask *user_mask;
 	struct task_struct *p;
 	int retval;
@@ -8142,23 +8198,14 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 		goto out_put_task;
 	}
 	cpumask_copy(user_mask, in_mask);
+	ac = (struct affinity_context){
+		.new_mask  = in_mask,
+		.user_mask = user_mask,
+		.flags     = SCA_USER,
+	};
 
-	retval = __sched_setaffinity(p, in_mask, SCA_USER);
-
-	/*
-	 * Save in_mask into user_cpus_ptr after a successful
-	 * __sched_setaffinity() call. pi_lock is used to synchronize
-	 * changes to user_cpus_ptr.
-	 */
-	if (!retval) {
-		unsigned long flags;
-
-		/* Use pi_lock to synchronize changes to user_cpus_ptr */
-		raw_spin_lock_irqsave(&p->pi_lock, flags);
-		swap(p->user_cpus_ptr, user_mask);
-		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-	}
-	kfree(user_mask);
+	retval = __sched_setaffinity(p, &ac);
+	kfree(ac.user_mask);
 
 out_put_task:
 	put_task_struct(p);
@@ -8940,6 +8987,12 @@ void show_state_filter(unsigned int state_filter)
  */
 void __init init_idle(struct task_struct *idle, int cpu)
 {
+#ifdef CONFIG_SMP
+	struct affinity_context ac = (struct affinity_context) {
+		.new_mask  = cpumask_of(cpu),
+		.flags     = 0,
+	};
+#endif
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
@@ -8964,7 +9017,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
 	 *
 	 * And since this is boot we can forgo the serialization.
 	 */
-	set_cpus_allowed_common(idle, cpumask_of(cpu), 0);
+	set_cpus_allowed_common(idle, &ac);
 #endif
 	/*
 	 * We're having a chicken and egg problem, even though we are
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0ab79d819a0d..38fa2c3ef7db 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2486,8 +2486,7 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
 }
 
 static void set_cpus_allowed_dl(struct task_struct *p,
-				const struct cpumask *new_mask,
-				u32 flags)
+				struct affinity_context *ctx)
 {
 	struct root_domain *src_rd;
 	struct rq *rq;
@@ -2502,7 +2501,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 	 * update. We already made space for us in the destination
 	 * domain (see cpuset_can_attach()).
 	 */
-	if (!cpumask_intersects(src_rd->span, new_mask)) {
+	if (!cpumask_intersects(src_rd->span, ctx->new_mask)) {
 		struct dl_bw *src_dl_b;
 
 		src_dl_b = dl_bw_of(cpu_of(rq));
@@ -2516,7 +2515,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 		raw_spin_unlock(&src_dl_b->lock);
 	}
 
-	set_cpus_allowed_common(p, new_mask, flags);
+	set_cpus_allowed_common(p, ctx);
 }
 
 /* Assumes rq->lock is held */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 482b702d65ea..110e13b7d78b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2157,6 +2157,12 @@ extern const u32		sched_prio_to_wmult[40];
 
 #define RETRY_TASK		((void *)-1UL)
 
+struct affinity_context {
+	const struct cpumask *new_mask;
+	struct cpumask *user_mask;
+	unsigned int flags;
+};
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -2185,9 +2191,7 @@ struct sched_class {
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
 
-	void (*set_cpus_allowed)(struct task_struct *p,
-				 const struct cpumask *newmask,
-				 u32 flags);
+	void (*set_cpus_allowed)(struct task_struct *p, struct affinity_context *ctx);
 
 	void (*rq_online)(struct rq *rq);
 	void (*rq_offline)(struct rq *rq);
@@ -2301,7 +2305,7 @@ extern void update_group_capacity(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
 
-extern void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask, u32 flags);
+extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx);
 
 static inline struct task_struct *get_push_task(struct rq *rq)
 {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v10 5/5] sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
  2022-09-22 18:00 [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
                   ` (3 preceding siblings ...)
  2022-09-22 18:00 ` [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races Waiman Long
@ 2022-09-22 18:00 ` Waiman Long
  2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Waiman Long
  2022-10-06 21:31 ` [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
  5 siblings, 1 reply; 25+ messages in thread
From: Waiman Long @ 2022-09-22 18:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Waiman Long

The do_set_cpus_allowed() function is used by either kthread_bind() or
select_fallback_rq(). In both cases the user affinity (if any) should be
destroyed too.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ce626cad4105..a5240c603667 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2599,14 +2599,20 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 		set_next_task(rq, p);
 }
 
+/*
+ * Used for kthread_bind() and select_fallback_rq(), in both cases the user
+ * affinity (if any) should be destroyed too.
+ */
 void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
 	struct affinity_context ac = {
 		.new_mask  = new_mask,
-		.flags     = 0,
+		.user_mask = NULL,
+		.flags     = SCA_USER,	/* clear the user requested mask */
 	};
 
 	__do_set_cpus_allowed(p, &ac);
+	kfree(ac.user_mask);
 }
 
 int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 0/5] sched: Persistent user requested affinity
  2022-09-22 18:00 [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
                   ` (4 preceding siblings ...)
  2022-09-22 18:00 ` [PATCH v10 5/5] sched: Always clear user_cpus_ptr in do_set_cpus_allowed() Waiman Long
@ 2022-10-06 21:31 ` Waiman Long
  5 siblings, 0 replies; 25+ messages in thread
From: Waiman Long @ 2022-10-06 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Lai Jiangshan, Juri Lelli,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, Tejun Heo, Zefan Li, Johannes Weiner,
	Will Deacon

On 9/22/22 14:00, Waiman Long wrote:
> v10:
>   - Squash patches 4, 5 & 6 together into a single patch as suggested
>     by PeterZ.
>
> v9:
>   - Fix non-SMP config build errors in patch 4 reported by kernel test
>     robot.
>   - Disable user_cpus_ptr masking also when any of the SCA_MIGRATE_*
>     flags are set.
>
> v8:
>   - Add patches "sched: Introduce affinity_context structure" and
>     "sched: Always clear user_cpus_ptr in do_set_cpus_allowed()" as
>     suggested by PeterZ.
>   - Better handle cpuset and sched_setaffinity() race in patch 5.
>
> v7:
>   - Make changes recommended by Peter such as making scratch_mask
>     allocation earlier and using percpu variable to pass information.
>
> The user_cpus_ptr field was added by commit b90ca8badbd1 ("sched:
> Introduce task_struct::user_cpus_ptr to track requested affinity"),
> which uses it narrowly to keep CPU affinity intact on asymmetric CPU
> setups.
>
> This patchset extends user_cpus_ptr to store the user requested CPU
> affinity set via the sched_setaffinity() API. With that information
> available, cpuset and other callers of set_cpus_allowed_ptr(), such as
> hotplug, can keep CPU affinity as close to what the user wants as
> possible within the CPU list constraint of the current cpuset. Without
> it, a change to the cpuset hierarchy or a hotplug event may reset the
> cpumask of the tasks in the affected cpusets to the default cpuset
> value even if those tasks had their CPU affinity explicitly set by the
> user beforehand.
>
> It also means that once sched_setaffinity() is called successfully,
> user_cpus_ptr will remain allocated until the task exits except in some
> rare circumstances.
>
> Waiman Long (5):
>    sched: Add __releases annotations to affine_move_task()
>    sched: Use user_cpus_ptr for saving user provided cpumask in
>      sched_setaffinity()
>    sched: Enforce user requested affinity
>    sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other
>      races
>    sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
>
>   kernel/sched/core.c     | 231 +++++++++++++++++++++++++---------------
>   kernel/sched/deadline.c |   7 +-
>   kernel/sched/sched.h    |  22 +++-
>   3 files changed, 169 insertions(+), 91 deletions(-)
>
Peter, do you have any further suggestions on how to improve this series?

Thanks,
Longman


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 3/5] sched: Enforce user requested affinity
  2022-09-22 18:00 ` [PATCH v10 3/5] sched: Enforce user requested affinity Waiman Long
@ 2022-10-07 10:01   ` Peter Zijlstra
  2022-10-07 14:57     ` Waiman Long
  2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Waiman Long
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2022-10-07 10:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Thu, Sep 22, 2022 at 02:00:39PM -0400, Waiman Long wrote:
> @@ -9647,6 +9656,9 @@ void __init sched_init(void)
>  			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>  		per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
>  			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
> +		per_cpu(runqueues.scratch_mask, i) =
> +			(cpumask_var_t)kzalloc_node(cpumask_size(),
> +						    GFP_KERNEL, cpu_to_node(i));
>  	}
>  #endif /* CONFIG_CPUMASK_OFFSTACK */
>  

That doesn't actually apply; I've made it:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9748,6 +9748,7 @@ void __init sched_init(void)
 
 		rq->core_cookie = 0UL;
 #endif
+		zalloc_cpumask_var_node(&per_cpu(runqueues.scratch_mask, i), GFP_KERNEL, cpu_to_node(i));
 	}
 
 	set_load_weight(&init_task, false);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races
  2022-09-22 18:00 ` [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races Waiman Long
@ 2022-10-07 12:47   ` Peter Zijlstra
  2022-10-07 18:59     ` Waiman Long
  2022-10-28  6:42   ` [tip: sched/core] sched: Introduce affinity_context tip-bot2 for Waiman Long
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2022-10-07 12:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On Thu, Sep 22, 2022 at 02:00:40PM -0400, Waiman Long wrote:
> Races are possible between set_cpus_allowed_ptr() and sched_setaffinity()
> or between multiple sched_setaffinity() calls from different CPUs. To
> resolve these race conditions, we need to update both user_cpus_ptr and
> cpus_mask in a single lock critical section instead of separate ones.
> This requires moving the user_cpus_ptr update into
> set_cpus_allowed_common() by putting the user_mask into a new
> affinity_context structure and using it to pass information around the
> various functions.
>
> This patch also changes the handling of the race between a
> sched_setaffinity() call and a change of the cpumask of the current
> cpuset. In case the new mask conflicts with the newly updated cpuset,
> the cpus_mask will be reset to the cpuset cpumask and an error value of
> -EINVAL will be returned. If a previous user_cpus_ptr value exists, it
> will be swapped back in and the new_mask will be further restricted to
> what is allowed in the cpumask pointed to by the old user_cpus_ptr.
>
> The potential race between sched_setaffinity() and a fork/clone()
> syscall calling dup_user_cpus_ptr() is also handled.

This is still arse-backwards... You're still fixing races you've
introduced earlier in the series.

Since I don't think telling you again is going to help; I've done it for
you :/ How's this then?

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/affinity



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 3/5] sched: Enforce user requested affinity
  2022-10-07 10:01   ` Peter Zijlstra
@ 2022-10-07 14:57     ` Waiman Long
  2022-10-07 15:23       ` Waiman Long
  0 siblings, 1 reply; 25+ messages in thread
From: Waiman Long @ 2022-10-07 14:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan


On 10/7/22 06:01, Peter Zijlstra wrote:
> On Thu, Sep 22, 2022 at 02:00:39PM -0400, Waiman Long wrote:
>> @@ -9647,6 +9656,9 @@ void __init sched_init(void)
>>   			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>   		per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
>>   			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>> +		per_cpu(runqueues.scratch_mask, i) =
>> +			(cpumask_var_t)kzalloc_node(cpumask_size(),
>> +						    GFP_KERNEL, cpu_to_node(i));
>>   	}
>>   #endif /* CONFIG_CPUMASK_OFFSTACK */
>>   
> That doesn't actually apply; I've made it:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9748,6 +9748,7 @@ void __init sched_init(void)
>   
>   		rq->core_cookie = 0UL;
>   #endif
> +		zalloc_cpumask_var_node(&per_cpu(runqueues.scratch_mask, i), GFP_KERNEL, cpu_to_node(i));
>   	}
>   
>   	set_load_weight(&init_task, false);
>
Sorry, I should have worked on the latest tip tree instead.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 3/5] sched: Enforce user requested affinity
  2022-10-07 14:57     ` Waiman Long
@ 2022-10-07 15:23       ` Waiman Long
  0 siblings, 0 replies; 25+ messages in thread
From: Waiman Long @ 2022-10-07 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On 10/7/22 10:57, Waiman Long wrote:
>
> On 10/7/22 06:01, Peter Zijlstra wrote:
>> On Thu, Sep 22, 2022 at 02:00:39PM -0400, Waiman Long wrote:
>>> @@ -9647,6 +9656,9 @@ void __init sched_init(void)
>>>               cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>>           per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
>>>               cpumask_size(), GFP_KERNEL, cpu_to_node(i));
>>> +        per_cpu(runqueues.scratch_mask, i) =
>>> +            (cpumask_var_t)kzalloc_node(cpumask_size(),
>>> +                            GFP_KERNEL, cpu_to_node(i));
>>>       }
>>>   #endif /* CONFIG_CPUMASK_OFFSTACK */
>> That doesn't actually apply; I've made it:
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9748,6 +9748,7 @@ void __init sched_init(void)
>>             rq->core_cookie = 0UL;
>>   #endif
>> +		zalloc_cpumask_var_node(&per_cpu(runqueues.scratch_mask, i), GFP_KERNEL, cpu_to_node(i));
>>       }
>>         set_load_weight(&init_task, false);
>>
> Sorry, I should have worked on the latest tip tree instead.

To be consistent with the surrounding context, it may be better to
change it to

+		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));

Cheers,
Longman


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races
  2022-10-07 12:47   ` Peter Zijlstra
@ 2022-10-07 18:59     ` Waiman Long
  0 siblings, 0 replies; 25+ messages in thread
From: Waiman Long @ 2022-10-07 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	Linus Torvalds, Lai Jiangshan

On 10/7/22 08:47, Peter Zijlstra wrote:
> On Thu, Sep 22, 2022 at 02:00:40PM -0400, Waiman Long wrote:
>> Races are possible between set_cpus_allowed_ptr() and sched_setaffinity()
>> or between multiple sched_setaffinity() calls from different CPUs. To
>> resolve these race conditions, we need to update both user_cpus_ptr and
>> cpus_mask in a single lock critical section instead of separate ones.
>> This requires moving the user_cpus_ptr update into
>> set_cpus_allowed_common() by putting the user_mask into a new
>> affinity_context structure and using it to pass information around the
>> various functions.
>>
>> This patch also changes the handling of the race between a
>> sched_setaffinity() call and a change of the cpumask of the current
>> cpuset. In case the new mask conflicts with the newly updated cpuset,
>> the cpus_mask will be reset to the cpuset cpumask and an error value of
>> -EINVAL will be returned. If a previous user_cpus_ptr value exists, it
>> will be swapped back in and the new_mask will be further restricted to
>> what is allowed in the cpumask pointed to by the old user_cpus_ptr.
>>
>> The potential race between sched_setaffinity() and a fork/clone()
>> syscall calling dup_user_cpus_ptr() is also handled.
> This is still arse-backwards... You're still fixing races you've
> introduced earlier in the series.
>
> Since I don't think telling you again is going to help; I've done it for
> you :/ How's this then?
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/affinity
>
Thank you very much for updating the patch series. Besides the minor nit
that I mentioned in the previous mail, the result looks good to me.
Do you mind if I send another patch on top of your branch to make the
adjustment, or do you want to do it yourself?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [tip: sched/core] sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
  2022-09-22 18:00 ` [PATCH v10 5/5] sched: Always clear user_cpus_ptr in do_set_cpus_allowed() Waiman Long
@ 2022-10-28  6:42   ` tip-bot2 for Waiman Long
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Waiman Long @ 2022-10-28  6:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra, Waiman Long, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     851a723e45d1c4c8f6f7b0d2cfbc5f53690bb4e9
Gitweb:        https://git.kernel.org/tip/851a723e45d1c4c8f6f7b0d2cfbc5f53690bb4e9
Author:        Waiman Long <longman@redhat.com>
AuthorDate:    Thu, 22 Sep 2022 14:00:41 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 27 Oct 2022 11:01:22 +02:00

sched: Always clear user_cpus_ptr in do_set_cpus_allowed()

The do_set_cpus_allowed() function is used by either kthread_bind() or
select_fallback_rq(). In both cases the user affinity (if any) should be
destroyed too.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-6-longman@redhat.com
---
 kernel/sched/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 283bdbd..87c9cdf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2593,14 +2593,20 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 		set_next_task(rq, p);
 }
 
+/*
+ * Used for kthread_bind() and select_fallback_rq(), in both cases the user
+ * affinity (if any) should be destroyed too.
+ */
 void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
 	struct affinity_context ac = {
 		.new_mask  = new_mask,
-		.flags     = 0,
+		.user_mask = NULL,
+		.flags     = SCA_USER,	/* clear the user requested mask */
 	};
 
 	__do_set_cpus_allowed(p, &ac);
+	kfree(ac.user_mask);
 }
 
 int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/core] sched: Enforce user requested affinity
  2022-09-22 18:00 ` [PATCH v10 3/5] sched: Enforce user requested affinity Waiman Long
  2022-10-07 10:01   ` Peter Zijlstra
@ 2022-10-28  6:42   ` tip-bot2 for Waiman Long
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot2 for Waiman Long @ 2022-10-28  6:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Waiman Long, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     da019032819a1f09943d3af676892ec8c627668e
Gitweb:        https://git.kernel.org/tip/da019032819a1f09943d3af676892ec8c627668e
Author:        Waiman Long <longman@redhat.com>
AuthorDate:    Thu, 22 Sep 2022 14:00:39 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 27 Oct 2022 11:01:22 +02:00

sched: Enforce user requested affinity

It was found that the user requested affinity via sched_setaffinity()
can be easily overwritten by other kernel subsystems without an easy
way to reset it back to what the user requested. For example, any
change to the current cpuset hierarchy may reset the cpumask of the
tasks in the affected cpusets to the default cpuset value even if those
tasks have pre-existing user requested affinity. That is especially
easy to trigger under a cgroup v2 environment where writing "+cpuset"
to the root cgroup's cgroup.subtree_control file will reset the CPU
affinity of all the processes in the system.

That is problematic in a nohz_full environment where the tasks running
on the nohz_full CPUs usually have their CPU affinity explicitly set
and will behave incorrectly if that affinity changes.

Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
and using it to restrict the given cpumask unless there is no overlap.
In that case, it will fall back to the given one. The SCA_USER flag is
reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr
masking should be skipped. In addition, masking should also be skipped
if any of the SCA_MIGRATE_* flags is set.

All callers of set_cpus_allowed_ptr() will be affected by this change.
A scratch cpumask is added to the per-CPU runqueues structure for doing
additional masking when user_cpus_ptr is set.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-4-longman@redhat.com
---
 kernel/sched/core.c  | 10 ++++++++++
 kernel/sched/sched.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67fb0e4..283bdbd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2949,6 +2949,15 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	struct rq *rq;
 
 	rq = task_rq_lock(p, &rf);
+	/*
+	 * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
+	 * flags are set.
+	 */
+	if (p->user_cpus_ptr &&
+	    !(ctx->flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
+	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
+		ctx->new_mask = rq->scratch_mask;
+
 	return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);
 }
 
@@ -9804,6 +9813,7 @@ void __init sched_init(void)
 
 		rq->core_cookie = 0UL;
 #endif
+		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04f571d..771f8dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1151,6 +1151,9 @@ struct rq {
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
 #endif
+
+	/* Scratch cpumask to be temporarily used under rq_lock */
+	cpumask_var_t		scratch_mask;
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
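
To see the intended behaviour from userspace, a throwaway test along these
lines can be used (not part of the series; CPU numbers are arbitrary and
assume a machine with at least 4 CPUs):

/* affinity-persist.c - observe that the requested mask is re-applied */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/*
	 * Now shrink and then restore cpuset.cpus of this task's cgroup from
	 * another shell.  With this patch the printed mask should come back
	 * to a subset of {2,3}; without it the task is left on the cpuset's
	 * default mask.
	 */
	for (;;) {
		char buf[256];
		size_t len = 0;

		if (sched_getaffinity(0, sizeof(set), &set))
			break;
		for (int cpu = 0; cpu < CPU_SETSIZE && len < sizeof(buf) - 8; cpu++)
			if (CPU_ISSET(cpu, &set))
				len += snprintf(buf + len, sizeof(buf) - len, "%d ", cpu);
		buf[len] = '\0';
		printf("allowed: %s\n", buf);
		sleep(5);
	}
	return 0;
}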

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/core] sched: Always preserve the user requested cpumask
  2022-09-22 18:00 ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
@ 2022-10-28  6:42   ` tip-bot2 for Waiman Long
  2023-01-17 16:08   ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Will Deacon
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot2 for Waiman Long @ 2022-10-28  6:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Waiman Long, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8f9ea86fdf99b81458cc21fc1c591fcd4a0fa1f4
Gitweb:        https://git.kernel.org/tip/8f9ea86fdf99b81458cc21fc1c591fcd4a0fa1f4
Author:        Waiman Long <longman@redhat.com>
AuthorDate:    Thu, 22 Sep 2022 14:00:38 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 27 Oct 2022 11:01:22 +02:00

sched: Always preserve the user requested cpumask

Unconditionally preserve the user requested cpumask on
sched_setaffinity() calls. This allows using it outside of the fairly
narrow restrict_cpus_allowed_ptr() use-case and fixes some cpuset issues
that currently suffer destruction of cpumasks.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-3-longman@redhat.com
---
 kernel/sched/core.c  | 119 ++++++++++++++++++++++--------------------
 kernel/sched/sched.h |   8 +++-
 2 files changed, 72 insertions(+), 55 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ad4e2e..67fb0e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2540,6 +2540,12 @@ void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx
 
 	cpumask_copy(&p->cpus_mask, ctx->new_mask);
 	p->nr_cpus_allowed = cpumask_weight(ctx->new_mask);
+
+	/*
+	 * Swap in a new user_cpus_ptr if SCA_USER flag set
+	 */
+	if (ctx->flags & SCA_USER)
+		swap(p->user_cpus_ptr, ctx->user_mask);
 }
 
 static void
@@ -2600,6 +2606,8 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
 		      int node)
 {
+	unsigned long flags;
+
 	if (!src->user_cpus_ptr)
 		return 0;
 
@@ -2607,7 +2615,10 @@ int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
 	if (!dst->user_cpus_ptr)
 		return -ENOMEM;
 
+	/* Use pi_lock to protect content of user_cpus_ptr */
+	raw_spin_lock_irqsave(&src->pi_lock, flags);
 	cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+	raw_spin_unlock_irqrestore(&src->pi_lock, flags);
 	return 0;
 }
 
@@ -2856,7 +2867,6 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
 	const struct cpumask *cpu_valid_mask = cpu_active_mask;
 	bool kthread = p->flags & PF_KTHREAD;
-	struct cpumask *user_mask = NULL;
 	unsigned int dest_cpu;
 	int ret = 0;
 
@@ -2915,14 +2925,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, ctx);
 
-	if (ctx->flags & SCA_USER)
-		user_mask = clear_user_cpus_ptr(p);
-
-	ret = affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
-
-	kfree(user_mask);
-
-	return ret;
+	return affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
 
 out:
 	task_rq_unlock(rq, p, rf);
@@ -2962,8 +2965,10 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
  * Change a given task's CPU affinity to the intersection of its current
- * affinity mask and @subset_mask, writing the resulting mask to @new_mask
- * and pointing @p->user_cpus_ptr to a copy of the old mask.
+ * affinity mask and @subset_mask, writing the resulting mask to @new_mask.
+ * If user_cpus_ptr is defined, use it as the basis for restricting CPU
+ * affinity or use cpu_online_mask instead.
+ *
  * If the resulting mask is empty, leave the affinity unchanged and return
  * -EINVAL.
  */
@@ -2971,18 +2976,14 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 				     struct cpumask *new_mask,
 				     const struct cpumask *subset_mask)
 {
-	struct cpumask *user_mask = NULL;
-	struct affinity_context ac;
+	struct affinity_context ac = {
+		.new_mask  = new_mask,
+		.flags     = 0,
+	};
 	struct rq_flags rf;
 	struct rq *rq;
 	int err;
 
-	if (!p->user_cpus_ptr) {
-		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
-		if (!user_mask)
-			return -ENOMEM;
-	}
-
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -2995,29 +2996,15 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 		goto err_unlock;
 	}
 
-	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
+	if (!cpumask_and(new_mask, task_user_cpus(p), subset_mask)) {
 		err = -EINVAL;
 		goto err_unlock;
 	}
 
-	/*
-	 * We're about to butcher the task affinity, so keep track of what
-	 * the user asked for in case we're able to restore it later on.
-	 */
-	if (user_mask) {
-		cpumask_copy(user_mask, p->cpus_ptr);
-		p->user_cpus_ptr = user_mask;
-	}
-
-	ac = (struct affinity_context){
-		.new_mask = new_mask,
-	};
-
 	return __set_cpus_allowed_ptr_locked(p, &ac, rq, &rf);
 
 err_unlock:
 	task_rq_unlock(rq, p, &rf);
-	kfree(user_mask);
 	return err;
 }
 
@@ -3071,33 +3058,25 @@ __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
- * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
- * @p->user_cpus_ptr.
+ * call to force_compatible_cpus_allowed_ptr().
  *
  * It is the caller's responsibility to serialise this with any calls to
  * force_compatible_cpus_allowed_ptr(@p).
  */
 void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 {
-	struct cpumask *user_mask = p->user_cpus_ptr;
 	struct affinity_context ac = {
-		.new_mask  = user_mask,
+		.new_mask  = task_user_cpus(p),
+		.flags     = 0,
 	};
-	unsigned long flags;
+	int ret;
 
 	/*
-	 * Try to restore the old affinity mask. If this fails, then
-	 * we free the mask explicitly to avoid it being inherited across
-	 * a subsequent fork().
+	 * Try to restore the old affinity mask with __sched_setaffinity().
+	 * Cpuset masking will be done there too.
 	 */
-	if (!user_mask || !__sched_setaffinity(p, &ac))
-		return;
-
-	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	user_mask = clear_user_cpus_ptr(p);
-	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-
-	kfree(user_mask);
+	ret = __sched_setaffinity(p, &ac);
+	WARN_ON_ONCE(ret);
 }
 
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
@@ -8136,7 +8115,7 @@ __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx)
 	retval = dl_task_check_affinity(p, new_mask);
 	if (retval)
 		goto out_free_new_mask;
-again:
+
 	retval = __set_cpus_allowed_ptr(p, ctx);
 	if (retval)
 		goto out_free_new_mask;
@@ -8148,7 +8127,24 @@ again:
 		 * Just reset the cpumask to the cpuset's cpus_allowed.
 		 */
 		cpumask_copy(new_mask, cpus_allowed);
-		goto again;
+
+		/*
+		 * If SCA_USER is set, a 2nd call to __set_cpus_allowed_ptr()
+		 * will restore the previous user_cpus_ptr value.
+		 *
+		 * In the unlikely event a previous user_cpus_ptr exists,
+		 * we need to further restrict the mask to what is allowed
+		 * by that old user_cpus_ptr.
+		 */
+		if (unlikely((ctx->flags & SCA_USER) && ctx->user_mask)) {
+			bool empty = !cpumask_and(new_mask, new_mask,
+						  ctx->user_mask);
+
+			if (WARN_ON_ONCE(empty))
+				cpumask_copy(new_mask, cpus_allowed);
+		}
+		__set_cpus_allowed_ptr(p, ctx);
+		retval = -EINVAL;
 	}
 
 out_free_new_mask:
@@ -8160,9 +8156,8 @@ out_free_cpus_allowed:
 
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
-	struct affinity_context ac = {
-		.new_mask = in_mask,
-	};
+	struct affinity_context ac;
+	struct cpumask *user_mask;
 	struct task_struct *p;
 	int retval;
 
@@ -8197,7 +8192,21 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_put_task;
 
+	user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
+	if (!user_mask) {
+		retval = -ENOMEM;
+		goto out_put_task;
+	}
+	cpumask_copy(user_mask, in_mask);
+	ac = (struct affinity_context){
+		.new_mask  = in_mask,
+		.user_mask = user_mask,
+		.flags     = SCA_USER,
+	};
+
 	retval = __sched_setaffinity(p, &ac);
+	kfree(ac.user_mask);
+
 out_put_task:
 	put_task_struct(p);
 	return retval;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6c91fb7..04f571d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1878,6 +1878,13 @@ static inline void dirty_sched_domain_sysctl(int cpu)
 #endif
 
 extern int sched_update_scaling(void);
+
+static inline const struct cpumask *task_user_cpus(struct task_struct *p)
+{
+	if (!p->user_cpus_ptr)
+		return cpu_possible_mask; /* &init_task.cpus_mask */
+	return p->user_cpus_ptr;
+}
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
@@ -2147,6 +2154,7 @@ extern const u32		sched_prio_to_wmult[40];
 
 struct affinity_context {
 	const struct cpumask *new_mask;
+	struct cpumask *user_mask;
 	unsigned int flags;
 };
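
One consequence worth spelling out: since user_cpus_ptr can now be
(re)assigned after fork via sched_setaffinity(), any code peeking at
another task's requested affinity has to stabilize it with p->pi_lock,
along the lines of (a sketch mirroring the dup_user_cpus_ptr() hunk above):

	unsigned long flags;

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	if (p->user_cpus_ptr)
		cpumask_copy(mask, p->user_cpus_ptr);
	else
		cpumask_copy(mask, cpu_possible_mask);	/* task_user_cpus() fallback */
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);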
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/core] sched: Introduce affinity_context
  2022-09-22 18:00 ` [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races Waiman Long
  2022-10-07 12:47   ` Peter Zijlstra
@ 2022-10-28  6:42   ` tip-bot2 for Waiman Long
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot2 for Waiman Long @ 2022-10-28  6:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra, Waiman Long, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     713a2e21a5137e96d2594f53d19784ffde3ddbd0
Gitweb:        https://git.kernel.org/tip/713a2e21a5137e96d2594f53d19784ffde3ddbd0
Author:        Waiman Long <longman@redhat.com>
AuthorDate:    Thu, 22 Sep 2022 14:00:40 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 27 Oct 2022 11:01:21 +02:00

sched: Introduce affinity_context

In order to prepare for passing additional data through the affinity
call-chains, convert the mask and flags arguments into a structure.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-5-longman@redhat.com
---
 kernel/sched/core.c     | 114 +++++++++++++++++++++++++--------------
 kernel/sched/deadline.c |   7 +--
 kernel/sched/sched.h    |  11 ++--
 3 files changed, 85 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f6f2807..5ad4e2e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2189,14 +2189,18 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 #ifdef CONFIG_SMP
 
 static void
-__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32 flags);
+__do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx);
 
 static int __set_cpus_allowed_ptr(struct task_struct *p,
-				  const struct cpumask *new_mask,
-				  u32 flags);
+				  struct affinity_context *ctx);
 
 static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
 {
+	struct affinity_context ac = {
+		.new_mask  = cpumask_of(rq->cpu),
+		.flags     = SCA_MIGRATE_DISABLE,
+	};
+
 	if (likely(!p->migration_disabled))
 		return;
 
@@ -2206,7 +2210,7 @@ static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
 	/*
 	 * Violates locking rules! see comment in __do_set_cpus_allowed().
 	 */
-	__do_set_cpus_allowed(p, cpumask_of(rq->cpu), SCA_MIGRATE_DISABLE);
+	__do_set_cpus_allowed(p, &ac);
 }
 
 void migrate_disable(void)
@@ -2228,6 +2232,10 @@ EXPORT_SYMBOL_GPL(migrate_disable);
 void migrate_enable(void)
 {
 	struct task_struct *p = current;
+	struct affinity_context ac = {
+		.new_mask  = &p->cpus_mask,
+		.flags     = SCA_MIGRATE_ENABLE,
+	};
 
 	if (p->migration_disabled > 1) {
 		p->migration_disabled--;
@@ -2243,7 +2251,7 @@ void migrate_enable(void)
 	 */
 	preempt_disable();
 	if (p->cpus_ptr != &p->cpus_mask)
-		__set_cpus_allowed_ptr(p, &p->cpus_mask, SCA_MIGRATE_ENABLE);
+		__set_cpus_allowed_ptr(p, &ac);
 	/*
 	 * Mustn't clear migration_disabled() until cpus_ptr points back at the
 	 * regular cpus_mask, otherwise things that race (eg.
@@ -2523,19 +2531,19 @@ out_unlock:
  * sched_class::set_cpus_allowed must do the below, but is not required to
  * actually call this function.
  */
-void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask, u32 flags)
+void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx)
 {
-	if (flags & (SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) {
-		p->cpus_ptr = new_mask;
+	if (ctx->flags & (SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) {
+		p->cpus_ptr = ctx->new_mask;
 		return;
 	}
 
-	cpumask_copy(&p->cpus_mask, new_mask);
-	p->nr_cpus_allowed = cpumask_weight(new_mask);
+	cpumask_copy(&p->cpus_mask, ctx->new_mask);
+	p->nr_cpus_allowed = cpumask_weight(ctx->new_mask);
 }
 
 static void
-__do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32 flags)
+__do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 {
 	struct rq *rq = task_rq(p);
 	bool queued, running;
@@ -2552,7 +2560,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 	 *
 	 * XXX do further audits, this smells like something putrid.
 	 */
-	if (flags & SCA_MIGRATE_DISABLE)
+	if (ctx->flags & SCA_MIGRATE_DISABLE)
 		SCHED_WARN_ON(!p->on_cpu);
 	else
 		lockdep_assert_held(&p->pi_lock);
@@ -2571,7 +2579,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 	if (running)
 		put_prev_task(rq, p);
 
-	p->sched_class->set_cpus_allowed(p, new_mask, flags);
+	p->sched_class->set_cpus_allowed(p, ctx);
 
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
@@ -2581,7 +2589,12 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 
 void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
-	__do_set_cpus_allowed(p, new_mask, 0);
+	struct affinity_context ac = {
+		.new_mask  = new_mask,
+		.flags     = 0,
+	};
+
+	__do_set_cpus_allowed(p, &ac);
 }
 
 int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
@@ -2834,8 +2847,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
  * Called with both p->pi_lock and rq->lock held; drops both before returning.
  */
 static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
-					 const struct cpumask *new_mask,
-					 u32 flags,
+					 struct affinity_context *ctx,
 					 struct rq *rq,
 					 struct rq_flags *rf)
 	__releases(rq->lock)
@@ -2864,7 +2876,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 		cpu_valid_mask = cpu_online_mask;
 	}
 
-	if (!kthread && !cpumask_subset(new_mask, cpu_allowed_mask)) {
+	if (!kthread && !cpumask_subset(ctx->new_mask, cpu_allowed_mask)) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -2873,18 +2885,18 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	 * Must re-check here, to close a race against __kthread_bind(),
 	 * sched_setaffinity() is not guaranteed to observe the flag.
 	 */
-	if ((flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
+	if ((ctx->flags & SCA_CHECK) && (p->flags & PF_NO_SETAFFINITY)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	if (!(flags & SCA_MIGRATE_ENABLE)) {
-		if (cpumask_equal(&p->cpus_mask, new_mask))
+	if (!(ctx->flags & SCA_MIGRATE_ENABLE)) {
+		if (cpumask_equal(&p->cpus_mask, ctx->new_mask))
 			goto out;
 
 		if (WARN_ON_ONCE(p == current &&
 				 is_migration_disabled(p) &&
-				 !cpumask_test_cpu(task_cpu(p), new_mask))) {
+				 !cpumask_test_cpu(task_cpu(p), ctx->new_mask))) {
 			ret = -EBUSY;
 			goto out;
 		}
@@ -2895,18 +2907,18 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 	 * for groups of tasks (ie. cpuset), so that load balancing is not
 	 * immediately required to distribute the tasks within their new mask.
 	 */
-	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, new_mask);
+	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
 	if (dest_cpu >= nr_cpu_ids) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	__do_set_cpus_allowed(p, new_mask, flags);
+	__do_set_cpus_allowed(p, ctx);
 
-	if (flags & SCA_USER)
+	if (ctx->flags & SCA_USER)
 		user_mask = clear_user_cpus_ptr(p);
 
-	ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+	ret = affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
 
 	kfree(user_mask);
 
@@ -2928,18 +2940,23 @@ out:
  * call is not atomic; no spinlocks may be held.
  */
 static int __set_cpus_allowed_ptr(struct task_struct *p,
-				  const struct cpumask *new_mask, u32 flags)
+				  struct affinity_context *ctx)
 {
 	struct rq_flags rf;
 	struct rq *rq;
 
 	rq = task_rq_lock(p, &rf);
-	return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
+	return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);
 }
 
 int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 {
-	return __set_cpus_allowed_ptr(p, new_mask, 0);
+	struct affinity_context ac = {
+		.new_mask  = new_mask,
+		.flags     = 0,
+	};
+
+	return __set_cpus_allowed_ptr(p, &ac);
 }
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
@@ -2955,6 +2972,7 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 				     const struct cpumask *subset_mask)
 {
 	struct cpumask *user_mask = NULL;
+	struct affinity_context ac;
 	struct rq_flags rf;
 	struct rq *rq;
 	int err;
@@ -2991,7 +3009,11 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 		p->user_cpus_ptr = user_mask;
 	}
 
-	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
+	ac = (struct affinity_context){
+		.new_mask = new_mask,
+	};
+
+	return __set_cpus_allowed_ptr_locked(p, &ac, rq, &rf);
 
 err_unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -3045,7 +3067,7 @@ out_free_mask:
 }
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
+__sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
@@ -3058,6 +3080,9 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
 void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 {
 	struct cpumask *user_mask = p->user_cpus_ptr;
+	struct affinity_context ac = {
+		.new_mask  = user_mask,
+	};
 	unsigned long flags;
 
 	/*
@@ -3065,7 +3090,7 @@ void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 	 * we free the mask explicitly to avoid it being inherited across
 	 * a subsequent fork().
 	 */
-	if (!user_mask || !__sched_setaffinity(p, user_mask))
+	if (!user_mask || !__sched_setaffinity(p, &ac))
 		return;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
@@ -3550,10 +3575,9 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
 #else /* CONFIG_SMP */
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
-					 const struct cpumask *new_mask,
-					 u32 flags)
+					 struct affinity_context *ctx)
 {
-	return set_cpus_allowed_ptr(p, new_mask);
+	return set_cpus_allowed_ptr(p, ctx->new_mask);
 }
 
 static inline void migrate_disable_switch(struct rq *rq, struct task_struct *p) { }
@@ -8090,7 +8114,7 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
 #endif
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
+__sched_setaffinity(struct task_struct *p, struct affinity_context *ctx)
 {
 	int retval;
 	cpumask_var_t cpus_allowed, new_mask;
@@ -8104,13 +8128,16 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	}
 
 	cpuset_cpus_allowed(p, cpus_allowed);
-	cpumask_and(new_mask, mask, cpus_allowed);
+	cpumask_and(new_mask, ctx->new_mask, cpus_allowed);
+
+	ctx->new_mask = new_mask;
+	ctx->flags |= SCA_CHECK;
 
 	retval = dl_task_check_affinity(p, new_mask);
 	if (retval)
 		goto out_free_new_mask;
 again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
+	retval = __set_cpus_allowed_ptr(p, ctx);
 	if (retval)
 		goto out_free_new_mask;
 
@@ -8133,6 +8160,9 @@ out_free_cpus_allowed:
 
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
+	struct affinity_context ac = {
+		.new_mask = in_mask,
+	};
 	struct task_struct *p;
 	int retval;
 
@@ -8167,7 +8197,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_put_task;
 
-	retval = __sched_setaffinity(p, in_mask);
+	retval = __sched_setaffinity(p, &ac);
 out_put_task:
 	put_task_struct(p);
 	return retval;
@@ -8948,6 +8978,12 @@ void show_state_filter(unsigned int state_filter)
  */
 void __init init_idle(struct task_struct *idle, int cpu)
 {
+#ifdef CONFIG_SMP
+	struct affinity_context ac = (struct affinity_context) {
+		.new_mask  = cpumask_of(cpu),
+		.flags     = 0,
+	};
+#endif
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
@@ -8972,7 +9008,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
 	 *
 	 * And since this is boot we can forgo the serialization.
 	 */
-	set_cpus_allowed_common(idle, cpumask_of(cpu), 0);
+	set_cpus_allowed_common(idle, &ac);
 #endif
 	/*
 	 * We're having a chicken and egg problem, even though we are
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9ae8f41..0d97d54 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2485,8 +2485,7 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
 }
 
 static void set_cpus_allowed_dl(struct task_struct *p,
-				const struct cpumask *new_mask,
-				u32 flags)
+				struct affinity_context *ctx)
 {
 	struct root_domain *src_rd;
 	struct rq *rq;
@@ -2501,7 +2500,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 	 * update. We already made space for us in the destination
 	 * domain (see cpuset_can_attach()).
 	 */
-	if (!cpumask_intersects(src_rd->span, new_mask)) {
+	if (!cpumask_intersects(src_rd->span, ctx->new_mask)) {
 		struct dl_bw *src_dl_b;
 
 		src_dl_b = dl_bw_of(cpu_of(rq));
@@ -2515,7 +2514,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 		raw_spin_unlock(&src_dl_b->lock);
 	}
 
-	set_cpus_allowed_common(p, new_mask, flags);
+	set_cpus_allowed_common(p, ctx);
 }
 
 /* Assumes rq->lock is held */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f18460..6c91fb7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2145,6 +2145,11 @@ extern const u32		sched_prio_to_wmult[40];
 
 #define RETRY_TASK		((void *)-1UL)
 
+struct affinity_context {
+	const struct cpumask *new_mask;
+	unsigned int flags;
+};
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -2173,9 +2178,7 @@ struct sched_class {
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
 
-	void (*set_cpus_allowed)(struct task_struct *p,
-				 const struct cpumask *newmask,
-				 u32 flags);
+	void (*set_cpus_allowed)(struct task_struct *p, struct affinity_context *ctx);
 
 	void (*rq_online)(struct rq *rq);
 	void (*rq_offline)(struct rq *rq);
@@ -2286,7 +2289,7 @@ extern void update_group_capacity(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
 
-extern void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask, u32 flags);
+extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx);
 
 static inline struct task_struct *get_push_task(struct rq *rq)
 {

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/core] sched: Add __releases annotations to affine_move_task()
  2022-09-22 18:00 ` [PATCH v10 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
@ 2022-10-28  6:42   ` tip-bot2 for Waiman Long
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Waiman Long @ 2022-10-28  6:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Waiman Long, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5584e8ac2c68280e5ac31d231c23cdb7dfa225db
Gitweb:        https://git.kernel.org/tip/5584e8ac2c68280e5ac31d231c23cdb7dfa225db
Author:        Waiman Long <longman@redhat.com>
AuthorDate:    Thu, 22 Sep 2022 14:00:37 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 27 Oct 2022 11:01:21 +02:00

sched: Add __releases annotations to affine_move_task()

affine_move_task() assumes task_rq_lock() has been called and it does
an implicit task_rq_unlock() before returning. Add the appropriate
__releases annotations to make this clear.

A typo in a comment is also fixed.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220922180041.1768141-2-longman@redhat.com
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 069da4a..f6f2807 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2690,6 +2690,8 @@ void release_user_cpus_ptr(struct task_struct *p)
  */
 static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flags *rf,
 			    int dest_cpu, unsigned int flags)
+	__releases(rq->lock)
+	__releases(p->pi_lock)
 {
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
 	bool stop_pending, complete = false;
@@ -2999,7 +3001,7 @@ err_unlock:
 
 /*
  * Restrict the CPU affinity of task @p so that it is a subset of
- * task_cpu_possible_mask() and point @p->user_cpu_ptr to a copy of the
+ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
  * old affinity mask. If the resulting mask is empty, we warn and walk
  * up the cpuset hierarchy until we find a suitable mask.
  */
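
For anyone unfamiliar with these annotations: __acquires()/__releases()
are sparse attributes (no-ops for the compiler) that let "make C=1" check
that lock acquire/release stays balanced across function boundaries. A
made-up example, not taken from the kernel sources:

	static void take(spinlock_t *lock) __acquires(lock)
	{
		spin_lock(lock);
	}

	static void drop(spinlock_t *lock) __releases(lock)
	{
		spin_unlock(lock);
	}

Without the annotations, sparse warns about a context imbalance in both
functions.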

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2022-09-22 18:00 ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
  2022-10-28  6:42   ` [tip: sched/core] sched: Always preserve the user requested cpumask tip-bot2 for Waiman Long
@ 2023-01-17 16:08   ` Will Deacon
  2023-01-17 18:13     ` Waiman Long
                       ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Will Deacon @ 2023-01-17 16:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret

Hi Waiman,

On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
> is currently used only by arm64 arch due to possible asymmetric CPU
> setup. This patch extends its usage to save user provided cpumask
> when sched_setaffinity() is called for all arches. With this patch
> applied, user_cpus_ptr, once allocated after a successful call to
> sched_setaffinity(), will only be freed when the task exits.
> 
> Since user_cpus_ptr is supposed to be used for "requested
> affinity", there is actually no point to save current cpu affinity in
> restrict_cpus_allowed_ptr() if sched_setaffinity() has never been called.
> Modify the logic to set user_cpus_ptr only in sched_setaffinity() and use
> it in restrict_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
> if defined but not changing it.
> 
> This will be some changes in behavior for arm64 systems with asymmetric
> CPUs in some corner cases. For instance, if sched_setaffinity()
> has never been called and there is a cpuset change before
> relax_compatible_cpus_allowed_ptr() is called, its subsequent call will
> follow what the cpuset allows but not what the previous cpu affinity
> setting allows.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/sched/core.c  | 82 ++++++++++++++++++++------------------------
>  kernel/sched/sched.h |  7 ++++
>  2 files changed, 44 insertions(+), 45 deletions(-)

We've tracked this down as the cause of an arm64 regression in Android and I've
reproduced the issue with mainline.

Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
the command-line, then the arch code will (amongst other things) call
force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
when exec()'ing a 32-bit or a 64-bit task respectively.
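
For reference, the hook in question boils down to roughly the following
(paraphrased, not a verbatim copy of arch/arm64/kernel/process.c):

	void arch_setup_new_exec(void)
	{
		...
		if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
			if (is_compat_task())
				force_compatible_cpus_allowed_ptr(current);
			else
				relax_compatible_cpus_allowed_ptr(current);
		}
		...
	}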

If you consider a system where everything is 64-bit but the cmdline option
above is present, then the call to relax_compatible_cpus_allowed_ptr() isn't
expected to do anything in this case, and the old code made sure of that:

> @@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
>  
>  /*
>   * Restore the affinity of a task @p which was previously restricted by a
> - * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
> - * @p->user_cpus_ptr.
> + * call to force_compatible_cpus_allowed_ptr().
>   *
>   * It is the caller's responsibility to serialise this with any calls to
>   * force_compatible_cpus_allowed_ptr(@p).
>   */
>  void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>  {
> -	struct cpumask *user_mask = p->user_cpus_ptr;
> -	unsigned long flags;
> +	int ret;
>  
>  	/*
> -	 * Try to restore the old affinity mask. If this fails, then
> -	 * we free the mask explicitly to avoid it being inherited across
> -	 * a subsequent fork().
> +	 * Try to restore the old affinity mask with __sched_setaffinity().
> +	 * Cpuset masking will be done there too.
>  	 */
> -	if (!user_mask || !__sched_setaffinity(p, user_mask))
> -		return;

... since it returned early here if '!user_mask' ...

> -
> -	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	user_mask = clear_user_cpus_ptr(p);
> -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> -
> -	kfree(user_mask);
> +	ret = __sched_setaffinity(p, task_user_cpus(p));
> +	WARN_ON_ONCE(ret);

... however, now we end up going down into __sched_setaffinity() with
task_user_cpus(p) giving us the 'cpu_possible_mask'! This can lead to a mixture
of WARN_ON()s and incorrect affinity masks (for example, a newly exec'd task
ends up with the affinity mask of the online CPUs at the point of exec() and is
unable to run on anything onlined later).

I've had a crack at fixing the code above to restore the old behaviour, and it
seems to work for my basic tests (still pending confirmation from others):


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bb1ee6d7bdde..0d4a11384648 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3125,17 +3125,16 @@ __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
 void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 {
        struct affinity_context ac = {
-               .new_mask  = task_user_cpus(p),
+               .new_mask  = p->user_cpus_ptr,
                .flags     = 0,
        };
-       int ret;
 
        /*
         * Try to restore the old affinity mask with __sched_setaffinity().
         * Cpuset masking will be done there too.
         */
-       ret = __sched_setaffinity(p, &ac);
-       WARN_ON_ONCE(ret);
+       if (ac.new_mask)
+               WARN_ON_ONCE(__sched_setaffinity(p, &ac));
 }
 
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)


With this change, task_user_cpus() is only used by restrict_cpus_allowed_ptr()
so I'd be inclined to remove it altogether tbh.

What do you think?

Will

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2023-01-17 16:08   ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Will Deacon
@ 2023-01-17 18:13     ` Waiman Long
  2023-01-20 17:59       ` Will Deacon
  2023-01-26 12:52     ` Linux kernel regression tracking (#adding)
  2023-01-27 18:36     ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Waiman Long @ 2023-01-17 18:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret

On 1/17/23 11:08, Will Deacon wrote:
> Hi Waiman,
>
> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>> is currently used only by arm64 arch due to possible asymmetric CPU
>> setup. This patch extends its usage to save user provided cpumask
>> when sched_setaffinity() is called for all arches. With this patch
>> applied, user_cpus_ptr, once allocated after a successful call to
>> sched_setaffinity(), will only be freed when the task exits.
>>
>> Since user_cpus_ptr is supposed to be used for "requested
>> affinity", there is actually no point to save current cpu affinity in
>> restrict_cpus_allowed_ptr() if sched_setaffinity() has never been called.
>> Modify the logic to set user_cpus_ptr only in sched_setaffinity() and use
>> it in restrict_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>> if defined but not changing it.
>>
>> This will be some changes in behavior for arm64 systems with asymmetric
>> CPUs in some corner cases. For instance, if sched_setaffinity()
>> has never been called and there is a cpuset change before
>> relax_compatible_cpus_allowed_ptr() is called, its subsequent call will
>> follow what the cpuset allows but not what the previous cpu affinity
>> setting allows.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/sched/core.c  | 82 ++++++++++++++++++++------------------------
>>   kernel/sched/sched.h |  7 ++++
>>   2 files changed, 44 insertions(+), 45 deletions(-)
> We've tracked this down as the cause of an arm64 regression in Android and I've
> reproduced the issue with mainline.
>
> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
> the command-line, then the arch code will (amongst other things) call
> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
> when exec()'ing a 32-bit or a 64-bit task respectively.

IOW, relax_compatible_cpus_allowed_ptr() can be called without a 
previous force_compatible_cpus_allowed_ptr(). Right?

A possible optimization in this case is to add a bit flag in the 
task_struct to indicate a previous call to 
force_compatible_cpus_allowed_ptr(). Without that flag set, 
relax_compatible_cpus_allowed_ptr() can return immediately.

>
> If you consider a system where everything is 64-bit but the cmdline option
> above is present, then the call to relax_compatible_cpus_allowed_ptr() isn't
> expected to do anything in this case, and the old code made sure of that:
>
>> @@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
>>   
>>   /*
>>    * Restore the affinity of a task @p which was previously restricted by a
>> - * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
>> - * @p->user_cpus_ptr.
>> + * call to force_compatible_cpus_allowed_ptr().
>>    *
>>    * It is the caller's responsibility to serialise this with any calls to
>>    * force_compatible_cpus_allowed_ptr(@p).
>>    */
>>   void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>>   {
>> -	struct cpumask *user_mask = p->user_cpus_ptr;
>> -	unsigned long flags;
>> +	int ret;
>>   
>>   	/*
>> -	 * Try to restore the old affinity mask. If this fails, then
>> -	 * we free the mask explicitly to avoid it being inherited across
>> -	 * a subsequent fork().
>> +	 * Try to restore the old affinity mask with __sched_setaffinity().
>> +	 * Cpuset masking will be done there too.
>>   	 */
>> -	if (!user_mask || !__sched_setaffinity(p, user_mask))
>> -		return;
> ... since it returned early here if '!user_mask' ...
The flag bit will work like the user_mask check here.
>
>> -
>> -	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> -	user_mask = clear_user_cpus_ptr(p);
>> -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>> -
>> -	kfree(user_mask);
>> +	ret = __sched_setaffinity(p, task_user_cpus(p));
>> +	WARN_ON_ONCE(ret);
> ... however, now we end up going down into __sched_setaffinity() with
> task_user_cpus(p) giving us the 'cpu_possible_mask'! This can lead to a mixture
> of WARN_ON()s and incorrect affinity masks (for example, a newly exec'd task
> ends up with the affinity mask of the online CPUs at the point of exec() and is
> unable to run on anything onlined later).

CPU hotplug should update the cpumask of existing running applications as
allowed by cpuset.


>
> I've had a crack at fixing the code above to restore the old behaviour, and it
> seems to work for my basic tests (still pending confirmation from others):
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bb1ee6d7bdde..0d4a11384648 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3125,17 +3125,16 @@ __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
>   void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>   {
>          struct affinity_context ac = {
> -               .new_mask  = task_user_cpus(p),
> +               .new_mask  = p->user_cpus_ptr,
>                  .flags     = 0,
>          };
> -       int ret;
>   
>          /*
>           * Try to restore the old affinity mask with __sched_setaffinity().
>           * Cpuset masking will be done there too.
>           */
> -       ret = __sched_setaffinity(p, &ac);
> -       WARN_ON_ONCE(ret);
> +       if (ac.new_mask)
> +               WARN_ON_ONCE(__sched_setaffinity(p, &ac));
>   }
>   
>   void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>
>
> With this change, task_user_cpus() is only used by restrict_cpus_allowed_ptr()
> so I'd be inclined to remove it altogether tbh.
>
> What do you think?

The problem here is that force_compatible_cpus_allowed_ptr() can be
called without a matching relax_compatible_cpus_allowed_ptr() at the
end. So we may end up artificially restricting the number of cpus that
can be used when running a 64-bit binary.

What do you think about the idea of having a bit flag to track that?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2023-01-17 18:13     ` Waiman Long
@ 2023-01-20 17:59       ` Will Deacon
  2023-01-20 18:10         ` Waiman Long
  0 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2023-01-20 17:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret

Hey Waiman,

Cheers for the quick reply.

On Tue, Jan 17, 2023 at 01:13:31PM -0500, Waiman Long wrote:
> On 1/17/23 11:08, Will Deacon wrote:
> > On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
> > > The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
> > > Introduce task_struct::user_cpus_ptr to track requested affinity"). It
> > > is currently used only by arm64 arch due to possible asymmetric CPU
> > > setup. This patch extends its usage to save user provided cpumask
> > > when sched_setaffinity() is called for all arches. With this patch
> > > applied, user_cpus_ptr, once allocated after a successful call to
> > > sched_setaffinity(), will only be freed when the task exits.

[...]

> > We've tracked this down as the cause of an arm64 regression in Android and I've
> > reproduced the issue with mainline.
> > 
> > Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
> > the command-line, then the arch code will (amongst other things) call
> > force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
> > when exec()'ing a 32-bit or a 64-bit task respectively.
> 
> IOW, relax_compatible_cpus_allowed_ptr() can be called without a previous
> force_compatible_cpus_allowed_ptr(). Right?

In practice, these functions are only called by arm64 during exec. As above,
exec()'ing a 32-bit task calls force_compatible_cpus_allowed_ptr() and
exec()'ing a 64-bit task calls relax_compatible_cpus_allowed_ptr(). So
they don't come in pairs at all; it's just that calling relax_[...] should
try to restore the affinity mask if it was previously clobbered by
force_[...].

> A possible optimization in this case is to add a bit flag in the task_struct
> to indicate a previous call to force_compatible_cpus_allowed_ptr(). Without
> that flag set, relax_compatible_cpus_allowed_ptr() can return immediately.

How is this an optimisation over a pointer comparison?

> > I've had a crack at fixing the code above to restore the old behaviour, and it
> > seems to work for my basic tests (still pending confirmation from others):
> > 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bb1ee6d7bdde..0d4a11384648 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3125,17 +3125,16 @@ __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
> >   void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
> >   {
> >          struct affinity_context ac = {
> > -               .new_mask  = task_user_cpus(p),
> > +               .new_mask  = p->user_cpus_ptr,
> >                  .flags     = 0,
> >          };
> > -       int ret;
> >          /*
> >           * Try to restore the old affinity mask with __sched_setaffinity().
> >           * Cpuset masking will be done there too.
> >           */
> > -       ret = __sched_setaffinity(p, &ac);
> > -       WARN_ON_ONCE(ret);
> > +       if (ac.new_mask)
> > +               WARN_ON_ONCE(__sched_setaffinity(p, &ac));
> >   }
> >   void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > 
> > 
> > With this change, task_user_cpus() is only used by restrict_cpus_allowed_ptr()
> > so I'd be inclined to remove it altogether tbh.
> > 
> > What do you think?
> 
> The problem here is that force_compatible_cpus_allowed_ptr() can be called
> without a matching relax_compatible_cpus_allowed_ptr() at the end. So we may
> end up artificially restrict the number of cpus that can be used when
> running a 64-bit binary.

Hmm, is this because an intervening call to sched_setaffinity() could've
set ->user_cpus_ptr? If so, I'd have thought that would also point to a
superset of the effective affinity -- is that not the case?

> What do you think about the idea of having a bit flag to track that?

I'm not hugely happy with that approach because it's adding additional state
which is only needed for arm64, and only when operating in this funny
asymmetric mode. I also don't understand how it would interact with the new
sched_setaffinity() behaviour; would we need to clear the flag when that
function updates the mask?

Since I'm basically trying to re-instate the v6.1 behaviour to fix the arm64
regression, I'm happy to review/test any proposal you have, but as we get
closer to the 6.2 release I'm wondering whether it would make more sense to
revert the sched_setaffinity() changes for now and I can help you with arm64
review and testing if we bring the changes back for e.g. 6.4.

Will

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2023-01-20 17:59       ` Will Deacon
@ 2023-01-20 18:10         ` Waiman Long
  0 siblings, 0 replies; 25+ messages in thread
From: Waiman Long @ 2023-01-20 18:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret

On 1/20/23 12:59, Will Deacon wrote:
> Hey Waiman,
>
> Cheers for the quick reply.
>
> On Tue, Jan 17, 2023 at 01:13:31PM -0500, Waiman Long wrote:
>> On 1/17/23 11:08, Will Deacon wrote:
>>> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>>>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>>>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>>>> is currently used only by arm64 arch due to possible asymmetric CPU
>>>> setup. This patch extends its usage to save user provided cpumask
>>>> when sched_setaffinity() is called for all arches. With this patch
>>>> applied, user_cpus_ptr, once allocated after a successful call to
>>>> sched_setaffinity(), will only be freed when the task exits.
> [...]
>
>>> We've tracked this down as the cause of an arm64 regression in Android and I've
>>> reproduced the issue with mainline.
>>>
>>> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
>>> the command-line, then the arch code will (amongst other things) call
>>> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>>> when exec()'ing a 32-bit or a 64-bit task respectively.
>> IOW, relax_compatible_cpus_allowed_ptr() can be called without a previous
>> force_compatible_cpus_allowed_ptr(). Right?
> In practice, these functions are only called by arm64 during exec. As above,
> exec()'ing a 32-bit task calls force_compatible_cpus_allowed_ptr() and
> exec()'ing a 64-bit task calls relax_compatible_cpus_allowed_ptr(). So
> they don't come in pairs at all; it's just that calling relax_[...] should
> try to restore the affinity mask if it was previously clobbered by
> force_[...].
>
That was what I thought.
>> A possible optimization in this case is to add a bit flag in the task_struct
>> to indicate a previous call to force_compatible_cpus_allowed_ptr(). Without
>> that flag set, relax_compatible_cpus_allowed_ptr() can return immediately.
> How is this an optimisation over a pointer comparison?

The sched_setaffinity() patch had repurposed user_cpus_ptr as a user
requested cpu affinity mask irrespective of whether
force_compatible_cpus_allowed_ptr() has been called or not. So checking
against user_cpus_ptr no longer serves its purpose as an indicator of
whether force_compatible_cpus_allowed_ptr() has been called.


>>> I've had a crack at fixing the code above to restore the old behaviour, and it
>>> seems to work for my basic tests (still pending confirmation from others):
>>>
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index bb1ee6d7bdde..0d4a11384648 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3125,17 +3125,16 @@ __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
>>>    void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>>>    {
>>>           struct affinity_context ac = {
>>> -               .new_mask  = task_user_cpus(p),
>>> +               .new_mask  = p->user_cpus_ptr,
>>>                   .flags     = 0,
>>>           };
>>> -       int ret;
>>>           /*
>>>            * Try to restore the old affinity mask with __sched_setaffinity().
>>>            * Cpuset masking will be done there too.
>>>            */
>>> -       ret = __sched_setaffinity(p, &ac);
>>> -       WARN_ON_ONCE(ret);
>>> +       if (ac.new_mask)
>>> +               WARN_ON_ONCE(__sched_setaffinity(p, &ac));
>>>    }
>>>    void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>>
>>>
>>> With this change, task_user_cpus() is only used by restrict_cpus_allowed_ptr()
>>> so I'd be inclined to remove it altogether tbh.
>>>
>>> What do you think?
>> The problem here is that force_compatible_cpus_allowed_ptr() can be called
>> without a matching relax_compatible_cpus_allowed_ptr() at the end. So we may
>> end up artificially restrict the number of cpus that can be used when
>> running a 64-bit binary.
> Hmm, is this because an intervening call to sched_setaffinity() could've
> set ->user_cpus_ptr? If so, I'd have thought that would also point to a
> superset of the effective affinity -- is that not the case?
>
>> What do you think about the idea of having a bit flag to track that?
> I'm not hugely happy with that approach because it's adding additional state
> which is only needed for arm64, and only when operating in this funny
> asymmetric mode. I also don't understand how it would interact with the new
> sched_setaffinity() behaviour; would we need to clear the flag when that
> function updates the mask?
The new flag bit will be independent of any sched_setaffinity() call. It is
set when restrict_cpus_allowed_ptr() is called and cleared in
relax_compatible_cpus_allowed_ptr() if it was set before. I will post a
patch for your evaluation.
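
Roughly what I have in mind (flag name hypothetical, untested sketch):

	/* in restrict_cpus_allowed_ptr(), once the new mask is applied */
	p->cpus_allowed_restricted = 1;		/* new bit in task_struct */

	/* in relax_compatible_cpus_allowed_ptr() */
	if (!p->cpus_allowed_restricted)
		return;				/* nothing to undo, as in v6.1 */
	p->cpus_allowed_restricted = 0;
	/* ... then fall through to __sched_setaffinity(p, &ac) as today */
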
>
> Since I'm basically trying to re-instate the v6.1 behaviour to fix the arm64
> regression, I'm happy to review/test any proposal you have, but as we get
> closer to the 6.2 release I'm wondering whether it would make more sense to
> revert the sched_setaffinity() changes for now and I can help you with arm64
> review and testing if we bring the changes back for e.g. 6.4.

The purpose of the bit flag is to reinstate 6.1 behavior.

Cheers,
Longman

>
> Will
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2023-01-17 16:08   ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Will Deacon
  2023-01-17 18:13     ` Waiman Long
@ 2023-01-26 12:52     ` Linux kernel regression tracking (#adding)
  2023-02-10 17:15       ` Linux kernel regression tracking (#update)
  2023-01-27 18:36     ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Linux kernel regression tracking (#adding) @ 2023-01-26 12:52 UTC (permalink / raw)
  To: Will Deacon, Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 17.01.23 17:08, Will Deacon wrote:
> 
> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>> is currently used only by arm64 arch due to possible asymmetric CPU
>> setup. This patch extends its usage to save user provided cpumask
>> when sched_setaffinity() is called for all arches. With this patch
>> applied, user_cpus_ptr, once allocated after a successful call to
>> sched_setaffinity(), will only be freed when the task exits.
> [...]
> We've tracked this down as the cause of an arm64 regression in Android and I've
> reproduced the issue with mainline.
> 
> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
> the command-line, then the arch code will (amongst other things) call
> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
> when exec()'ing a 32-bit or a 64-bit task respectively.
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 8f9ea86fdf99
#regzbot title sched: incorrectly set affinity masks and performance
regression on arm64
#regzbot monitor:
https://lore.kernel.org/all/20230121021749.55313-1-longman@redhat.com/
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2023-01-17 16:08   ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Will Deacon
  2023-01-17 18:13     ` Waiman Long
  2023-01-26 12:52     ` Linux kernel regression tracking (#adding)
@ 2023-01-27 18:36     ` Peter Zijlstra
  2023-01-27 19:09       ` Waiman Long
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-01-27 18:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret

On Tue, Jan 17, 2023 at 04:08:26PM +0000, Will Deacon wrote:
> Hi Waiman,
> 
> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
> > The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
> > Introduce task_struct::user_cpus_ptr to track requested affinity"). It
> > is currently used only by arm64 arch due to possible asymmetric CPU
> > setup. This patch extends its usage to save user provided cpumask
> > when sched_setaffinity() is called for all arches. With this patch
> > applied, user_cpus_ptr, once allocated after a successful call to
> > sched_setaffinity(), will only be freed when the task exits.
> > 
> > Since user_cpus_ptr is supposed to be used for "requested
> > affinity", there is actually no point to save current cpu affinity in
> > restrict_cpus_allowed_ptr() if sched_setaffinity() has never been called.
> > Modify the logic to set user_cpus_ptr only in sched_setaffinity() and use
> > it in restrict_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
> > if defined but not changing it.
> > 
> > This will be some changes in behavior for arm64 systems with asymmetric
> > CPUs in some corner cases. For instance, if sched_setaffinity()
> > has never been called and there is a cpuset change before
> > relax_compatible_cpus_allowed_ptr() is called, its subsequent call will
> > follow what the cpuset allows but not what the previous cpu affinity
> > setting allows.
> > 
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > ---
> >  kernel/sched/core.c  | 82 ++++++++++++++++++++------------------------
> >  kernel/sched/sched.h |  7 ++++
> >  2 files changed, 44 insertions(+), 45 deletions(-)
> 
> We've tracked this down as the cause of an arm64 regression in Android and I've
> reproduced the issue with mainline.
> 
> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
> the command-line, then the arch code will (amongst other things) call
> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
> when exec()'ing a 32-bit or a 64-bit task respectively.
> 
> If you consider a system where everything is 64-bit but the cmdline option
> above is present, then the call to relax_compatible_cpus_allowed_ptr() isn't
> expected to do anything in this case, and the old code made sure of that:
> 
> > @@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
> >  
> >  /*
> >   * Restore the affinity of a task @p which was previously restricted by a
> > - * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
> > - * @p->user_cpus_ptr.
> > + * call to force_compatible_cpus_allowed_ptr().
> >   *
> >   * It is the caller's responsibility to serialise this with any calls to
> >   * force_compatible_cpus_allowed_ptr(@p).
> >   */
> >  void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
> >  {
> > -	struct cpumask *user_mask = p->user_cpus_ptr;
> > -	unsigned long flags;
> > +	int ret;
> >  
> >  	/*
> > -	 * Try to restore the old affinity mask. If this fails, then
> > -	 * we free the mask explicitly to avoid it being inherited across
> > -	 * a subsequent fork().
> > +	 * Try to restore the old affinity mask with __sched_setaffinity().
> > +	 * Cpuset masking will be done there too.
> >  	 */
> > -	if (!user_mask || !__sched_setaffinity(p, user_mask))
> > -		return;
> 
> ... since it returned early here if '!user_mask' ...
> 
> > -
> > -	raw_spin_lock_irqsave(&p->pi_lock, flags);
> > -	user_mask = clear_user_cpus_ptr(p);
> > -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> > -
> > -	kfree(user_mask);
> > +	ret = __sched_setaffinity(p, task_user_cpus(p));
> > +	WARN_ON_ONCE(ret);
> 
> ... however, now we end up going down into __sched_setaffinity() with
> task_user_cpus(p) giving us the 'cpu_possible_mask'! This can lead to a mixture
> of WARN_ON()s and incorrect affinity masks (for example, a newly exec'd task
> ends up with the affinity mask of the online CPUs at the point of exec() and is
> unable to run on anything onlined later).
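
For context, task_user_cpus() falls back to cpu_possible_mask when no
user-requested affinity has been saved, which is why an unrestricted task
takes this path. A rough sketch of the helper as this series describes it
(shown for illustration only, not quoted from the posted patch):

	/* kernel/sched/sched.h -- sketch */
	static inline const struct cpumask *task_user_cpus(struct task_struct *p)
	{
		/* No successful sched_setaffinity() call has been made yet. */
		if (!p->user_cpus_ptr)
			return cpu_possible_mask;
		return p->user_cpus_ptr;
	}
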
> 
> I've had a crack at fixing the code above to restore the old behaviour, and it
> seems to work for my basic tests (still pending confirmation from others):

This seems to cure things... cpuset is insane and insists on limiting
things to online CPUs for no real reason. It is perfectly fine to have
offline CPUs in the allowed mask (in fact, that's the default
behaviour).

With this on and "relax_compatible_cpus_allowed_ptr(current);" added to
the exec() path things seem to work as expected for me.

I'll clean up and post properly tomorrow (I think there's a simpler
version hiding in there)...

---

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a29c0b13706b..7a63416a46f3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -498,19 +498,33 @@ static inline bool partition_is_populated(struct cpuset *cs,
  *
  * Call with callback_lock or cpuset_rwsem held.
  */
-static void guarantee_online_cpus(struct task_struct *tsk,
-				  struct cpumask *pmask)
+static void guarantee_cs_cpus(struct task_struct *tsk, struct cpumask *pmask, bool online)
 {
-	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
+	const struct cpumask *task_possible_mask = task_cpu_possible_mask(tsk);
+	const struct cpumask *possible_mask = cpu_possible_mask;
+	const struct cpumask *cs_cpus;
 	struct cpuset *cs;
 
-	if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
-		cpumask_copy(pmask, cpu_online_mask);
+	if (online)
+		possible_mask = cpu_online_mask;
+
+	if (WARN_ON(!cpumask_and(pmask, task_possible_mask, possible_mask)))
+		cpumask_copy(pmask, possible_mask);
 
 	rcu_read_lock();
 	cs = task_cs(tsk);
 
-	while (!cpumask_intersects(cs->effective_cpus, pmask)) {
+	if (!parent_cs(cs)) {
+		cs_cpus = cpu_possible_mask;
+		if (online)
+			cs_cpus = cpu_online_mask;
+	} else {
+		cs_cpus = cs->cpus_allowed;
+		if (online)
+			cs_cpus = cs->effective_cpus;
+	}
+
+	while (!cpumask_intersects(cs_cpus, pmask)) {
 		cs = parent_cs(cs);
 		if (unlikely(!cs)) {
 			/*
@@ -523,7 +537,8 @@ static void guarantee_online_cpus(struct task_struct *tsk,
 			goto out_unlock;
 		}
 	}
-	cpumask_and(pmask, pmask, cs->effective_cpus);
+
+	cpumask_and(pmask, pmask, cs_cpus);
 
 out_unlock:
 	rcu_read_unlock();
@@ -2540,7 +2555,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_for_each(task, css, tset) {
 		if (cs != &top_cpuset)
-			guarantee_online_cpus(task, cpus_attach);
+			guarantee_cs_cpus(task, cpus_attach, true);
 		else
 			cpumask_copy(cpus_attach, task_cpu_possible_mask(task));
 		/*
@@ -3699,7 +3714,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	unsigned long flags;
 
 	spin_lock_irqsave(&callback_lock, flags);
-	guarantee_online_cpus(tsk, pmask);
+	guarantee_cs_cpus(tsk, pmask, false);
 	spin_unlock_irqrestore(&callback_lock, flags);
 }
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2023-01-27 18:36     ` Peter Zijlstra
@ 2023-01-27 19:09       ` Waiman Long
  0 siblings, 0 replies; 25+ messages in thread
From: Waiman Long @ 2023-01-27 19:09 UTC (permalink / raw)
  To: Peter Zijlstra, Will Deacon
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret, Tejun Heo

On 1/27/23 13:36, Peter Zijlstra wrote:
> On Tue, Jan 17, 2023 at 04:08:26PM +0000, Will Deacon wrote:
>> Hi Waiman,
>>
>> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>>> is currently used only by arm64 arch due to possible asymmetric CPU
>>> setup. This patch extends its usage to save user provided cpumask
>>> when sched_setaffinity() is called for all arches. With this patch
>>> applied, user_cpus_ptr, once allocated after a successful call to
>>> sched_setaffinity(), will only be freed when the task exits.
>>>
>>> Since user_cpus_ptr is supposed to be used for "requested
>>> affinity", there is actually no point to save current cpu affinity in
>>> restrict_cpus_allowed_ptr() if sched_setaffinity() has never been called.
>>> Modify the logic to set user_cpus_ptr only in sched_setaffinity() and use
>>> it in restrict_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>>> if defined but not changing it.
>>>
>>> There will be some changes in behavior for arm64 systems with asymmetric
>>> CPUs in some corner cases. For instance, if sched_setaffinity()
>>> has never been called and there is a cpuset change before
>>> relax_compatible_cpus_allowed_ptr() is called, its subsequent call will
>>> follow what the cpuset allows but not what the previous cpu affinity
>>> setting allows.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>   kernel/sched/core.c  | 82 ++++++++++++++++++++------------------------
>>>   kernel/sched/sched.h |  7 ++++
>>>   2 files changed, 44 insertions(+), 45 deletions(-)
>> We've tracked this down as the cause of an arm64 regression in Android and I've
>> reproduced the issue with mainline.
>>
>> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
>> the command-line, then the arch code will (amongst other things) call
>> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>> when exec()'ing a 32-bit or a 64-bit task respectively.
>>
>> If you consider a system where everything is 64-bit but the cmdline option
>> above is present, then the call to relax_compatible_cpus_allowed_ptr() isn't
>> expected to do anything in this case, and the old code made sure of that:
>>
>>> @@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
>>>   
>>>   /*
>>>    * Restore the affinity of a task @p which was previously restricted by a
>>> - * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
>>> - * @p->user_cpus_ptr.
>>> + * call to force_compatible_cpus_allowed_ptr().
>>>    *
>>>    * It is the caller's responsibility to serialise this with any calls to
>>>    * force_compatible_cpus_allowed_ptr(@p).
>>>    */
>>>   void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>>>   {
>>> -	struct cpumask *user_mask = p->user_cpus_ptr;
>>> -	unsigned long flags;
>>> +	int ret;
>>>   
>>>   	/*
>>> -	 * Try to restore the old affinity mask. If this fails, then
>>> -	 * we free the mask explicitly to avoid it being inherited across
>>> -	 * a subsequent fork().
>>> +	 * Try to restore the old affinity mask with __sched_setaffinity().
>>> +	 * Cpuset masking will be done there too.
>>>   	 */
>>> -	if (!user_mask || !__sched_setaffinity(p, user_mask))
>>> -		return;
>> ... since it returned early here if '!user_mask' ...
>>
>>> -
>>> -	raw_spin_lock_irqsave(&p->pi_lock, flags);
>>> -	user_mask = clear_user_cpus_ptr(p);
>>> -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>>> -
>>> -	kfree(user_mask);
>>> +	ret = __sched_setaffinity(p, task_user_cpus(p));
>>> +	WARN_ON_ONCE(ret);
>> ... however, now we end up going down into __sched_setaffinity() with
>> task_user_cpus(p) giving us the 'cpu_possible_mask'! This can lead to a mixture
>> of WARN_ON()s and incorrect affinity masks (for example, a newly exec'd task
>> ends up with the affinity mask of the online CPUs at the point of exec() and is
>> unable to run on anything onlined later).
>>
>> I've had a crack at fixing the code above to restore the old behaviour, and it
>> seems to work for my basic tests (still pending confirmation from others):
> This seems to cure things... cpuset is insane and insists on limiting
> things to online CPUs for no real reason. It is perfectly fine to have
> offline CPUs in the allowed mask (in fact, that's the default
> behaviour).
>
> With this on and "relax_compatible_cpus_allowed_ptr(current);" added to
> the exec() path things seem to work as expected for me.
>
> I'll clean up and post properly tomorrow (I think there's a simpler
> version hiding in there)...
>
> ---
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a29c0b13706b..7a63416a46f3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -498,19 +498,33 @@ static inline bool partition_is_populated(struct cpuset *cs,
>    *
>    * Call with callback_lock or cpuset_rwsem held.
>    */
> -static void guarantee_online_cpus(struct task_struct *tsk,
> -				  struct cpumask *pmask)
> +static void guarantee_cs_cpus(struct task_struct *tsk, struct cpumask *pmask, bool online)
>   {
> -	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +	const struct cpumask *task_possible_mask = task_cpu_possible_mask(tsk);
> +	const struct cpumask *possible_mask = cpu_possible_mask;
> +	const struct cpumask *cs_cpus;
>   	struct cpuset *cs;
>   
> -	if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
> -		cpumask_copy(pmask, cpu_online_mask);
> +	if (online)
> +		possible_mask = cpu_online_mask;
> +
> +	if (WARN_ON(!cpumask_and(pmask, task_possible_mask, possible_mask)))
> +		cpumask_copy(pmask, possible_mask);
>   
>   	rcu_read_lock();
>   	cs = task_cs(tsk);
>   
> -	while (!cpumask_intersects(cs->effective_cpus, pmask)) {
> +	if (!parent_cs(cs)) {
> +		cs_cpus = cpu_possible_mask;
> +		if (online)
> +			cs_cpus = cpu_online_mask;
> +	} else {
> +		cs_cpus = cs->cpus_allowed;
> +		if (online)
> +			cs_cpus = cs->effective_cpus;

Using cpus_allowed directly may not be the right thing to do in the
case of cgroup v2. In v2, cpus_allowed starts out empty and
effective_cpus is inherited from the parent. So we may have to go up
the cpuset hierarchy to arrive at the proper cpus_allowed to use. We
may need another helper to do that.
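
One way such a helper could look -- a hypothetical sketch of the idea
only, cs_requested_cpus() is not an existing function -- is to walk up
from the task's cpuset until a non-empty cpus_allowed is found:

	/*
	 * Hypothetical: find the nearest ancestor with a non-empty requested
	 * mask. Assumes the caller holds rcu_read_lock(), as in
	 * guarantee_cs_cpus() above.
	 */
	static const struct cpumask *cs_requested_cpus(struct cpuset *cs)
	{
		while (parent_cs(cs) && cpumask_empty(cs->cpus_allowed))
			cs = parent_cs(cs);

		/* Reached top_cpuset: everything is allowed. */
		if (!parent_cs(cs))
			return cpu_possible_mask;

		return cs->cpus_allowed;
	}
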

Cheers,
Longman


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
  2023-01-26 12:52     ` Linux kernel regression tracking (#adding)
@ 2023-02-10 17:15       ` Linux kernel regression tracking (#update)
  0 siblings, 0 replies; 25+ messages in thread
From: Linux kernel regression tracking (#update) @ 2023-02-10 17:15 UTC (permalink / raw)
  To: Will Deacon, Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, linux-kernel, Linus Torvalds,
	Lai Jiangshan, qperret

[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. There is, afaics, a fix for the regression discussed in this
thread, but it did not use a Link: tag to point to the report, as Linus
and the documentation call for. Things happen, no worries -- but it
forces me to write this mail to make the regression tracking bot aware
of the fix. See link in footer if these mails annoy you.]

On 26.01.23 13:52, Linux kernel regression tracking (#adding) wrote:
> On 17.01.23 17:08, Will Deacon wrote:
>>
>> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>>> is currently used only by arm64 arch due to possible asymmetric CPU
>>> setup. This patch extends its usage to save user provided cpumask
>>> when sched_setaffinity() is called for all arches. With this patch
>>> applied, user_cpus_ptr, once allocated after a successful call to
>>> sched_setaffinity(), will only be freed when the task exits.
>> [...]
>> We've tracked this down as the cause of an arm64 regression in Android and I've
>> reproduced the issue with mainline.
>>
>> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
>> the command-line, then the arch code will (amongst other things) call
>> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>> when exec()'ing a 32-bit or a 64-bit task respectively.
>> [...]
> 
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
> 
> #regzbot ^introduced 8f9ea86fdf99

#regzbot fix: 3fb906e7fabbb5b76c3c
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-02-10 17:15 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-22 18:00 [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long
2022-09-22 18:00 ` [PATCH v10 1/5] sched: Add __releases annotations to affine_move_task() Waiman Long
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Waiman Long
2022-09-22 18:00 ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Waiman Long
2022-10-28  6:42   ` [tip: sched/core] sched: Always preserve the user requested cpumask tip-bot2 for Waiman Long
2023-01-17 16:08   ` [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity() Will Deacon
2023-01-17 18:13     ` Waiman Long
2023-01-20 17:59       ` Will Deacon
2023-01-20 18:10         ` Waiman Long
2023-01-26 12:52     ` Linux kernel regression tracking (#adding)
2023-02-10 17:15       ` Linux kernel regression tracking (#update)
2023-01-27 18:36     ` Peter Zijlstra
2023-01-27 19:09       ` Waiman Long
2022-09-22 18:00 ` [PATCH v10 3/5] sched: Enforce user requested affinity Waiman Long
2022-10-07 10:01   ` Peter Zijlstra
2022-10-07 14:57     ` Waiman Long
2022-10-07 15:23       ` Waiman Long
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Waiman Long
2022-09-22 18:00 ` [PATCH v10 4/5] sched: Handle set_cpus_allowed_ptr(), sched_setaffinity() & other races Waiman Long
2022-10-07 12:47   ` Peter Zijlstra
2022-10-07 18:59     ` Waiman Long
2022-10-28  6:42   ` [tip: sched/core] sched: Introduce affinity_context tip-bot2 for Waiman Long
2022-09-22 18:00 ` [PATCH v10 5/5] sched: Always clear user_cpus_ptr in do_set_cpus_allowed() Waiman Long
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Waiman Long
2022-10-06 21:31 ` [PATCH v10 0/5] sched: Persistent user requested affinity Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).