* [PATCH 0/2] sched: SCA vs hotplug vs stopper races fixes
@ 2021-05-26 20:57 Valentin Schneider
  2021-05-26 20:57 ` [PATCH 1/2] sched: Don't defer CPU pick to migration_cpu_stop() Valentin Schneider
  2021-05-26 20:57 ` [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop() Valentin Schneider
  0 siblings, 2 replies; 10+ messages in thread
From: Valentin Schneider @ 2021-05-26 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

Hi folks,

This is the continuation of [1]. As Will noted, that patch isn't sufficient to
plug all the nasty races involving SCA, hotplug and the stopper task, hence
patch 2.

I have to apologize, as this hasn't seen much testing (a CPU hog, a crazed
taskset, and some hotplugs in a loop), and unfortunately I need to call it a
day before running away to the British wilderness 'till Monday. I'll get back
to it then to expunge any remaining daftness.

[1]: http://lore.kernel.org/r/877djlhhmb.mognet@arm.com

Cheers,
Valentin

Valentin Schneider (2):
  sched: Don't defer CPU pick to migration_cpu_stop()
  sched: Plug race between SCA, hotplug and migration_cpu_stop()

 kernel/sched/core.c | 50 ++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 17 deletions(-)

--
2.25.1



* [PATCH 1/2] sched: Don't defer CPU pick to migration_cpu_stop()
  2021-05-26 20:57 [PATCH 0/2] sched: SCA vs hotplug vs stopper races fixes Valentin Schneider
@ 2021-05-26 20:57 ` Valentin Schneider
  2021-06-01 14:04   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
  2021-05-26 20:57 ` [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop() Valentin Schneider
  1 sibling, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2021-05-26 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

Will reported that the 'XXX __migrate_task() can fail' in migration_cpu_stop()
can happen, and it *is* sort of a big deal. Looking at it some more, one
will note there is a glaring hole in the deferred CPU selection:

  (w/ CONFIG_CPUSETS=n, so that the affinity mask passed via taskset doesn't
  get AND'd with cpu_online_mask)

  $ taskset -pc 0-2 $PID
  # offline CPUs 3-4
  $ taskset -pc 3-5 $PID
    `\
      $PID may stay on 0-2 due to cpumask_any_distribute() picking an
      offline CPU and __migrate_task() refusing to do anything due to
      is_cpu_allowed().

set_cpus_allowed_ptr() goes to some length to pick a dest_cpu that respects
both the task's affinity and the online/active state of the CPUs. Reuse that
pick instead of discarding it in the affine_move_task() case.
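
To make the hole concrete, here is a condensed sketch of the pre-patch
stopper path (paraphrased from the code removed below, not a verbatim
excerpt):

  /* migration_cpu_stop(), deferred-pick path, before this patch: */
  if (dest_cpu < 0) {
          if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
                  goto out;
          /* nothing excludes offline CPUs from this pick */
          dest_cpu = cpumask_any_distribute(&p->cpus_mask);
  }
  if (task_on_rq_queued(p))
          /*
           * __migrate_task() bails out via is_cpu_allowed() when
           * dest_cpu is offline, silently leaving p where it is.
           */
          rq = __migrate_task(rq, &rf, p, dest_cpu);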

Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/core.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d2527239c3e..d40511c55e80 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2273,7 +2273,6 @@ static int migration_cpu_stop(void *data)
 	struct migration_arg *arg = data;
 	struct set_affinity_pending *pending = arg->pending;
 	struct task_struct *p = arg->task;
-	int dest_cpu = arg->dest_cpu;
 	struct rq *rq = this_rq();
 	bool complete = false;
 	struct rq_flags rf;
@@ -2311,19 +2310,15 @@ static int migration_cpu_stop(void *data)
 		if (pending) {
 			p->migration_pending = NULL;
 			complete = true;
-		}
 
-		if (dest_cpu < 0) {
 			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
 				goto out;
-
-			dest_cpu = cpumask_any_distribute(&p->cpus_mask);
 		}
 
 		if (task_on_rq_queued(p))
-			rq = __migrate_task(rq, &rf, p, dest_cpu);
+			rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
 		else
-			p->wake_cpu = dest_cpu;
+			p->wake_cpu = arg->dest_cpu;
 
 		/*
 		 * XXX __migrate_task() can fail, at which point we might end
@@ -2606,7 +2601,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 			init_completion(&my_pending.done);
 			my_pending.arg = (struct migration_arg) {
 				.task = p,
-				.dest_cpu = -1,		/* any */
+				.dest_cpu = dest_cpu,
 				.pending = &my_pending,
 			};
 
@@ -2614,6 +2609,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		} else {
 			pending = p->migration_pending;
 			refcount_inc(&pending->refs);
+			/*
+			 * Affinity has changed, but we've already installed a
+			 * pending. migration_cpu_stop() *must* see this, else
+			 * we risk a completion of the pending despite having a
+			 * task on a disallowed CPU.
+			 *
+			 * Serialized by p->pi_lock, so this is safe.
+			 */
+			pending->arg.dest_cpu = dest_cpu;
 		}
 	}
 	pending = p->migration_pending;
-- 
2.25.1



* [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop()
  2021-05-26 20:57 [PATCH 0/2] sched: SCA vs hotplug vs stopper races fixes Valentin Schneider
  2021-05-26 20:57 ` [PATCH 1/2] sched: Don't defer CPU pick to migration_cpu_stop() Valentin Schneider
@ 2021-05-26 20:57 ` Valentin Schneider
  2021-05-27  9:12   ` Peter Zijlstra
  2021-06-01 16:59   ` Valentin Schneider
  1 sibling, 2 replies; 10+ messages in thread
From: Valentin Schneider @ 2021-05-26 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

migration_cpu_stop() is handed a CPU that was a valid pick at the time
set_cpus_allowed_ptr() was invoked. However, hotplug may have happened
between then and the stopper work being executed, meaning
__migrate_task() could fail when passed arg.dest_cpu. This could mean
leaving a task on its current CPU (despite it being outside the task's
affinity) up until its next wakeup, which is a big no-no.

Verify the validity of the pick in the stopper callback, and invoke
select_fallback_rq() there if need be.
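
A condensed timeline of the race (illustrative; the CPU numbers are made
up):

  CPU0 (set_cpus_allowed_ptr())        CPUx (stopper)
  pick dest_cpu = 3 (valid here)
  queue migration_cpu_stop()
            <CPU3 goes offline>
                                       migration_cpu_stop()
                                         __migrate_task(..., dest_cpu=3)
                                           is_cpu_allowed() fails -> no-op
                                         p keeps running outside its affinity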

Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/core.c | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d40511c55e80..0679efb6c22d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2263,6 +2263,8 @@ static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
 	return rq;
 }
 
+static int select_fallback_rq(int cpu, struct task_struct *p);
+
 /*
  * migration_cpu_stop - this will be executed by a highprio stopper thread
  * and performs thread migration by bumping thread off CPU then
@@ -2276,6 +2278,7 @@ static int migration_cpu_stop(void *data)
 	struct rq *rq = this_rq();
 	bool complete = false;
 	struct rq_flags rf;
+	int dest_cpu;
 
 	/*
 	 * The original target CPU might have gone down and we might
@@ -2315,18 +2318,27 @@ static int migration_cpu_stop(void *data)
 				goto out;
 		}
 
-		if (task_on_rq_queued(p))
-			rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
-		else
-			p->wake_cpu = arg->dest_cpu;
-
-		/*
-		 * XXX __migrate_task() can fail, at which point we might end
-		 * up running on a dodgy CPU, AFAICT this can only happen
-		 * during CPU hotplug, at which point we'll get pushed out
-		 * anyway, so it's probably not a big deal.
-		 */
-
+		dest_cpu = arg->dest_cpu;
+		if (task_on_rq_queued(p)) {
+			/*
+			 * A hotplug operation could have happened between
+			 * set_cpus_allowed_ptr() and here, making dest_cpu no
+			 * longer allowed.
+			 */
+			if (!is_cpu_allowed(p, dest_cpu))
+				dest_cpu = select_fallback_rq(cpu_of(rq), p);
+			/*
+			 * dest_cpu can be victim of hotplug between is_cpu_allowed()
+			 * and here. However, per the synchronize_rcu() in
+			 * sched_cpu_deactivate(), it can't have gone lower than
+			 * CPUHP_AP_ACTIVE, so it's safe to punt it over and let
+			 * balance_push() route it elsewhere.
+			 */
+			update_rq_clock(rq);
+			rq = move_queued_task(rq, &rf, p, dest_cpu);
+		} else {
+			p->wake_cpu = dest_cpu;
+		}
 	} else if (pending) {
 		/*
 		 * This happens when we get migrated between migrate_enable()'s
-- 
2.25.1



* Re: [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop()
  2021-05-26 20:57 ` [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop() Valentin Schneider
@ 2021-05-27  9:12   ` Peter Zijlstra
  2021-06-01 16:59   ` Valentin Schneider
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2021-05-27  9:12 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Will Deacon, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

On Wed, May 26, 2021 at 09:57:51PM +0100, Valentin Schneider wrote:

> -			rq = __migrate_task(rq, &rf, p, arg->dest_cpu);

Suggests we ought to at the very least include something like the below.

/me continues reading...

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2246,28 +2246,6 @@ struct set_affinity_pending {
 	struct migration_arg	arg;
 };
 
-/*
- * Move (not current) task off this CPU, onto the destination CPU. We're doing
- * this because either it can't run here any more (set_cpus_allowed()
- * away from this CPU, or CPU going down), or because we're
- * attempting to rebalance this task on exec (sched_exec).
- *
- * So we race with normal scheduler movements, but that's OK, as long
- * as the task is no longer on this CPU.
- */
-static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
-				 struct task_struct *p, int dest_cpu)
-{
-	/* Affinity changed (again). */
-	if (!is_cpu_allowed(p, dest_cpu))
-		return rq;
-
-	update_rq_clock(rq);
-	rq = move_queued_task(rq, rf, p, dest_cpu);
-
-	return rq;
-}
-
 static int select_fallback_rq(int cpu, struct task_struct *p);
 
 /*
@@ -2292,7 +2270,7 @@ static int migration_cpu_stop(void *data
 	local_irq_save(rf.flags);
 	/*
 	 * We need to explicitly wake pending tasks before running
-	 * __migrate_task() such that we will not miss enforcing cpus_ptr
+	 * move_queued_task() such that we will not miss enforcing cpus_ptr
 	 * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
 	 */
 	flush_smp_call_function_from_idle();
@@ -8463,7 +8441,7 @@ static int __balance_push_cpu_stop(void
 
 	if (task_rq(p) == rq && task_on_rq_queued(p)) {
 		cpu = select_fallback_rq(rq->cpu, p);
-		rq = __migrate_task(rq, &rf, p, cpu);
+		rq = move_queued_task(rq, &rf, p, cpu);
 	}
 
 	rq_unlock(rq, &rf);


* [tip: sched/core] sched: Don't defer CPU pick to migration_cpu_stop()
  2021-05-26 20:57 ` [PATCH 1/2] sched: Don't defer CPU pick to migration_cpu_stop() Valentin Schneider
@ 2021-06-01 14:04   ` tip-bot2 for Valentin Schneider
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2021-06-01 14:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Will Deacon, Valentin Schneider, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     475ea6c60279e9f2ddf7e4cf2648cd8ae0608361
Gitweb:        https://git.kernel.org/tip/475ea6c60279e9f2ddf7e4cf2648cd8ae0608361
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Wed, 26 May 2021 21:57:50 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 01 Jun 2021 16:00:11 +02:00

sched: Don't defer CPU pick to migration_cpu_stop()

Will reported that the 'XXX __migrate_task() can fail' in migration_cpu_stop()
can happen, and it *is* sort of a big deal. Looking at it some more, one
will note there is a glaring hole in the deferred CPU selection:

  (w/ CONFIG_CPUSETS=n, so that the affinity mask passed via taskset doesn't
  get AND'd with cpu_online_mask)

  $ taskset -pc 0-2 $PID
  # offline CPUs 3-4
  $ taskset -pc 3-5 $PID
    `\
      $PID may stay on 0-2 due to cpumask_any_distribute() picking an
      offline CPU and __migrate_task() refusing to do anything due to
      is_cpu_allowed().

set_cpus_allowed_ptr() goes to some length to pick a dest_cpu that respects
both the task's affinity and the online/active state of the CPUs. Reuse that
pick instead of discarding it in the affine_move_task() case.

Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210526205751.842360-2-valentin.schneider@arm.com
---
 kernel/sched/core.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e205c19..7e59466 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2273,7 +2273,6 @@ static int migration_cpu_stop(void *data)
 	struct migration_arg *arg = data;
 	struct set_affinity_pending *pending = arg->pending;
 	struct task_struct *p = arg->task;
-	int dest_cpu = arg->dest_cpu;
 	struct rq *rq = this_rq();
 	bool complete = false;
 	struct rq_flags rf;
@@ -2311,19 +2310,15 @@ static int migration_cpu_stop(void *data)
 		if (pending) {
 			p->migration_pending = NULL;
 			complete = true;
-		}
 
-		if (dest_cpu < 0) {
 			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
 				goto out;
-
-			dest_cpu = cpumask_any_distribute(&p->cpus_mask);
 		}
 
 		if (task_on_rq_queued(p))
-			rq = __migrate_task(rq, &rf, p, dest_cpu);
+			rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
 		else
-			p->wake_cpu = dest_cpu;
+			p->wake_cpu = arg->dest_cpu;
 
 		/*
 		 * XXX __migrate_task() can fail, at which point we might end
@@ -2606,7 +2601,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 			init_completion(&my_pending.done);
 			my_pending.arg = (struct migration_arg) {
 				.task = p,
-				.dest_cpu = -1,		/* any */
+				.dest_cpu = dest_cpu,
 				.pending = &my_pending,
 			};
 
@@ -2614,6 +2609,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		} else {
 			pending = p->migration_pending;
 			refcount_inc(&pending->refs);
+			/*
+			 * Affinity has changed, but we've already installed a
+			 * pending. migration_cpu_stop() *must* see this, else
+			 * we risk a completion of the pending despite having a
+			 * task on a disallowed CPU.
+			 *
+			 * Serialized by p->pi_lock, so this is safe.
+			 */
+			pending->arg.dest_cpu = dest_cpu;
 		}
 	}
 	pending = p->migration_pending;


* Re: [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop()
  2021-05-26 20:57 ` [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop() Valentin Schneider
  2021-05-27  9:12   ` Peter Zijlstra
@ 2021-06-01 16:59   ` Valentin Schneider
  2021-06-02 15:26     ` Will Deacon
  1 sibling, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2021-06-01 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

On 26/05/21 21:57, Valentin Schneider wrote:
> +		dest_cpu = arg->dest_cpu;
> +		if (task_on_rq_queued(p)) {
> +			/*
> +			 * A hotplug operation could have happened between
> +			 * set_cpus_allowed_ptr() and here, making dest_cpu no
> +			 * longer allowed.
> +			 */
> +			if (!is_cpu_allowed(p, dest_cpu))
> +				dest_cpu = select_fallback_rq(cpu_of(rq), p);
> +			/*
> +			 * dest_cpu can be victim of hotplug between is_cpu_allowed()
> +			 * and here. However, per the synchronize_rcu() in
> +			 * sched_cpu_deactivate(), it can't have gone lower than
> +			 * CPUHP_AP_ACTIVE, so it's safe to punt it over and let
> +			 * balance_push() route it elsewhere.
> +			 */
> +			update_rq_clock(rq);
> +			rq = move_queued_task(rq, &rf, p, dest_cpu);

So, while digesting this I started having doubts vs pcpu kthreads since
they're allowed on online CPUs. The bogus scenario here would be picking a
!active && online CPU and seeing it go !online before the move_queued_task().

Now, to transition from online -> !online, we have to go through
take_cpu_down() which is issued via a stop_machine() call. This means the
transition can't happen until all online CPUs are running the stopper task
and reach MULTI_STOP_RUN.

migration_cpu_stop() being already a stopper callback should thus make it
"atomic" vs takedown_cpu(), meaning the above should be fine.

> +		} else {
> +			p->wake_cpu = dest_cpu;
> +		}
>       } else if (pending) {
>               /*
>                * This happens when we get migrated between migrate_enable()'s
> --
> 2.25.1


* Re: [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop()
  2021-06-01 16:59   ` Valentin Schneider
@ 2021-06-02 15:26     ` Will Deacon
  2021-06-02 19:43       ` Valentin Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2021-06-02 15:26 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

On Tue, Jun 01, 2021 at 05:59:56PM +0100, Valentin Schneider wrote:
> On 26/05/21 21:57, Valentin Schneider wrote:
> > +		dest_cpu = arg->dest_cpu;
> > +		if (task_on_rq_queued(p)) {
> > +			/*
> > +			 * A hotplug operation could have happened between
> > +			 * set_cpus_allowed_ptr() and here, making dest_cpu no
> > +			 * longer allowed.
> > +			 */
> > +			if (!is_cpu_allowed(p, dest_cpu))
> > +				dest_cpu = select_fallback_rq(cpu_of(rq), p);
> > +			/*
> > +			 * dest_cpu can be victim of hotplug between is_cpu_allowed()
> > +			 * and here. However, per the synchronize_rcu() in
> > +			 * sched_cpu_deactivate(), it can't have gone lower than
> > +			 * CPUHP_AP_ACTIVE, so it's safe to punt it over and let
> > +			 * balance_push() route it elsewhere.
> > +			 */
> > +			update_rq_clock(rq);
> > +			rq = move_queued_task(rq, &rf, p, dest_cpu);
> 
> So, while digesting this I started having doubts vs pcpu kthreads since
> they're allowed on online CPUs. The bogus scenario here would be picking a
> !active && online CPU and seeing it go !online before the move_queued_task().
> 
> Now, to transition from online -> !online, we have to go through
> take_cpu_down() which is issued via a stop_machine() call. This means the
> transition can't happen until all online CPUs are running the stopper task
> and reach MULTI_STOP_RUN.
> 
> migration_cpu_stop() being already a stopper callback should thus make it
> "atomic" vs takedown_cpu(), meaning the above should be fine.

I'd be more inclined to agree with your reasoning if migration_cpu_stop()
couldn't itself call stop_one_cpu_nowait() to queue more work for the
stopper thread. What guarantees that takedown_cpu() can't queue its stopper
work in the middle of that?

Will


* Re: [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop()
  2021-06-02 15:26     ` Will Deacon
@ 2021-06-02 19:43       ` Valentin Schneider
  2021-06-22 13:51         ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2021-06-02 19:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

On 02/06/21 16:26, Will Deacon wrote:
> On Tue, Jun 01, 2021 at 05:59:56PM +0100, Valentin Schneider wrote:
>> On 26/05/21 21:57, Valentin Schneider wrote:
>> > +		dest_cpu = arg->dest_cpu;
>> > +		if (task_on_rq_queued(p)) {
>> > +			/*
>> > +			 * A hotplug operation could have happened between
>> > +			 * set_cpus_allowed_ptr() and here, making dest_cpu no
>> > +			 * longer allowed.
>> > +			 */
>> > +			if (!is_cpu_allowed(p, dest_cpu))
>> > +				dest_cpu = select_fallback_rq(cpu_of(rq), p);
>> > +			/*
>> > +			 * dest_cpu can be victim of hotplug between is_cpu_allowed()
>> > +			 * and here. However, per the synchronize_rcu() in
>> > +			 * sched_cpu_deactivate(), it can't have gone lower than
>> > +			 * CPUHP_AP_ACTIVE, so it's safe to punt it over and let
>> > +			 * balance_push() route it elsewhere.
>> > +			 */
>> > +			update_rq_clock(rq);
>> > +			rq = move_queued_task(rq, &rf, p, dest_cpu);
>>
>> So, while digesting this I started having doubts vs pcpu kthreads since
>> they're allowed on online CPUs. The bogus scenario here would be picking a
>> !active && online CPU and seeing it go !online before the move_queued_task().
>>
>> Now, to transition from online -> !online, we have to go through
>> take_cpu_down() which is issued via a stop_machine() call. This means the
>> transition can't happen until all online CPUs are running the stopper task
>> and reach MULTI_STOP_RUN.
>>
>> migration_cpu_stop() being already a stopper callback should thus make it
>> "atomic" vs takedown_cpu(), meaning the above should be fine.
>
> I'd be more inclined to agree with your reasoning if migration_cpu_stop()
> couldn't itself call stop_one_cpu_nowait() to queue more work for the
> stopper thread. What guarantees that takedown_cpu() can't queue its stopper
> work in the middle of that?
>

Harumph...

So something like all CPUs but one are running their take_cpu_down()
callback because one is still running migration_cpu_stop(), i.e.:

  CPU0                   CPU1                ...             CPUN
  <stopper>              <stopper>                           <stopper>
    migration_cpu_stop()   take_cpu_down()@MULTI_STOP_PREPARE    take_cpu_down()@MULTI_STOP_PREPARE

If CPU0 hits that else if (pending) condition, it'll queue a
migration_cpu_stop() elsewhere (say CPU1), then run the take_cpu_down()
callback which follows in its work->list.

If the CPU being brought down is anything other than CPU1, it shouldn't
really matter. If it *is* CPU1, then I think we've got some guarantees.

Namely, there's no (read: there shouldn't be any) way for a task to
still be on CPU1 at this point; per sched_cpu_wait_empty(),
migration-disabled tasks and pcpu kthreads must vacate the CPU by then
(migrate_disable() regions must be bounded, and pcpu kthreads are
expected to be moved away by their respective owners).
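
For reference, sched_cpu_wait_empty() amounts to something like the below
(again from memory, so roughly):

  int sched_cpu_wait_empty(unsigned int cpu)
  {
          balance_hotplug_wait();
          return 0;
  }

  static void balance_hotplug_wait(void)
  {
          struct rq *rq = this_rq();

          /* sleep until this is the rq's only runnable task and no
           * migrate_disable()'d task is pinned here */
          rcuwait_wait_event(&rq->hotplug_wait,
                             rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
                             TASK_UNINTERRUPTIBLE);
  }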

With the above, I don't really see how migration_cpu_stop() could queue
itself again on an about-to-die CPU (thus racing with the take_cpu_down()
callback), and even if it did, then we'd still run the callback before
parking the stopper thread (no surprise callback persisting during
offline), in which case we'd either see the task as having moved somewhere
sensible, or we'll queue yet another callback.


I've lost track of how many times I rewrote the above, so it's probably to
be taken with a healthy pinch of salt.

> Will


* Re: [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop()
  2021-06-02 19:43       ` Valentin Schneider
@ 2021-06-22 13:51         ` Will Deacon
  2021-06-24  9:33           ` Valentin Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2021-06-22 13:51 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

Hi Valentin,

I've been looking at this on and off and I'm afraid I'm still not convinced,
almost certainly due to my own ignorance, but hey :)

On Wed, Jun 02, 2021 at 08:43:31PM +0100, Valentin Schneider wrote:
> On 02/06/21 16:26, Will Deacon wrote:
> > On Tue, Jun 01, 2021 at 05:59:56PM +0100, Valentin Schneider wrote:
> >> On 26/05/21 21:57, Valentin Schneider wrote:
> >> > +		dest_cpu = arg->dest_cpu;
> >> > +		if (task_on_rq_queued(p)) {
> >> > +			/*
> >> > +			 * A hotplug operation could have happened between
> >> > +			 * set_cpus_allowed_ptr() and here, making dest_cpu no
> >> > +			 * longer allowed.
> >> > +			 */
> >> > +			if (!is_cpu_allowed(p, dest_cpu))
> >> > +				dest_cpu = select_fallback_rq(cpu_of(rq), p);
> >> > +			/*
> >> > +			 * dest_cpu can be victim of hotplug between is_cpu_allowed()
> >> > +			 * and here. However, per the synchronize_rcu() in
> >> > +			 * sched_cpu_deactivate(), it can't have gone lower than
> >> > +			 * CPUHP_AP_ACTIVE, so it's safe to punt it over and let
> >> > +			 * balance_push() route it elsewhere.
> >> > +			 */
> >> > +			update_rq_clock(rq);
> >> > +			rq = move_queued_task(rq, &rf, p, dest_cpu);
> >>
> >> So, while digesting this I started having doubts vs pcpu kthreads since
> >> they're allowed on online CPUs. The bogus scenario here would be picking a
> >> !active && online CPU and seeing it go !online before the move_queued_task().
> >>
> >> Now, to transition from online -> !online, we have to go through
> >> take_cpu_down() which is issued via a stop_machine() call. This means the
> >> transition can't happen until all online CPUs are running the stopper task
> >> and reach MULTI_STOP_RUN.
> >>
> >> migration_cpu_stop() being already a stopper callback should thus make it
> >> "atomic" vs takedown_cpu(), meaning the above should be fine.
> >
> > I'd be more inclined to agree with your reasoning if migration_cpu_stop()
> > couldn't itself call stop_one_cpu_nowait() to queue more work for the
> > stopper thread. What guarantees that takedown_cpu() can't queue its stopper
> > work in the middle of that?
> >
> 
> Harumph...
> 
> So something like all CPUs but one are running their take_cpu_down()
> callback because one is still running migration_cpu_stop(), i.e.:
> 
>   CPU0                   CPU1                ...             CPUN
>   <stopper>              <stopper>                           <stopper>
>     migration_cpu_stop()   take_cpu_down()@MULTI_STOP_PREPARE    take_cpu_down()@MULTI_STOP_PREPARE
> 
> If CPU0 hits that else if (pending) condition, it'll queue a
> migration_cpu_stop() elsewhere (say CPU1), then run the take_cpu_down()
> callback which follows in its work->list.
> 
> If the CPU being brought down is anything other than CPU1, it shouldn't
> really matter. If it *is* CPU1, then I think we've got some guarantees.
> 
> Namely, there's no (read: there shouldn't be any) way for a task to
> still be on CPU1 at this point; per sched_cpu_wait_empty(),
> migration-disabled tasks and pcpu kthreads must vacate the CPU by then
> (migrate_disable() regions must be bounded, and pcpu kthreads are
> expected to be moved away by their respective owners).

I agree with that, but the stopper is still running on CPU1 and so
migration_cpu_stop() could still queue work there after sched_cpu_wait_empty()
has returned but before stop_machine_park(), afaict.

Actually, it looks like migration_cpu_stop() ignores the return value of
stop_one_cpu_nowait(), so if the stopper thread has been parked I think
we'll quietly do nothing there as well.
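
For reference, the tail of that pending path is roughly (condensed, from
memory):

  task_rq_unlock(rq, p, &rf);
  stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop,
                      &pending->arg, &pending->stop_work);
  /*
   * ^ returns false when the target CPU's stopper is disabled (e.g.
   * parked for offline), in which case the work is never queued,
   * and nothing here checks for that.
   */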

> With the above, I don't really see how migration_cpu_stop() could queue
> itself again on an about-to-die CPU (thus racing with the take_cpu_down()
> callback), and even if it did, then we'd still run the callback before
> parking the stopper thread (no surprise callback persisting during
> offline), in which case we'd either see the task as having moved somewhere
> sensible, or we'll queue yet another callback.

In this case, I think we'd run the migration_cpu_stop() callback, but
then __migrate_task() would fail to move the task because the CPU would
be !online.

Will


* Re: [PATCH 2/2] sched: Plug race between SCA, hotplug and migration_cpu_stop()
  2021-06-22 13:51         ` Will Deacon
@ 2021-06-24  9:33           ` Valentin Schneider
  0 siblings, 0 replies; 10+ messages in thread
From: Valentin Schneider @ 2021-06-24  9:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Dietmar Eggemann, Quentin Perret, Juri Lelli,
	Vincent Guittot, kernel-team

Hi,

On 22/06/21 14:51, Will Deacon wrote:
> Hi Valentin,
>
> I've been looking at this on and off and I'm afraid I'm still not convinced,
> almost certainly due to my own ignorance, but hey :)
>

Nah, this thing is nasty, I still haven't regrown the hair I lost from last
time I stared at it.

> On Wed, Jun 02, 2021 at 08:43:31PM +0100, Valentin Schneider wrote:
>> Harumph...
>>
>> So something like all CPUs but one are running their take_cpu_down()
>> callback because one is still running migration_cpu_stop(), i.e.:
>>
>>   CPU0                   CPU1                ...             CPUN
>>   <stopper>              <stopper>                           <stopper>
>>     migration_cpu_stop()   take_cpu_down()@MULTI_STOP_PREPARE    take_cpu_down()@MULTI_STOP_PREPARE
>>
>> If CPU0 hits that else if (pending) condition, it'll queue a
>> migration_cpu_stop() elsewhere (say CPU1), then run the take_cpu_down()
>> callback which follows in its work->list.
>>
>> If the CPU being brought down is anything other than CPU1, it shouldn't
>> really matter. If it *is* CPU1, then I think we've got some guarantees.
>>
>> Namely, there's no (read: there shouldn't be any) way for a task to
>> still be on CPU1 at this point; per sched_cpu_wait_empty(),
>> migration-disabled tasks and pcpu kthreads must vacate the CPU by then
>> (migrate_disable() regions must be bounded, and pcpu kthreads are
>> expected to be moved away by their respective owners).
>
> I agree with that, but the stopper is still running on CPU1 and so
> migration_cpu_stop() could still queue work there after sched_cpu_wait_empty()
> has returned but before stop_machine_park(), afaict.
>

Right, the stopper should flush its work before parking itself - in the
above scenario migration_cpu_stop() is already on CPU1's work list when it
finishes running take_cpu_down(), so that should run before any parking
happens.

> Actually, it looks like migration_cpu_stop() ignores the return value of
> stop_one_cpu_nowait(), so if the stopper thread has been parked I think
> we'll quietly do nothing there as well.
>

There's a handful of loosely related cogs (including those added by the
patch) that *I think* give us sufficient guarantees this can't happen. Let
me try to structure this somehow, and please point out any inane bit.

1) Per the synchronize_rcu() in CPUHP_AP_ACTIVE.teardown(), the return
   value of is_cpu_allowed(p, cpu) is stable within a migration_cpu_stop()
   invocation for any p allowed on *active* CPUs.
2) Per the serialization of stopper callbacks, the return value of
   is_cpu_allowed(p, cpu) is stable within a migration_cpu_stop()
   invocation for any p allowed on *online* CPUs: migration_cpu_stop()
   has to complete before any CPU can do an online -> !online transition.

  (I would also add here that I don't expect any task to be migrated to an
  !active && online dying CPU via migration_cpu_stop(): the only tasks
  allowed on such CPUs are pcpu kthreads, which will be parked rather than
  migrated, and migrate_disabled() tasks which, unsurprisingly, aren't
  migrated - they just get to *stay* on their CPU longer than regular tasks)

3) Per 1) + 2), the move_queued_task() call in migration_cpu_stop() cannot
   silently fail.


4) No task other than a hotplug thread or a stopper thread may run on a
   CPU beyond CPUHP_AP_SCHED_WAIT_EMPTY.teardown().
5) Per 4), no (queued) task p passed to migration_cpu_stop() can have
   task_cpu(p) be a dying CPU with hotplug state below
   CPUHP_AP_SCHED_WAIT_EMPTY.
6) Per 5), migration_cpu_stop() cannot invoke stop_one_cpu_nowait() on a
   CPU with a parked stopper.
7) Per 6), migration_cpu_stop() self-queuing cannot be silently discarded
   and will always end up executed.

Per 3) + 7), we're good?
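
(For the active vs. online distinction in 1) and 2): is_cpu_allowed() is
roughly the below. Paraphrased, and the exact checks vary by version:

  static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
  {
          if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                  return false;

          /* migrate_disable()'d tasks must be allowed to finish */
          if (is_migration_disabled(p))
                  return cpu_online(cpu);

          /* regular tasks are gated on the *active* mask */
          if (!(p->flags & PF_KTHREAD))
                  return cpu_active(cpu);

          /* kthreads, incl. pcpu ones, only need the CPU *online* */
          return cpu_online(cpu);
  }

hence pcpu kthreads only get the weaker online guarantee discussed above.)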

