All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] sched: Fix affine_move_task()
@ 2021-02-13 12:50 Peter Zijlstra
  2021-02-16 17:34 ` Valentin Schneider
  0 siblings, 1 reply; 2+ messages in thread
From: Peter Zijlstra @ 2021-02-13 12:50 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar, Thomas Gleixner
  Cc: Valentin Schneider, Vincent Guittot, Mel Gorman,
	Dietmar Eggemann, linux-kernel


When affine_move_task(p) is called on a running task @p, which is not
otherwise already changing affinity, we'll first set
p->migration_pending and then do:

	 stop_one_cpu(cpu_of_rq(rq), migration_cpu_stop, &arg);

This then gets us to migration_cpu_stop() running on the CPU that was
previously running our victim task @p.

If we find that our task is no longer on that runqueue (this can
happen because of a concurrent migration due to load-balance etc.),
then we'll end up at the:

	} else if (dest_cpu < 1 || pending) {

branch. Which we'll take because we set pending earlier. Here we first
check if the task @p has already satisfied the affinity constraints,
if so we bail early [A]. Otherwise we'll reissue migration_cpu_stop()
onto the CPU that is now hosting our task @p:

	stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
			    &pending->arg, &pending->stop_work);

Except, we've never initialized pending->arg, which will be all 0s.

This then results in running migration_cpu_stop() on the next CPU with
arg->p == NULL, which gives the by now obvious result of fireworks.

The cure is to change affine_move_task() to always use pending->arg,
furthermore we can use the exact same pattern as the
SCA_MIGRATE_ENABLE case, since we'll block on the pending->done
completion anyway, no point in adding yet another completion in
stop_one_cpu().

This then gives a clear distinction between the two
migration_cpu_stop() use cases:

  - sched_exec() / migrate_task_to() : arg->pending == NULL
  - affine_move_task() : arg->pending != NULL;

And we can have it ignore p->migration_pending when !arg->pending. Any
stop work from sched_exec() / migrate_task_to() is in addition to stop
works from affine_move_task(), which will be sufficient to issue the
completion.


NOTES:

 - I've not been able to reproduce this crash on any of my machines
   without first removing the early termination condition [A] above.
   Doing this is a functional NOP but obviously widens up the window.

 - With the check [A] removed I can consistently hit the NULL deref
   and the below patch reliably cures it.

 - The original reporter says that this patch cures the NULL deref
   but results in another problem, which I've not yet been able to
   make sense of and obviously have failed at reproduction as well :/

Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   39 ++++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1924,6 +1924,24 @@ static int migration_cpu_stop(void *data
 	rq_lock(rq, &rf);
 
 	pending = p->migration_pending;
+	if (pending && !arg->pending) {
+		/*
+		 * This happens from sched_exec() and migrate_task_to(),
+		 * neither of them care about pending and just want a task to
+		 * maybe move about.
+		 *
+		 * Even if there is a pending, we can ignore it, since
+		 * affine_move_task() will have it's own stop_work's in flight
+		 * which will manage the completion.
+		 *
+		 * Notably, pending doesn't need to match arg->pending. This can
+		 * happen when tripple concurrent affine_move_task() first sets
+		 * pending, then clears pending and eventually sets another
+		 * pending.
+		 */
+		pending = NULL;
+	}
+
 	/*
 	 * If task_rq(p) != rq, it cannot be migrated here, because we're
 	 * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
@@ -2196,10 +2214,6 @@ static int affine_move_task(struct rq *r
 			    int dest_cpu, unsigned int flags)
 {
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
-	struct migration_arg arg = {
-		.task = p,
-		.dest_cpu = dest_cpu,
-	};
 	bool complete = false;
 
 	/* Can the task run on the task's current CPU? If so, we're done */
@@ -2237,6 +2251,12 @@ static int affine_move_task(struct rq *r
 			/* Install the request */
 			refcount_set(&my_pending.refs, 1);
 			init_completion(&my_pending.done);
+			my_pending.arg = (struct migration_arg) {
+				.task = p,
+				.dest_cpu = -1,
+				.pending = &my_pending,
+			};
+
 			p->migration_pending = &my_pending;
 		} else {
 			pending = p->migration_pending;
@@ -2267,12 +2287,6 @@ static int affine_move_task(struct rq *r
 		p->migration_flags &= ~MDF_PUSH;
 		task_rq_unlock(rq, p, rf);
 
-		pending->arg = (struct migration_arg) {
-			.task = p,
-			.dest_cpu = -1,
-			.pending = pending,
-		};
-
 		stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
 				    &pending->arg, &pending->stop_work);
 
@@ -2285,8 +2299,11 @@ static int affine_move_task(struct rq *r
 		 * is_migration_disabled(p) checks to the stopper, which will
 		 * run on the same CPU as said p.
 		 */
+		refcount_inc(&pending->refs); /* pending->{arg,stop_work} */
 		task_rq_unlock(rq, p, rf);
-		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+
+		stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
+				    &pending->arg, &pending->stop_work);
 
 	} else {
 

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [RFC][PATCH] sched: Fix affine_move_task()
  2021-02-13 12:50 [RFC][PATCH] sched: Fix affine_move_task() Peter Zijlstra
@ 2021-02-16 17:34 ` Valentin Schneider
  0 siblings, 0 replies; 2+ messages in thread
From: Valentin Schneider @ 2021-02-16 17:34 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds, Ingo Molnar, Thomas Gleixner
  Cc: Vincent Guittot, Mel Gorman, Dietmar Eggemann, linux-kernel

On 13/02/21 13:50, Peter Zijlstra wrote:
> When affine_move_task(p) is called on a running task @p, which is not
> otherwise already changing affinity, we'll first set
> p->migration_pending and then do:
>
> 	 stop_one_cpu(cpu_of_rq(rq), migration_cpu_stop, &arg);
>
> This then gets us to migration_cpu_stop() running on the CPU that was
> previously running our victim task @p.
>
> If we find that our task is no longer on that runqueue (this can
> happen because of a concurrent migration due to load-balance etc.),
> then we'll end up at the:
>
> 	} else if (dest_cpu < 1 || pending) {
>
> branch. Which we'll take because we set pending earlier. Here we first
> check if the task @p has already satisfied the affinity constraints,
> if so we bail early [A]. Otherwise we'll reissue migration_cpu_stop()
> onto the CPU that is now hosting our task @p:
>
> 	stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> 			    &pending->arg, &pending->stop_work);
>
> Except, we've never initialized pending->arg, which will be all 0s.
>
> This then results in running migration_cpu_stop() on the next CPU with
> arg->p == NULL, which gives the by now obvious result of fireworks.
>
> The cure is to change affine_move_task() to always use pending->arg,
> furthermore we can use the exact same pattern as the
> SCA_MIGRATE_ENABLE case, since we'll block on the pending->done
> completion anyway, no point in adding yet another completion in
> stop_one_cpu().
>
> This then gives a clear distinction between the two
> migration_cpu_stop() use cases:
>
>   - sched_exec() / migrate_task_to() : arg->pending == NULL
>   - affine_move_task() : arg->pending != NULL;
>
> And we can have it ignore p->migration_pending when !arg->pending. Any
> stop work from sched_exec() / migrate_task_to() is in addition to stop
> works from affine_move_task(), which will be sufficient to issue the
> completion.
>
>
> NOTES:
>
>  - I've not been able to reproduce this crash on any of my machines
>    without first removing the early termination condition [A] above.
>    Doing this is a functional NOP but obviously widens up the window.
>

FWIW although I mistakenly didn't model any distinction between arg &
pending->arg, I did "hit" this path in TLA+ [1]

>  - With the check [A] removed I can consistently hit the NULL deref
>    and the below patch reliably cures it.
>
>  - The original reporter says that this patch cures the NULL deref
>    but results in another problem, which I've not yet been able to
>    make sense of and obviously have failed at reproduction as well :/
>

Do you have a link to that? I fumbled around my mails but haven't seen
anything.

> Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

The below looks good to me, I'll whack it into the TLA+ machine and see
where it goes. While at it I need to mention I *have* been cleaning it up
for upstreaming, but have hit [1] in the process...

[1]: http://lore.kernel.org/r/20210127193035.13789-1-valentin.schneider@arm.com

> ---
>  kernel/sched/core.c |   39 ++++++++++++++++++++++++++++-----------
>  1 file changed, 28 insertions(+), 11 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1924,6 +1924,24 @@ static int migration_cpu_stop(void *data
>  	rq_lock(rq, &rf);
>  
>  	pending = p->migration_pending;
> +	if (pending && !arg->pending) {
> +		/*
> +		 * This happens from sched_exec() and migrate_task_to(),
> +		 * neither of them care about pending and just want a task to
> +		 * maybe move about.
> +		 *
> +		 * Even if there is a pending, we can ignore it, since
> +		 * affine_move_task() will have it's own stop_work's in flight
> +		 * which will manage the completion.
> +		 *
> +		 * Notably, pending doesn't need to match arg->pending. This can
> +		 * happen when tripple concurrent affine_move_task() first sets
                               ^^^^^^
                           s/tripple/triple

> +		 * pending, then clears pending and eventually sets another
> +		 * pending.
> +		 */
> +		pending = NULL;
> +	}
> +
>  	/*
>  	 * If task_rq(p) != rq, it cannot be migrated here, because we're
>  	 * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-02-16 17:36 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-13 12:50 [RFC][PATCH] sched: Fix affine_move_task() Peter Zijlstra
2021-02-16 17:34 ` Valentin Schneider

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.