* sched_setscheduler() vs idle_balance() race
@ 2015-05-28  7:43 Mike Galbraith
  2015-05-28 11:51 ` Peter Zijlstra
  2015-05-28 13:53 ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Galbraith @ 2015-05-28  7:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar

Hi Peter,

I'm not seeing what prevents pull_task() from yanking a task out from
under __sched_setscheduler().  A box sprinkling smoldering 3.0 kernel
wreckage all over my bugzilla mbox isn't seeing it either ;-)

Scenario: rt task forks, wakes child to CPU foo, immediately tries to
change child to fair class, calls switched_from_rt(), that leads to
pull_rt_task() -> double_lock_balance() which momentarily drops child's
rq->lock, letting some prick doing idle balancing over on CPU bar in to
migrate the child.  Rt parent then calls switched_to_fair(), and box
explodes when we use the passed rq as if the child still lived there.
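
Roughly, the interleaving as I see it (mainline names, abridged; the old
tree differs in detail):

  rt parent: sched_setscheduler(child)        CPU bar: idle_balance()
    lock child's rq (CPU foo)
    dequeue child, __setscheduler(), enqueue as fair
    check_class_changed()
      switched_from_rt()
        pull_rt_task()
          double_lock_balance()  <-- momentarily drops foo's rq->lock
                                               locks foo + bar
                                               can_migrate_task(child) says ok
                                               pulls child: foo -> bar
          retakes foo's rq->lock
      switched_to_fair(rq == foo, child)  <-- child now lives on bar, boom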

I sent a patchlet to verify that the diagnosis is really really correct
(can_migrate_task() says no if ->pi_lock is held), but I think it is,
the 8x10 color glossy with circles and arrows clearly shows both tasks
with their grubby mitts on that child at the same time, each thinking it
has that child locked down tight.

Not seeing what should prevent that in mainline either, I'll just ask
while I wait to (hopefully) hear "yup, all better".

	-Mike



* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28  7:43 sched_setscheduler() vs idle_balance() race Mike Galbraith
@ 2015-05-28 11:51 ` Peter Zijlstra
  2015-05-28 12:04   ` Mike Galbraith
  2015-05-28 13:53 ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-05-28 11:51 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, May 28, 2015 at 09:43:52AM +0200, Mike Galbraith wrote:
> Hi Peter,
> 
> I'm not seeing what prevents pull_task() from yanking a task out from
> under __sched_setscheduler().  A box sprinkling smoldering 3.0 kernel
> wreckage all over my bugzilla mbox isn't seeing it either ;-)
> 
> Scenario: rt task forks, wakes child to CPU foo, immediately tries to
> change child to fair class, calls switched_from_rt(), that leads to
> pull_rt_task() -> double_lock_balance() which momentarily drops child's
> rq->lock, letting some prick doing idle balancing over on CPU bar in to
> migrate the child.  Rt parent then calls switched_to_fair(), and box
> explodes when we use the passed rq as if the child still lived there.
> 
> I sent a patchlet to verify that the diagnosis is really really correct
> (can_migrate_task() says no if ->pi_lock is held), but I think it is,
> the 8x10 color glossy with circles and arrows clearly shows both tasks
> with their grubby mitts on that child at the same time, each thinking it
> has that child locked down tight.
> 
> Not seeing what should prevent that in mainline either, I'll just ask
> while I wait to (hopefully) hear "yup, all better".

The last patch to come close is 67dfa1b756f2 ("sched/deadline: Implement
cancel_dl_timer() to use in switched_from_dl()")

Which places the comment /* Possible rq-lock hole */ between
switched_from() and switched_to().

Which is exactly the hole you mean, right?

And that commit talks about how all that is 'safe' because all scheduler
operations take ->pi_lock, which is true, except for load-balancing,
which only uses rq->lock.

Furthermore, we call check_class_changed() _after_ we enqueue the task
on the new class, so balancing can indeed occur.
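
That is, the current ordering in __sched_setscheduler() is roughly
(abridged):

	queued = task_on_rq_queued(p);
	running = task_current(rq, p);
	if (queued)
		dequeue_task(rq, p, 0);
	if (running)
		put_prev_task(rq, p);

	prev_class = p->sched_class;
	__setscheduler(rq, p, attr, true);

	if (running)
		p->sched_class->set_curr_task(rq);
	if (queued)
		enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);

	check_class_changed(rq, p, prev_class, oldprio); /* may drop rq->lock */
	task_rq_unlock(rq, p, &flags);

so by the time switched_from()/switched_to() get to drop rq->lock, @p is
already sitting on its new runqueue where the load-balancer can see it.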

Lemme go stare at this; ideally we'd call check_class_changed() at
__setscheduler() time where the task is off all rqs, but I suspect
there's 'obvious' problems with that..


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 11:51 ` Peter Zijlstra
@ 2015-05-28 12:04   ` Mike Galbraith
  2015-05-28 12:06     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2015-05-28 12:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, 2015-05-28 at 13:51 +0200, Peter Zijlstra wrote:
> On Thu, May 28, 2015 at 09:43:52AM +0200, Mike Galbraith wrote:
> > Hi Peter,
> > 
> > I'm not seeing what prevents pull_task() from yanking a task out from
> > under __sched_setscheduler().  A box sprinkling smoldering 3.0 kernel
> > wreckage all over my bugzilla mbox isn't seeing it either ;-)
> > 
> > Scenario: rt task forks, wakes child to CPU foo, immediately tries to
> > change child to fair class, calls switched_from_rt(), that leads to
> > pull_rt_task() -> double_lock_balance() which momentarily drops child's
> > rq->lock, letting some prick doing idle balancing over on CPU bar in to
> > migrate the child.  Rt parent then calls switched_to_fair(), and box
> > explodes when we use the passed rq as if the child still lived there.
> > 
> > I sent a patchlet to verify that the diagnosis is really really correct
> > (can_migrate_task() says no if ->pi_lock is held), but I think it is,
> > the 8x10 color glossy with circles and arrows clearly shows both tasks
> > with their grubby mitts on that child at the same time, each thinking it
> > has that child locked down tight.
> > 
> > Not seeing what should prevent that in mainline either, I'll just ask
> > while I wait to (hopefully) hear "yup, all better".
> 
> The last patch to come close is 67dfa1b756f2 ("sched/deadline: Implement
> cancel_dl_timer() to use in switched_from_dl()")
> 
> Which places the comment /* Possible rq-lock hole */ between
> switched_from() and switched_to().
> 
> Which is exactly the hole you mean, right?

Yeah, but that hole is way older than dl.  Box falling into it is
running SLE11, which is.. well, still somewhat resembles 3.0.

> And that commit talks about how all that is 'safe' because all scheduler
> operations take ->pi_lock, which is true, except for load-balancing,
> which only uses rq->lock.

Yes.  The child's CPU scheduled, so the child was no longer ->curr, making
it eligible given !hot or too many failed attempts.
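
For reference, the gist of can_migrate_task(), heavily abridged and from
memory (details may well differ per tree):

	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p)))
		return 0;	/* not allowed on the destination */
	if (task_running(env->src_rq, p))
		return 0;	/* still ->curr, hands off */
	if (!task_hot(p, env) ||
	    env->sd->nr_balance_failed > env->sd->cache_nice_tries)
		return 1;	/* !hot, or we failed often enough: pull it */
	return 0;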

Oh, btw, we pull tasks that are about to schedule off too.  While first
trying to figure out wth was going on, I sprinkled some checks, and "NOPE,
why bother" was the only one that appeared, quite a lot.

> Furthermore, we call check_class_changed() _after_ we enqueue the task
> on the new class, so balancing can indeed occur.
> 
> Lemme go stare at this; ideally we'd call check_class_changed() at
> __setscheduler() time where the task is off all rqs, but I suspect
> there's 'obvious' problems with that..

	-Mike



* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 12:04   ` Mike Galbraith
@ 2015-05-28 12:06     ` Peter Zijlstra
  2015-05-28 12:32       ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-05-28 12:06 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, May 28, 2015 at 02:04:21PM +0200, Mike Galbraith wrote:
> On Thu, 2015-05-28 at 13:51 +0200, Peter Zijlstra wrote:

> > Which is exactly the hole you mean, right?
> 
> Yeah, but that hole is way older than dl.  Box falling into it is
> running SLE11, which is.. well, still somewhat resembles 3.0.

Oh sure, that patch just recognised there was a hole there and put a
comment in. We further failed to spot the failure you just found.

Right mess though, I think I already found some of those 'obvious'
reasons mentioned in the previous email.


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 12:06     ` Peter Zijlstra
@ 2015-05-28 12:32       ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2015-05-28 12:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, 2015-05-28 at 14:06 +0200, Peter Zijlstra wrote:
> On Thu, May 28, 2015 at 02:04:21PM +0200, Mike Galbraith wrote:
> > On Thu, 2015-05-28 at 13:51 +0200, Peter Zijlstra wrote:
> 
> > > Which is exactly the hole you mean, right?
> > 
> > Yeah, but that hole is way older than dl.  Box falling into it is
> > running SLE11, which is.. well, still somewhat resembles 3.0.
> 
> Oh sure, that patch just recognised there was a hole there and put a
> comment in. We further failed to spot the failure you just found.

Happens.  Even with a perfect trap set and triggered, evidence right
under my nose, I'm so used to that code that I didn't see the obvious
until I took a second look.. and had that 'you dumbass' moment ;-)

	-Mike



* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28  7:43 sched_setscheduler() vs idle_balance() race Mike Galbraith
  2015-05-28 11:51 ` Peter Zijlstra
@ 2015-05-28 13:53 ` Peter Zijlstra
  2015-05-28 14:54   ` Mike Galbraith
                     ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-05-28 13:53 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, May 28, 2015 at 09:43:52AM +0200, Mike Galbraith wrote:
> Hi Peter,
> 
> I'm not seeing what prevents pull_task() from yanking a task out from
> under __sched_setscheduler().  A box sprinkling smoldering 3.0 kernel
> wreckage all over my bugzilla mbox isn't seeing it either ;-)

Say, how easy can that thing be reproduced?

The below is compile tested only, but it might just work if I didn't
miss anything :-)


---
 kernel/sched/core.c | 137 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 77 insertions(+), 60 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4eec60757b16..28f1ddc0bef2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1000,22 +1000,6 @@ inline int task_curr(const struct task_struct *p)
 	return cpu_curr(task_cpu(p)) == p;
 }
 
-/*
- * Can drop rq->lock because from sched_class::switched_from() methods drop it.
- */
-static inline void check_class_changed(struct rq *rq, struct task_struct *p,
-				       const struct sched_class *prev_class,
-				       int oldprio)
-{
-	if (prev_class != p->sched_class) {
-		if (prev_class->switched_from)
-			prev_class->switched_from(rq, p);
-		/* Possble rq->lock 'hole'.  */
-		p->sched_class->switched_to(rq, p);
-	} else if (oldprio != p->prio || dl_task(p))
-		p->sched_class->prio_changed(rq, p, oldprio);
-}
-
 void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 {
 	const struct sched_class *class;
@@ -3075,12 +3059,38 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
 	p->prio = prio;
 
+	if (prev_class != p->sched_class) {
+		prev_class->switched_from(rq, p);
+		/*
+		 * switched_from() is allowed to drop @rq->lock; which opens a
+		 * race against load-balancing, however since @p is not
+		 * currently enqueued it is invisible to the load-balancer.
+		 *
+		 * double check @p is still where we thought it was.
+		 */
+		WARN_ON_ONCE(task_rq(p) != rq);
+	}
+
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, enqueue_flag);
 
-	check_class_changed(rq, p, prev_class, oldprio);
+	/*
+	 * Both switched_to() and prio_changed() are allowed to drop @rq->lock;
+	 * which opens a race against load-balancing, and since @p is now
+	 * enqueued it can indeed be subject to this.
+	 *
+	 * This means that any balancing done by these functions must double
+	 * check a task's rq.
+	 */
+	if (prev_class != p->sched_class)
+		p->sched_class->switched_to(rq, p);
+	else if (oldprio != p->prio || dl_task(p))
+		p->sched_class->prio_changed(rq, p, oldprio);
+	/*
+	 * It further means we should not rely on @p's rq from here on.
+	 */
 out_unlock:
 	__task_rq_unlock(rq);
 }
@@ -3420,7 +3430,7 @@ static bool dl_param_changed(struct task_struct *p,
 
 static int __sched_setscheduler(struct task_struct *p,
 				const struct sched_attr *attr,
-				bool user)
+				bool user, bool pi)
 {
 	int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
 		      MAX_RT_PRIO - 1 - attr->sched_priority;
@@ -3606,18 +3616,20 @@ static int __sched_setscheduler(struct task_struct *p,
 	p->sched_reset_on_fork = reset_on_fork;
 	oldprio = p->prio;
 
-	/*
-	 * Take priority boosted tasks into account. If the new
-	 * effective priority is unchanged, we just store the new
-	 * normal parameters and do not touch the scheduler class and
-	 * the runqueue. This will be done when the task deboost
-	 * itself.
-	 */
-	new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
-	if (new_effective_prio == oldprio) {
-		__setscheduler_params(p, attr);
-		task_rq_unlock(rq, p, &flags);
-		return 0;
+	if (pi) {
+		/*
+		 * Take priority boosted tasks into account. If the new
+		 * effective priority is unchanged, we just store the new
+		 * normal parameters and do not touch the scheduler class and
+		 * the runqueue. This will be done when the task deboost
+		 * itself.
+		 */
+		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
+		if (new_effective_prio == oldprio) {
+			__setscheduler_params(p, attr);
+			task_rq_unlock(rq, p, &flags);
+			return 0;
+		}
 	}
 
 	queued = task_on_rq_queued(p);
@@ -3628,7 +3640,19 @@ static int __sched_setscheduler(struct task_struct *p,
 		put_prev_task(rq, p);
 
 	prev_class = p->sched_class;
-	__setscheduler(rq, p, attr, true);
+	__setscheduler(rq, p, attr, pi);
+
+	if (prev_class != p->sched_class) {
+		prev_class->switched_from(rq, p);
+		/*
+		 * switched_from() is allowed to drop @rq->lock; which opens a
+		 * race against load-balancing, however since @p is not
+		 * currently enqueued it is invisible to the load-balancer.
+		 *
+		 * double check @p is still where we thought it was.
+		 */
+		WARN_ON_ONCE(task_rq(p) != rq);
+	}
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
@@ -3640,10 +3664,25 @@ static int __sched_setscheduler(struct task_struct *p,
 		enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);
 	}
 
-	check_class_changed(rq, p, prev_class, oldprio);
+	/*
+	 * Both switched_to() and prio_changed() are allowed to drop @rq->lock;
+	 * which opens a race against load-balancing, and since @p is now
+	 * enqueued it can indeed be subject to this.
+	 *
+	 * This means that any balancing done by these functions must double
+	 * check a task's rq.
+	 */
+	if (prev_class != p->sched_class)
+		p->sched_class->switched_to(rq, p);
+	else if (oldprio != p->prio || dl_task(p))
+		p->sched_class->prio_changed(rq, p, oldprio);
+	/*
+	 * It further means we should not rely on @p's rq from here on.
+	 */
 	task_rq_unlock(rq, p, &flags);
 
-	rt_mutex_adjust_pi(p);
+	if (pi)
+		rt_mutex_adjust_pi(p);
 
 	return 0;
 }
@@ -3664,7 +3703,7 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
 		attr.sched_policy = policy;
 	}
 
-	return __sched_setscheduler(p, &attr, check);
+	return __sched_setscheduler(p, &attr, check, true);
 }
 /**
  * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
@@ -3685,7 +3724,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 
 int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
 {
-	return __sched_setscheduler(p, attr, true);
+	return __sched_setscheduler(p, attr, true, true);
 }
 EXPORT_SYMBOL_GPL(sched_setattr);
 
@@ -7346,32 +7385,12 @@ EXPORT_SYMBOL(___might_sleep);
 #endif
 
 #ifdef CONFIG_MAGIC_SYSRQ
-static void normalize_task(struct rq *rq, struct task_struct *p)
+void normalize_rt_tasks(void)
 {
-	const struct sched_class *prev_class = p->sched_class;
+	struct task_struct *g, *p;
 	struct sched_attr attr = {
 		.sched_policy = SCHED_NORMAL,
 	};
-	int old_prio = p->prio;
-	int queued;
-
-	queued = task_on_rq_queued(p);
-	if (queued)
-		dequeue_task(rq, p, 0);
-	__setscheduler(rq, p, &attr, false);
-	if (queued) {
-		enqueue_task(rq, p, 0);
-		resched_curr(rq);
-	}
-
-	check_class_changed(rq, p, prev_class, old_prio);
-}
-
-void normalize_rt_tasks(void)
-{
-	struct task_struct *g, *p;
-	unsigned long flags;
-	struct rq *rq;
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, p) {
@@ -7398,9 +7417,7 @@ void normalize_rt_tasks(void)
 			continue;
 		}
 
-		rq = task_rq_lock(p, &flags);
-		normalize_task(rq, p);
-		task_rq_unlock(rq, p, &flags);
+		__sched_setscheduler(p, &attr, false, false);
 	}
 	read_unlock(&tasklist_lock);
 }


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 13:53 ` Peter Zijlstra
@ 2015-05-28 14:54   ` Mike Galbraith
  2015-05-28 15:24     ` Peter Zijlstra
  2015-05-28 16:59   ` Kirill Tkhai
  2015-05-30 13:08   ` Mike Galbraith
  2 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2015-05-28 14:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, 2015-05-28 at 15:53 +0200, Peter Zijlstra wrote:

> Say, how easy can that thing be reproduced?

It doesn't seem to take the reporter very long to blow their box up.
What they're doing must be pretty darn uncommon though.

I have the source to a test application, no destructions to go with it
though, and little time to play around.  It has to be installed and
whatnot, it's not the desired make-it-go-boom.c.

> The below is compile tested only, but it might just work if I didn't
> miss anything :-)

I'll take it for a spin, and take a peek at the application.

	-Mike



* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 14:54   ` Mike Galbraith
@ 2015-05-28 15:24     ` Peter Zijlstra
  2015-05-29 18:30       ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-05-28 15:24 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, May 28, 2015 at 04:54:26PM +0200, Mike Galbraith wrote:

> > The below is compile tested only, but it might just work if I didn't
> > miss anything :-)
> 
> I'll take it for a spin, and take a peek at the application.

Thanks!


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 13:53 ` Peter Zijlstra
  2015-05-28 14:54   ` Mike Galbraith
@ 2015-05-28 16:59   ` Kirill Tkhai
  2015-05-30 13:08   ` Mike Galbraith
  2 siblings, 0 replies; 18+ messages in thread
From: Kirill Tkhai @ 2015-05-28 16:59 UTC (permalink / raw)
  To: Peter Zijlstra, Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, 28/05/2015 at 15:53 +0200, Peter Zijlstra wrote:
> On Thu, May 28, 2015 at 09:43:52AM +0200, Mike Galbraith wrote:
> > Hi Peter,
> > 
> > I'm not seeing what prevents pull_task() from yanking a task out from
> > under __sched_setscheduler().  A box sprinkling smoldering 3.0 kernel
> > wreckage all over my bugzilla mbox isn't seeing it either ;-)
> 
> Say, how easy can that thing be reproduced?
> 
> The below is compile tested only, but it might just work if I didn't
> miss anything :-)
> 
> 
> ---
>  kernel/sched/core.c | 137 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 77 insertions(+), 60 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4eec60757b16..28f1ddc0bef2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1000,22 +1000,6 @@ inline int task_curr(const struct task_struct *p)
>  	return cpu_curr(task_cpu(p)) == p;
>  }
>  
> -/*
> - * Can drop rq->lock because from sched_class::switched_from() methods drop it.
> - */
> -static inline void check_class_changed(struct rq *rq, struct task_struct *p,
> -				       const struct sched_class *prev_class,
> -				       int oldprio)
> -{
> -	if (prev_class != p->sched_class) {
> -		if (prev_class->switched_from)
> -			prev_class->switched_from(rq, p);
> -		/* Possble rq->lock 'hole'.  */
> -		p->sched_class->switched_to(rq, p);
> -	} else if (oldprio != p->prio || dl_task(p))
> -		p->sched_class->prio_changed(rq, p, oldprio);
> -}
> -
>  void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	const struct sched_class *class;
> @@ -3075,12 +3059,38 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
>  
>  	p->prio = prio;
>  
> +	if (prev_class != p->sched_class) {
> +		prev_class->switched_from(rq, p);
>
 
switched_from_XXX() doesn't depend on on_YYY_rq state, so, yeah, this should be safe.

> +		/*
> +		 * switched_from() is allowed to drop @rq->lock; which opens a
> +		 * race against load-balancing, however since @p is not
> +		 * currently enqueued it is invisible to the load-balancer.
> +		 *
> +		 * double check @p is still where we thought it was.
> +		 */
> +		WARN_ON_ONCE(task_rq(p) != rq);
> +	}
> +
>  	if (running)
>  		p->sched_class->set_curr_task(rq);
>  	if (queued)
>  		enqueue_task(rq, p, enqueue_flag);
>  
> -	check_class_changed(rq, p, prev_class, oldprio);
> +	/*
> +	 * Both switched_to() and prio_changed() are allowed to drop @rq->lock;
> +	 * which opens a race against load-balancing, and since @p is now
> +	 * enqueued it can indeed be subject to this.
> +	 *
> +	 * This means that any balancing done by these functions must double
> +	 * check a task's rq.
> +	 */
> +	if (prev_class != p->sched_class)
> +		p->sched_class->switched_to(rq, p);
> +	else if (oldprio != p->prio || dl_task(p))
> +		p->sched_class->prio_changed(rq, p, oldprio);
> +	/*
> +	 * It further means we should not rely on @p's rq from here on.
> +	 */
>  out_unlock:
>  	__task_rq_unlock(rq);
>  }
> @@ -3420,7 +3430,7 @@ static bool dl_param_changed(struct task_struct *p,
>  
>  static int __sched_setscheduler(struct task_struct *p,
>  				const struct sched_attr *attr,
> -				bool user)
> +				bool user, bool pi)
>  {
>  	int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
>  		      MAX_RT_PRIO - 1 - attr->sched_priority;
> @@ -3606,18 +3616,20 @@ static int __sched_setscheduler(struct task_struct *p,
>  	p->sched_reset_on_fork = reset_on_fork;
>  	oldprio = p->prio;
>  
> -	/*
> -	 * Take priority boosted tasks into account. If the new
> -	 * effective priority is unchanged, we just store the new
> -	 * normal parameters and do not touch the scheduler class and
> -	 * the runqueue. This will be done when the task deboost
> -	 * itself.
> -	 */
> -	new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
> -	if (new_effective_prio == oldprio) {
> -		__setscheduler_params(p, attr);
> -		task_rq_unlock(rq, p, &flags);
> -		return 0;
> +	if (pi) {
> +		/*
> +		 * Take priority boosted tasks into account. If the new
> +		 * effective priority is unchanged, we just store the new
> +		 * normal parameters and do not touch the scheduler class and
> +		 * the runqueue. This will be done when the task deboost
> +		 * itself.
> +		 */
> +		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
> +		if (new_effective_prio == oldprio) {
> +			__setscheduler_params(p, attr);
> +			task_rq_unlock(rq, p, &flags);
> +			return 0;
> +		}
>  	}
>  
>  	queued = task_on_rq_queued(p);
> @@ -3628,7 +3640,19 @@ static int __sched_setscheduler(struct task_struct *p,
>  		put_prev_task(rq, p);
>  
>  	prev_class = p->sched_class;
> -	__setscheduler(rq, p, attr, true);
> +	__setscheduler(rq, p, attr, pi);
> +
> +	if (prev_class != p->sched_class) {
> +		prev_class->switched_from(rq, p);
> +		/*
> +		 * switched_from() is allowed to drop @rq->lock; which opens a
> +		 * race against load-balancing, however since @p is not
> +		 * currently enqueued it is invisible to the load-balancer.
> +		 *
> +		 * double check @p is still where we thought it was.
> +		 */
> +		WARN_ON_ONCE(task_rq(p) != rq);
> +	}
>  
>  	if (running)
>  		p->sched_class->set_curr_task(rq);
> @@ -3640,10 +3664,25 @@ static int __sched_setscheduler(struct task_struct *p,
>  		enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);
>  	}
>  
> -	check_class_changed(rq, p, prev_class, oldprio);
> +	/*
> +	 * Both switched_to() and prio_changed() are allowed to drop @rq->lock;
> +	 * which opens a race against load-balancing, and since @p is now
> +	 * enqueued it can indeed be subject to this.
> +	 *
> +	 * This means that any balancing done by these functions must double
> +	 * check a task's rq.
> +	 */
> +	if (prev_class != p->sched_class)
> +		p->sched_class->switched_to(rq, p);
> +	else if (oldprio != p->prio || dl_task(p))
> +		p->sched_class->prio_changed(rq, p, oldprio);
> +	/*
> +	 * It further means we should not rely on @p's rq from here on.
> +	 */
>  	task_rq_unlock(rq, p, &flags);
>  
> -	rt_mutex_adjust_pi(p);
> +	if (pi)
> +		rt_mutex_adjust_pi(p);
>  
>  	return 0;
>  }
> @@ -3664,7 +3703,7 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
>  		attr.sched_policy = policy;
>  	}
>  
> -	return __sched_setscheduler(p, &attr, check);
> +	return __sched_setscheduler(p, &attr, check, true);
>  }
>  /**
>   * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
> @@ -3685,7 +3724,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
>  
>  int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
>  {
> -	return __sched_setscheduler(p, attr, true);
> +	return __sched_setscheduler(p, attr, true, true);
>  }
>  EXPORT_SYMBOL_GPL(sched_setattr);
>  
> @@ -7346,32 +7385,12 @@ EXPORT_SYMBOL(___might_sleep);
>  #endif
>  
>  #ifdef CONFIG_MAGIC_SYSRQ
> -static void normalize_task(struct rq *rq, struct task_struct *p)
> +void normalize_rt_tasks(void)
>  {
> -	const struct sched_class *prev_class = p->sched_class;
> +	struct task_struct *g, *p;
>  	struct sched_attr attr = {
>  		.sched_policy = SCHED_NORMAL,
>  	};
> -	int old_prio = p->prio;
> -	int queued;
> -
> -	queued = task_on_rq_queued(p);
> -	if (queued)
> -		dequeue_task(rq, p, 0);
> -	__setscheduler(rq, p, &attr, false);
> -	if (queued) {
> -		enqueue_task(rq, p, 0);
> -		resched_curr(rq);
> -	}
> -
> -	check_class_changed(rq, p, prev_class, old_prio);
> -}
> -
> -void normalize_rt_tasks(void)
> -{
> -	struct task_struct *g, *p;
> -	unsigned long flags;
> -	struct rq *rq;
>  
>  	read_lock(&tasklist_lock);
>  	for_each_process_thread(g, p) {
> @@ -7398,9 +7417,7 @@ void normalize_rt_tasks(void)
>  			continue;
>  		}
>  
> -		rq = task_rq_lock(p, &flags);
> -		normalize_task(rq, p);
> -		task_rq_unlock(rq, p, &flags);
> +		__sched_setscheduler(p, &attr, false, false);
>  	}
>  	read_unlock(&tasklist_lock);
>  }
> 


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 15:24     ` Peter Zijlstra
@ 2015-05-29 18:30       ` Mike Galbraith
  2015-05-29 18:48         ` Mike Galbraith
  2015-06-01  8:16         ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Galbraith @ 2015-05-29 18:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, 2015-05-28 at 17:24 +0200, Peter Zijlstra wrote:
> On Thu, May 28, 2015 at 04:54:26PM +0200, Mike Galbraith wrote:
> 
> > > The below is compile tested only, but it might just work if I didn't
> > > miss anything :-)
> > 
> > I'll take it for a spin, and take a peek at the application.
> 
> Thanks!

It took quite a bit longer than I thought it would, but I finally
managed to cobble a standalone testcase together that brings nearly
instant gratification on my 8 socket DL980.  Patched kernel explodes, so
first cut ain't quite ready to ship ;-)

I applied the 'say no to migration if ->pi_lock is held' patchlet, and the
otherwise toxic testcase was rendered harmless, so it seems there is a hole
in the patch.

Here's the burp, I haven't rummaged around at all yet.

[  286.105446] ------------[ cut here ]------------
[  286.151163] kernel BUG at kernel/sched/rt.c:986!
[  286.203404] invalid opcode: 0000 [#1] SMP 
[  286.249093] Dumping ftrace buffer:
[  286.288337]    (ftrace buffer empty)
[  286.328403] Modules linked in: edd af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse loop md_mod dm_mod iTCO_wdt gpio_ich iTCO_vendor_support ipmi_ssif joydev i7core_edac ipmi_si lpc_ich hpilo hid_generic netxen_nic hpwdt shpchp sr_mod ehci_pci mfd_core pcspkr bnx2 edac_core ipmi_msghandler cdrom sg pcc_cpufreq 8250_fintek acpi_cpufreq acpi_power_meter button usbhid uhci_hcd ehci_hcd usbcore thermal usb_common processor scsi_dh_hp_sw scsi_dh_emc scsi_dh_rdac scsi_dh_alua scsi_dh ata_generic ata_piix hpsa cciss
[  286.855938] CPU: 3 PID: 6893 Comm: massive_intr_x Not tainted 4.1.0-default #2
[  286.933673] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[  287.009379] task: ffff8802717bc4d0 ti: ffff8802715b4000 task.ti: ffff8802715b4000
[  287.089247] RIP: 0010:[<ffffffff810a75d4>]  [<ffffffff810a75d4>] dequeue_top_rt_rq+0x44/0x50
[  287.184723] RSP: 0018:ffff8802715b7d98  EFLAGS: 00010046
[  287.244782] RAX: ffff880277316480 RBX: ffff88007a4ba788 RCX: 00000000000025c7
[  287.326088] RDX: 0000000000000000 RSI: ffff88007a4ba590 RDI: ffff880277316618
[  287.407138] RBP: ffff8802715b7d98 R08: ffffffff81c3ff00 R09: 0000000000001aed
[  287.487730] R10: ffff88007a4ba590 R11: 0000000000000001 R12: ffff880277316480
[  287.568328] R13: ffff880277316c90 R14: ffff8802715b7ed8 R15: ffff88007a4ba590
[  287.649732] FS:  00007efc0515c700(0000) GS:ffff8802766c0000(0000) knlGS:0000000000000000
[  287.741131] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  287.805467] CR2: ffffffffff600400 CR3: 000000026f6a1000 CR4: 00000000000007e0
[  287.889403] Stack:
[  287.912171]  ffff8802715b7dc8 ffffffff810a81fc ffff88007a4ba788 ffff880277316480
[  287.995061]  ffff880277316c90 ffff8802715b7ed8 ffff8802715b7de8 ffffffff810a909f
[  288.079516]  ffff880277316480 ffff88007a4ba590 ffff8802715b7e18 ffffffff810a9691
[  288.163784] Call Trace:
[  288.193691]  [<ffffffff810a81fc>] dequeue_rt_stack+0x3c/0x350
[  288.260484]  [<ffffffff810a909f>] dequeue_rt_entity+0x1f/0x80
[  288.330554]  [<ffffffff810a9691>] dequeue_task_rt+0x31/0x80
[  288.395212]  [<ffffffff8108e16c>] dequeue_task+0x5c/0x80
[  288.472481]  [<ffffffff81091ef5>] __sched_setscheduler+0x635/0xa50
[  288.547063]  [<ffffffff81092378>] _sched_setscheduler+0x68/0x70
[  288.613281]  [<ffffffff81092401>] do_sched_setscheduler+0x61/0xa0
[  288.681984]  [<ffffffff81094f82>] SyS_sched_setscheduler+0x12/0x30
[  288.750797]  [<ffffffff81669cb2>] system_call_fastpath+0x16/0x75
[  288.819013] Code: d7 75 26 8b 97 ac 06 00 00 85 d2 74 1a 8b 50 04 85 d2 74 17 2b 97 50 06 00 00 89 50 04 c7 87 ac 06 00 00 00 00 00 00 5d c3 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 
[  289.037988] RIP  [<ffffffff810a75d4>] dequeue_top_rt_rq+0x44/0x50
[  289.100594]  RSP <ffff8802715b7d98>




* Re: sched_setscheduler() vs idle_balance() race
  2015-05-29 18:30       ` Mike Galbraith
@ 2015-05-29 18:48         ` Mike Galbraith
  2015-06-01  8:14           ` Peter Zijlstra
  2015-06-01  8:16         ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2015-05-29 18:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai


P.S. intel_idle is not all that wonderful on this box.

-   78.31%  [kernel]       [k] _raw_spin_lock
   - _raw_spin_lock
      - 94.91% tick_broadcast_oneshot_control
           intel_idle
           cpuidle_enter_state
           cpuidle_enter
         + cpu_startup_entry
      + 1.63% pull_rt_task
      + 1.35% __sched_setscheduler
      + 1.11% push_rt_task.part.46
+    2.44%  [kernel]       [k] native_write_msr_safe





* Re: sched_setscheduler() vs idle_balance() race
  2015-05-28 13:53 ` Peter Zijlstra
  2015-05-28 14:54   ` Mike Galbraith
  2015-05-28 16:59   ` Kirill Tkhai
@ 2015-05-30 13:08   ` Mike Galbraith
  2015-05-31  6:39     ` Mike Galbraith
  2015-06-01  8:19     ` Peter Zijlstra
  2 siblings, 2 replies; 18+ messages in thread
From: Mike Galbraith @ 2015-05-30 13:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai

On Thu, 2015-05-28 at 15:53 +0200, Peter Zijlstra wrote:
> On Thu, May 28, 2015 at 09:43:52AM +0200, Mike Galbraith wrote:
> > Hi Peter,
> > 
> > I'm not seeing what prevents pull_task() from yanking a task out from
> > under __sched_setscheduler().  A box sprinkling smoldering 3.0 kernel
> > wreckage all over my bugzilla mbox isn't seeing it either ;-)
> 
> Say, how easy can that thing be reproduced?
> 
> The below is compile tested only, but it might just work if I didn't
> miss anything :-)

Seems trying to make the target invisible to balancing created a new
race: dequeue target, do stuff that may drop rq->lock while it's
dequeued, target sneaks into schedule(), dequeues itself (#2), boom.
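
Roughly, as I read it:

  __sched_setscheduler(p)                    p's CPU: p is ->curr, going to sleep
    lock p's rq
    dequeue_task(p)
    switched_from_rt(p)
      pull_rt_task()
        double_lock_balance()  <-- drops p's rq->lock
                                               __schedule()
                                                 deactivate_task(p)  <-- dequeue #2
                                                 nr_running underflows (0xffffffff below)
                                                 pick_next_task_fair() -> boom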

On my desktop box I meet..

crash> bt
PID: 6281   TASK: ffff880401950000  CPU: 5   COMMAND: "massive_intr_x"
 #0 [ffff8800da9d79c0] machine_kexec at ffffffff8103c428
 #1 [ffff8800da9d7a20] crash_kexec at ffffffff810c98e5
 #2 [ffff8800da9d7af0] oops_end at ffffffff81006418
 #3 [ffff8800da9d7b20] no_context at ffffffff815b4296
 #4 [ffff8800da9d7b80] __bad_area_nosemaphore at ffffffff815b4353
 #5 [ffff8800da9d7bd0] bad_area at ffffffff815b4691
 #6 [ffff8800da9d7c00] __do_page_fault at ffffffff81044eba
 #7 [ffff8800da9d7c70] do_page_fault at ffffffff8104500c
 #8 [ffff8800da9d7c80] page_fault at ffffffff815c10b2
    [exception RIP: set_next_entity+28]
    RIP: ffffffff81080bac  RSP: ffff8800da9d7d38  RFLAGS: 00010092
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 00000000044aa200
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff88041ed55968
    RBP: ffff8800da9d7d78   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: ffff88041ed55968
    R13: 0000000000015900  R14: 0000000000000005  R15: ffff88041ed55900
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8800da9d7d80] pick_next_task_fair at ffffffff81084071
#10 [ffff8800da9d7df0] __schedule at ffffffff815bb507
#11 [ffff8800da9d7e40] schedule at ffffffff815bbcb7
#12 [ffff8800da9d7e60] do_nanosleep at ffffffff815be615
#13 [ffff8800da9d7ea0] hrtimer_nanosleep at ffffffff810acc96
#14 [ffff8800da9d7f20] sys_nanosleep at ffffffff810acdb6
#15 [ffff8800da9d7f50] system_call_fastpath at ffffffff815bf61b
    RIP: 00007fbde0eb3130  RSP: 00007ffcb4a0a7e8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffff8800da9d7f90  RCX: 00007fbde0eb3130
    RDX: 0000000055695905  RSI: 0000000000000000  RDI: 00007ffcb4a0a800
    RBP: 00007ffcb4a0a7e0   R8: 0000000000000000   R9: 00007ffcb4a0a740
    R10: 00007ffcb4a0a5b0  R11: 0000000000000246  R12: 0000000000000010
    R13: ffffffff815bf5f2  R14: ffffffffffffff10  R15: 00007ffcb4a0a800
    ORIG_RAX: 0000000000000023  CS: 0033  SS: 002b
crash> struct -x rq ffff88041ed55900
struct rq {
  lock = {
    raw_lock = {
      {
        head_tail = 0xb0ae, 
        tickets = {
          head = 0xae, 
          tail = 0xb0
        }
      }
    }
  }, 
  nr_running = 0xffffffff, 
  cpu_load = {0x3ff, 0x3ea, 0x35a, 0x29b, 0x225}, 
  last_load_update_tick = 0xffffa827, 
  nohz_stamp = 0x0, 
  nohz_flags = 0x0, 
  load = {
    weight = 0xfffffffffffffc00, 
    inv_weight = 0x0
  }, 
  nr_load_updates = 0xb644, 
  nr_switches = 0x5b5b0, 
  cfs = {
    load = {
      weight = 0xfffffffffffffc00, 
      inv_weight = 0x0
    }, 
    nr_running = 0xffffffff,
    h_nr_running = 0xffffffff, 
    exec_clock = 0xc9070a1c3, 
    min_vruntime = 0x70cee14b9, 
    tasks_timeline = {
      rb_node = 0x0
    }, 
    rb_leftmost = 0x0, 
    curr = 0x0, 
    next = 0x0, 
    last = 0x0, 
    skip = 0x0,




* Re: sched_setscheduler() vs idle_balance() race
  2015-05-30 13:08   ` Mike Galbraith
@ 2015-05-31  6:39     ` Mike Galbraith
  2015-06-01  8:24       ` Peter Zijlstra
  2015-06-01  8:19     ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2015-05-31  6:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai

On Sat, 2015-05-30 at 15:08 +0200, Mike Galbraith wrote:
> On Thu, 2015-05-28 at 15:53 +0200, Peter Zijlstra wrote:
> > On Thu, May 28, 2015 at 09:43:52AM +0200, Mike Galbraith wrote:
> > > Hi Peter,
> > > 
> > > I'm not seeing what prevents pull_task() from yanking a task out from
> > > under __sched_setscheduler().  A box sprinkling smoldering 3.0 kernel
> > > wreckage all over my bugzilla mbox isn't seeing it either ;-)
> > 
> > Say, how easy can that thing be reproduced?
> > 
> > The below is compile tested only, but it might just work if I didn't
> > miss anything :-)
> 
> Seems trying to make the target invisible to balancing created a new
> race: dequeue target, do stuff that may drop rq->lock while it's
> dequeued, target sneaks into schedule(), dequeues itself (#2), boom.

Well, the below (ick) plugged it up, but...

I don't see why we can't just say no in can_migrate_task() if ->pi_lock
is held.  It plugged the original hole in a lot fewer lines.
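
Something along these lines, sketch only (not the actual patchlet):

	/* can_migrate_task(): leave p alone while someone is busy
	 * fiddling with its scheduling parameters under ->pi_lock. */
	if (raw_spin_is_locked(&p->pi_lock))
		return 0;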

Hohum, time to go pretend I have something better to do on a sunny
Sunday morning ;-)

  massive_intr_x-6213  [007] d...   170.579339: pull_rt_task: yup, pulled
  massive_intr_x-6213  [002] d...   170.580114: pull_rt_task: yup, pulled
  massive_intr_x-6213  [006] d...   170.586083: pull_rt_task: yup, pulled
           <...>-6237  [006] d...   170.593878: __schedule: saving the day

---
 kernel/sched/core.c     |   43 +++++++++++++++++++++++++++++++++++--------
 kernel/sched/deadline.c |    6 +++---
 kernel/sched/rt.c       |   11 +++++++++--
 kernel/sched/sched.h    |   10 +++++++++-
 4 files changed, 56 insertions(+), 14 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2745,9 +2745,18 @@ static void __sched __schedule(void)
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
 	 * done by the caller to avoid the race with signal_wake_up().
 	 */
+dequeued:
 	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
+	if (unlikely(!task_on_rq_queued(prev))) {
+		trace_printk("saving the day\n");
+		tracing_off();
+		raw_spin_unlock_irq(&rq->lock);
+		cpu_relax();
+		goto dequeued;
+	}
+
 	rq->clock_skip_update <<= 1; /* promote REQ to ACT */
 
 	switch_count = &prev->nivcsw;
@@ -3013,8 +3022,10 @@ void rt_mutex_setprio(struct task_struct
 	prev_class = p->sched_class;
 	queued = task_on_rq_queued(p);
 	running = task_current(rq, p);
-	if (queued)
+	if (queued) {
 		dequeue_task(rq, p, 0);
+		p->on_rq = TASK_ON_RQ_DEQUEUED;
+	}
 	if (running)
 		put_prev_task(rq, p);
 
@@ -3067,8 +3078,10 @@ void rt_mutex_setprio(struct task_struct
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
-	if (queued)
+	if (queued) {
 		enqueue_task(rq, p, enqueue_flag);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+	}
 
 	/*
 	 * Both switched_to() and prio_changed() are allowed to drop @rq->lock;
@@ -3114,8 +3127,10 @@ void set_user_nice(struct task_struct *p
 		goto out_unlock;
 	}
 	queued = task_on_rq_queued(p);
-	if (queued)
+	if (queued) {
 		dequeue_task(rq, p, 0);
+		p->on_rq = TASK_ON_RQ_DEQUEUED;
+	}
 
 	p->static_prio = NICE_TO_PRIO(nice);
 	set_load_weight(p);
@@ -3125,6 +3140,7 @@ void set_user_nice(struct task_struct *p
 
 	if (queued) {
 		enqueue_task(rq, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
@@ -3628,8 +3644,10 @@ static int __sched_setscheduler(struct t
 
 	queued = task_on_rq_queued(p);
 	running = task_current(rq, p);
-	if (queued)
+	if (queued) {
 		dequeue_task(rq, p, 0);
+		p->on_rq = TASK_ON_RQ_DEQUEUED;
+	}
 	if (running)
 		put_prev_task(rq, p);
 
@@ -3656,6 +3674,7 @@ static int __sched_setscheduler(struct t
 		 * increased (user space view).
 		 */
 		enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
 	}
 
 	/*
@@ -4943,8 +4962,10 @@ void sched_setnuma(struct task_struct *p
 	queued = task_on_rq_queued(p);
 	running = task_current(rq, p);
 
-	if (queued)
+	if (queued) {
 		dequeue_task(rq, p, 0);
+		p->on_rq = TASK_ON_RQ_DEQUEUED;
+	}
 	if (running)
 		put_prev_task(rq, p);
 
@@ -4952,8 +4973,10 @@ void sched_setnuma(struct task_struct *p
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
-	if (queued)
+	if (queued) {
 		enqueue_task(rq, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+	}
 	task_rq_unlock(rq, p, &flags);
 }
 #endif
@@ -7587,8 +7610,10 @@ void sched_move_task(struct task_struct
 	running = task_current(rq, tsk);
 	queued = task_on_rq_queued(tsk);
 
-	if (queued)
+	if (queued) {
 		dequeue_task(rq, tsk, 0);
+		tsk->on_rq = TASK_ON_RQ_DEQUEUED;
+	}
 	if (unlikely(running))
 		put_prev_task(rq, tsk);
 
@@ -7611,8 +7636,10 @@ void sched_move_task(struct task_struct
 
 	if (unlikely(running))
 		tsk->sched_class->set_curr_task(rq);
-	if (queued)
+	if (queued) {
 		enqueue_task(rq, tsk, 0);
+		tsk->on_rq = TASK_ON_RQ_QUEUED;
+	}
 
 	task_rq_unlock(rq, tsk, &flags);
 }
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -607,7 +607,7 @@ static enum hrtimer_restart dl_task_time
 	 * We can be both throttled and !queued. Replenish the counter
 	 * but do not enqueue -- wait for our wakeup to do that.
 	 */
-	if (!task_on_rq_queued(p)) {
+	if (!task_on_rq_queued_or_dequeued(p)) {
 		replenish_dl_entity(dl_se, dl_se);
 		goto unlock;
 	}
@@ -1526,7 +1526,7 @@ static int pull_dl_task(struct rq *this_
 		     dl_time_before(p->dl.deadline,
 				    this_rq->dl.earliest_dl.curr))) {
 			WARN_ON(p == src_rq->curr);
-			WARN_ON(!task_on_rq_queued(p));
+			WARN_ON(!task_on_rq_queued_or_dequeued(p));
 
 			/*
 			 * Then we pull iff p has actually an earlier
@@ -1707,7 +1707,7 @@ static void switched_from_dl(struct rq *
 	 * this is the right place to try to pull some other one
 	 * from an overloaded cpu, if any.
 	 */
-	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+	if (!task_on_rq_queued_or_dequeued(p) || rq->dl.dl_nr_running)
 		return;
 
 	if (pull_dl_task(rq))
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1656,7 +1656,7 @@ static struct rq *find_lock_lowest_rq(st
 				     !cpumask_test_cpu(lowest_rq->cpu,
 						       tsk_cpus_allowed(task)) ||
 				     task_running(rq, task) ||
-				     !task_on_rq_queued(task))) {
+				     !task_on_rq_queued_or_dequeued(task))) {
 
 				double_unlock_balance(rq, lowest_rq);
 				lowest_rq = NULL;
@@ -1953,10 +1953,14 @@ static int pull_rt_task(struct rq *this_
 	int this_cpu = this_rq->cpu, ret = 0, cpu;
 	struct task_struct *p;
 	struct rq *src_rq;
+	int task_dequeued = 0;
 
 	if (likely(!rt_overloaded(this_rq)))
 		return 0;
 
+	if (this_rq->curr->on_rq == TASK_ON_RQ_DEQUEUED)
+		task_dequeued = 1;
+
 	/*
 	 * Match the barrier from rt_set_overloaded; this guarantees that if we
 	 * see overloaded we must also see the rto_mask bit.
@@ -2035,6 +2039,9 @@ static int pull_rt_task(struct rq *this_
 		double_unlock_balance(this_rq, src_rq);
 	}
 
+	if (ret && task_dequeued)
+		trace_printk("yup, pulled\n");
+
 	return ret;
 }
 
@@ -2133,7 +2140,7 @@ static void switched_from_rt(struct rq *
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued_or_dequeued(p) || rq->rt.rt_nr_running)
 		return;
 
 	if (pull_rt_task(rq))
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -19,7 +19,8 @@ struct cpuidle_state;
 
 /* task_struct::on_rq states: */
 #define TASK_ON_RQ_QUEUED	1
-#define TASK_ON_RQ_MIGRATING	2
+#define TASK_ON_RQ_DEQUEUED	2
+#define TASK_ON_RQ_MIGRATING	3
 
 extern __read_mostly int scheduler_running;
 
@@ -1034,6 +1035,13 @@ static inline int task_on_rq_queued(stru
 	return p->on_rq == TASK_ON_RQ_QUEUED;
 }
 
+static inline int task_on_rq_queued_or_dequeued(struct task_struct *p)
+{
+	if (p->on_rq == TASK_ON_RQ_QUEUED)
+		return 1;
+	return p->on_rq == TASK_ON_RQ_DEQUEUED;
+}
+
 static inline int task_on_rq_migrating(struct task_struct *p)
 {
 	return p->on_rq == TASK_ON_RQ_MIGRATING;




* Re: sched_setscheduler() vs idle_balance() race
  2015-05-29 18:48         ` Mike Galbraith
@ 2015-06-01  8:14           ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-06-01  8:14 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Fri, May 29, 2015 at 08:48:56PM +0200, Mike Galbraith wrote:
> 
> P.S. intel_idle is not all that wonderful on this box.
> 
> -   78.31%  [kernel]       [k] _raw_spin_lock
>    - _raw_spin_lock
>       - 94.91% tick_broadcast_oneshot_control

Your DL980 G7 has E7-4800 parts in it, right?  Which, if Wikipedia is
correct, resolves to a Nehalem-EX.

Now the NHM-EX has a fun 'feature' in that for (some?) idle states the
local timer stops, so we have to fall back to a global broadcast
timer.

Now go count the number of cpus on your box and then imagine a global
spinlock, oh wait, you already found it ^ :-)


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-29 18:30       ` Mike Galbraith
  2015-05-29 18:48         ` Mike Galbraith
@ 2015-06-01  8:16         ` Peter Zijlstra
  2015-06-01 10:00           ` Mike Galbraith
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-06-01  8:16 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Fri, May 29, 2015 at 08:30:27PM +0200, Mike Galbraith wrote:

> It took quite a bit longer than I thought it would, but I finally
> managed to cobble a standalone testcase together that brings nearly
> instant gratification on my 8 socket DL980.  Patched kernel explodes, so
> first cut ain't quite ready to ship ;-)

Could you perchance share this testcase?


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-30 13:08   ` Mike Galbraith
  2015-05-31  6:39     ` Mike Galbraith
@ 2015-06-01  8:19     ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-06-01  8:19 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Sat, May 30, 2015 at 03:08:26PM +0200, Mike Galbraith wrote:

> Seems trying to make the target invisible to balancing created a new
> race: dequeue target, do stuff that may drop rq->lock while it's
> dequeued, target sneaks into schedule(), dequeues itself (#2), boom.

Aw god yes, duh.

Fun little puzzle this. Lemme go think a bit.


* Re: sched_setscheduler() vs idle_balance() race
  2015-05-31  6:39     ` Mike Galbraith
@ 2015-06-01  8:24       ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-06-01  8:24 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Ingo Molnar, ktkhai

On Sun, May 31, 2015 at 08:39:04AM +0200, Mike Galbraith wrote:
> I don't see why we can't just say no in can_migrate_task() if ->pi_lock
> is held.

I suppose we could do that; what I really want to avoid is also
requiring pi_lock for scheduling.

The down-side of looking at pi_lock for migration is that there is no
common point for migrating tasks; it's all inside the classes, so we'd
get to sprinkle it all over the place.


* Re: sched_setscheduler() vs idle_balance() race
  2015-06-01  8:16         ` Peter Zijlstra
@ 2015-06-01 10:00           ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2015-06-01 10:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, ktkhai

On Mon, 2015-06-01 at 10:16 +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2015 at 08:30:27PM +0200, Mike Galbraith wrote:
> 
> > It took quite a bit longer than I thought it would, but I finally
> > managed to cobble a standalone testcase together that brings nearly
> > instant gratification on my 8 socket DL980.  Patched kernel explodes, so
> > first cut ain't quite ready to ship ;-)
> 
> Could you perchance share this testcase?

Sure, I'll send it off list.  I'd just post it, but the duct tape might
fall off somewhere in transit, and land god knows where ;-)

	-Mike



