linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/fair: fix load_balance redo for null imbalance
@ 2018-09-07  7:51 Vincent Guittot
  2018-09-07 11:37 ` Peter Zijlstra
  2018-09-10 10:08 ` [tip:sched/core] sched/fair: Fix load_balance redo for !imbalance tip-bot for Vincent Guittot
  0 siblings, 2 replies; 8+ messages in thread
From: Vincent Guittot @ 2018-09-07  7:51 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: dietmar.eggemann, jhugo, Vincent Guittot

It can happen that load_balance() finds a busiest group and then a busiest rq
but the calculated imbalance is in fact zero.

In such a situation, detach_tasks() returns immediately and leaves the
LBF_ALL_PINNED flag set. The busiest CPU is then wrongly assumed to have
pinned tasks and is removed from the load balance mask. We then redo a load
balance without the busiest CPU. This creates a wrong load balance situation
and generates spurious task migrations.

If the calculated imbalance is zero, it's useless to try to find a busiest
rq, as no task will be migrated and we can return immediately.

This situation can happen with heterogeneous systems or SMP systems where RT
tasks reduce the available capacity of some CPUs.
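
To make the described interaction concrete, here is a toy user-space model of
the control flow (illustrative only: it borrows the kernel's names, but it is
not the fair.c code):

/* Toy model: with a zero imbalance, detach_tasks() bails out before it
 * ever gets a chance to clear LBF_ALL_PINNED, so the caller wrongly
 * concludes that the busiest CPU only has pinned tasks. */
#include <stdio.h>

#define LBF_ALL_PINNED 0x01

struct lb_env {
	unsigned int flags;
	long imbalance;
};

static int detach_tasks(struct lb_env *env)
{
	if (env->imbalance <= 0)	/* nothing to pull: early return */
		return 0;
	env->flags &= ~LBF_ALL_PINNED;	/* found at least one movable task */
	return 1;
}

int main(void)
{
	struct lb_env env = { .flags = LBF_ALL_PINNED, .imbalance = 0 };

	detach_tasks(&env);
	if (env.flags & LBF_ALL_PINNED)
		printf("busiest CPU wrongly masked out; load balance redone without it\n");
	return 0;
}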

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..224bfae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8464,7 +8464,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	}
 
 	group = find_busiest_group(&env);
-	if (!group) {
+	if (!group || !env.imbalance) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
 		goto out_balanced;
 	}
-- 
2.7.4



* Re: [PATCH] sched/fair: fix load_balance redo for null imbalance
  2018-09-07  7:51 [PATCH] sched/fair: fix load_balance redo for null imbalance Vincent Guittot
@ 2018-09-07 11:37 ` Peter Zijlstra
  2018-09-07 12:35   ` Vincent Guittot
  2018-09-10 10:08 ` [tip:sched/core] sched/fair: Fix load_balance redo for !imbalance tip-bot for Vincent Guittot
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2018-09-07 11:37 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: mingo, linux-kernel, dietmar.eggemann, jhugo

On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> It can happen that load_balance finds a busiest group and then a busiest rq
> but the calculated imbalance is in fact null.

Cute. Does that happen often?

> If the calculated imbalance is null, it's useless to try to find a busiest
> rq as no task will be migrated and we can return immediately.
> 
> This situation can happen with heterogeneous system or smp system when RT
> tasks are decreasing the capacity of some CPUs.

Is it the result of one of those "force_balance" conditions in
find_busiest_group() ? Should we not fix that to then return NULL
instead?


* Re: [PATCH] sched/fair: fix load_balance redo for null imbalance
  2018-09-07 11:37 ` Peter Zijlstra
@ 2018-09-07 12:35   ` Vincent Guittot
  2018-09-07 12:55     ` Peter Zijlstra
  2018-09-07 12:55     ` Vincent Guittot
  0 siblings, 2 replies; 8+ messages in thread
From: Vincent Guittot @ 2018-09-07 12:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, dietmar.eggemann, jhugo

On Friday 07 Sep 2018 at 13:37:49 (+0200), Peter Zijlstra wrote:
> On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > It can happen that load_balance finds a busiest group and then a busiest rq
> > but the calculated imbalance is in fact null.
> 
> Cute. Does that happen often?

I have a use case with RT tasks that reproduces the problem regularly.
It happens at least when we have CPUs with different capacity, either because
of heterogeneous CPUs or because RT/DL is reducing the available capacity for CFS.
I have put the call path that triggers the problem below, and according to the
comment it seems that we can reach a similar state when playing with priority.

> 
> > If the calculated imbalance is null, it's useless to try to find a busiest
> > rq as no task will be migrated and we can return immediately.
> > 
> > This situation can happen with heterogeneous system or smp system when RT
> > tasks are decreasing the capacity of some CPUs.
> 
> Is it the result of one of those "force_balance" conditions in
> find_busiest_group() ? Should we not fix that to then return NULL
> instead?

The use case is:
We have a newly_idle load balance that is triggered when an RT task becomes idle
(but I think that I have seen it with the idle load balance too)

we trigger:
	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;

In calculate_imbalance() we take this path:
	/*
	 * Avg load of busiest sg can be less and avg load of local sg can
	 * be greater than avg load across all sgs of sd because avg load
	 * factors in sg capacity and sgs with smaller group_type are
	 * skipped when updating the busiest sg:
	 */
	if (busiest->avg_load <= sds->avg_load ||
	    local->avg_load >= sds->avg_load) {
		env->imbalance = 0;
		return fix_small_imbalance(env, sds);
	}

but fix_small_imbalance() finally decides to return without modifying the
imbalance, like here:
	if (busiest->avg_load + scaled_busy_load_per_task >=
	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
		env->imbalance = busiest->load_per_task;
		return;
	}

Besides this patch, I'm preparing another patch for fix_small_imbalance() to
ensure one task per CPU in a similar situation, but according to the comment above,
we can reach this situation because of task priorities.



* Re: [PATCH] sched/fair: fix load_balance redo for null imbalance
  2018-09-07 12:35   ` Vincent Guittot
@ 2018-09-07 12:55     ` Peter Zijlstra
  2018-09-07 12:57       ` Peter Zijlstra
  2018-09-07 13:08       ` Vincent Guittot
  2018-09-07 12:55     ` Vincent Guittot
  1 sibling, 2 replies; 8+ messages in thread
From: Peter Zijlstra @ 2018-09-07 12:55 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: mingo, linux-kernel, dietmar.eggemann, jhugo

On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> Le Friday 07 Sep 2018 à 13:37:49 (+0200), Peter Zijlstra a écrit :
> > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > but the calculated imbalance is in fact null.
> > 
> > Cute. Does that happen often?
> 
> I have a use case with RT tasks that reproduces the problem regularly.
> It happens at least when we have CPUs with different capacity either because
> of heterogeous CPU or because of RT/DL reducing available capacity for cfs
> I have put the call path that trigs the problem below and accroding to the
> comment it seems that we can reach similar state when playing with priority.
> 
> > 
> > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > rq as no task will be migrated and we can return immediately.
> > > 
> > > This situation can happen with heterogeneous system or smp system when RT
> > > tasks are decreasing the capacity of some CPUs.
> > 
> > Is it the result of one of those "force_balance" conditions in
> > find_busiest_group() ? Should we not fix that to then return NULL
> > instead?
> 
> The UC is:
> We have a newly_idle load balance that is triggered when RT task becomes idle
> ( but I think that I have seen that with idle load balance too)
> 
> we trigs:
> 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> 	    busiest->group_no_capacity)
> 		goto force_balance;
> 
> In calculate_imbalance we use the path
> 	/*
> 	 * Avg load of busiest sg can be less and avg load of local sg can
> 	 * be greater than avg load across all sgs of sd because avg load
> 	 * factors in sg capacity and sgs with smaller group_type are
> 	 * skipped when updating the busiest sg:
> 	 */
> 	if (busiest->avg_load <= sds->avg_load ||
> 	    local->avg_load >= sds->avg_load) {
> 		env->imbalance = 0;
> 		return fix_small_imbalance(env, sds);
> 	}
> 
> but fix_small_imbalance finally decides to return without modifying imbalance
> like here
> 	if (busiest->avg_load + scaled_busy_load_per_task >=
> 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
> 		env->imbalance = busiest->load_per_task;
> 		return;
> 	}

That one actually does modify imbalance :-) But I get your point.

> Beside this patch, I'm preparing another patch in fix small imbalance to
> ensure 1 task per CPU in similar situation but according to the comment above,
> we can reach this situation because of tasks priority

Didn't we all hate fix_small_imbalance() ?

Anyway, I think I'd prefer something like the below; it might be nicer to
thread the return value through calculate_imbalance() and
fix_small_imbalance(), but looking at them, that's not going to be
particularly nicer.

Do you agree with this? If so, I'll stick your original Changelog on it
and pretend this is what you sent me :-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..0596a29f3d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
-	return sds.busiest;
+	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
 	env->imbalance = 0;
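
For contrast, the "thread the return value" alternative mentioned above could
look roughly like this (hypothetical sketch, not a posted patch):
calculate_imbalance() and fix_small_imbalance() would report whether they
produced a non-zero imbalance, and find_busiest_group() would check that
instead:

force_balance:
	/* Looks like there is an imbalance. Compute it */
	if (!calculate_imbalance(env, &sds))
		goto out_balanced;
	return sds.busiest;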


* Re: [PATCH] sched/fair: fix load_balance redo for null imbalance
  2018-09-07 12:35   ` Vincent Guittot
  2018-09-07 12:55     ` Peter Zijlstra
@ 2018-09-07 12:55     ` Vincent Guittot
  1 sibling, 0 replies; 8+ messages in thread
From: Vincent Guittot @ 2018-09-07 12:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, jhugo

On Fri, 7 Sep 2018 at 14:35, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> Le Friday 07 Sep 2018 à 13:37:49 (+0200), Peter Zijlstra a écrit :
> > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > but the calculated imbalance is in fact null.
> >
> > Cute. Does that happen often?
>
> I have a use case with RT tasks that reproduces the problem regularly.
> It happens at least when we have CPUs with different capacity either because
> of heterogeous CPU or because of RT/DL reducing available capacity for cfs
> I have put the call path that trigs the problem below and accroding to the
> comment it seems that we can reach similar state when playing with priority.
>
> >
> > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > rq as no task will be migrated and we can return immediately.
> > >
> > > This situation can happen with heterogeneous system or smp system when RT
> > > tasks are decreasing the capacity of some CPUs.
> >
> > Is it the result of one of those "force_balance" conditions in
> > find_busiest_group() ? Should we not fix that to then return NULL
> > instead?
>
> The UC is:
> We have a newly_idle load balance that is triggered when RT task becomes idle
> ( but I think that I have seen that with idle load balance too)
>
> we trigs:
>         if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
>             busiest->group_no_capacity)
>                 goto force_balance;
>
> In calculate_imbalance we use the path
>         /*
>          * Avg load of busiest sg can be less and avg load of local sg can
>          * be greater than avg load across all sgs of sd because avg load
>          * factors in sg capacity and sgs with smaller group_type are
>          * skipped when updating the busiest sg:
>          */
>         if (busiest->avg_load <= sds->avg_load ||
>             local->avg_load >= sds->avg_load) {
>                 env->imbalance = 0;
>                 return fix_small_imbalance(env, sds);
>         }
>
> but fix_small_imbalance finally decides to return without modifying imbalance
> like here
>         if (busiest->avg_load + scaled_busy_load_per_task >=
>             local->avg_load + (scaled_busy_load_per_task * imbn)) {
>                 env->imbalance = busiest->load_per_task;
>                 return;
>         }
>
> Beside this patch, I'm preparing another patch in fix small imbalance to
> ensure 1 task per CPU in similar situation but according to the comment above,
> we can reach this situation because of tasks priority

I have just done a quick test on my SMP hikey board (dual quad-core
arm64) by adding a log in dmesg each time we hit the condition
busiest != NULL and imbalance == 0. The log shows up from time to time
when I generate some activity on the board, like syncing the filesystem
before running a test, but I don't have the details. The logs happen
with and without the next patch that I mentioned above, so it probably
means that we can trigger this situation with other use cases.
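
Something along these lines (hypothetical; the actual debug hunk wasn't
posted) is enough to catch the case described:

	group = find_busiest_group(&env);
	if (group && !env.imbalance)
		printk_deferred(KERN_INFO "load_balance: busiest group but zero imbalance on cpu%d\n",
				this_cpu);

(printk_deferred() is just a defensive choice to keep the console path from
waking things up from inside the scheduler; a plain printk() would likely
work here too.)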

>


* Re: [PATCH] sched/fair: fix load_balance redo for null imbalance
  2018-09-07 12:55     ` Peter Zijlstra
@ 2018-09-07 12:57       ` Peter Zijlstra
  2018-09-07 13:08       ` Vincent Guittot
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2018-09-07 12:57 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: mingo, linux-kernel, dietmar.eggemann, jhugo

On Fri, Sep 07, 2018 at 02:55:54PM +0200, Peter Zijlstra wrote:
> > Beside this patch, I'm preparing another patch in fix small imbalance to
> > ensure 1 task per CPU in similar situation but according to the comment above,
> > we can reach this situation because of tasks priority
> 
> Didn't we all hate fix_small_imbalance() ?

That is, all that load_per_task business is complete rubbish in this
cgroup world.


* Re: [PATCH] sched/fair: fix load_balance redo for null imbalance
  2018-09-07 12:55     ` Peter Zijlstra
  2018-09-07 12:57       ` Peter Zijlstra
@ 2018-09-07 13:08       ` Vincent Guittot
  1 sibling, 0 replies; 8+ messages in thread
From: Vincent Guittot @ 2018-09-07 13:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, jhugo

On Fri, 7 Sep 2018 at 14:56, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> > Le Friday 07 Sep 2018 à 13:37:49 (+0200), Peter Zijlstra a écrit :
> > > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > > but the calculated imbalance is in fact null.
> > >
> > > Cute. Does that happen often?
> >
> > I have a use case with RT tasks that reproduces the problem regularly.
> > It happens at least when we have CPUs with different capacity either because
> > of heterogeous CPU or because of RT/DL reducing available capacity for cfs
> > I have put the call path that trigs the problem below and accroding to the
> > comment it seems that we can reach similar state when playing with priority.
> >
> > >
> > > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > > rq as no task will be migrated and we can return immediately.
> > > >
> > > > This situation can happen with heterogeneous system or smp system when RT
> > > > tasks are decreasing the capacity of some CPUs.
> > >
> > > Is it the result of one of those "force_balance" conditions in
> > > find_busiest_group() ? Should we not fix that to then return NULL
> > > instead?
> >
> > The UC is:
> > We have a newly_idle load balance that is triggered when RT task becomes idle
> > ( but I think that I have seen that with idle load balance too)
> >
> > we trigs:
> >       if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> >           busiest->group_no_capacity)
> >               goto force_balance;
> >
> > In calculate_imbalance we use the path
> >       /*
> >        * Avg load of busiest sg can be less and avg load of local sg can
> >        * be greater than avg load across all sgs of sd because avg load
> >        * factors in sg capacity and sgs with smaller group_type are
> >        * skipped when updating the busiest sg:
> >        */
> >       if (busiest->avg_load <= sds->avg_load ||
> >           local->avg_load >= sds->avg_load) {
> >               env->imbalance = 0;
> >               return fix_small_imbalance(env, sds);
> >       }
> >
> > but fix_small_imbalance finally decides to return without modifying imbalance
> > like here
> >       if (busiest->avg_load + scaled_busy_load_per_task >=
> >           local->avg_load + (scaled_busy_load_per_task * imbn)) {
> >               env->imbalance = busiest->load_per_task;
> >               return;
> >       }
>
> That one actually does modify imbalance :-) But I get your point.
>
> > Beside this patch, I'm preparing another patch in fix small imbalance to
> > ensure 1 task per CPU in similar situation but according to the comment above,
> > we can reach this situation because of tasks priority
>
> Didn't we all hate fix_small_imbalance() ?

Yes. The rationale of some decisions in the function is more and more
opaque to me. For now I'm trying to fix all the use cases that I can find,
and then try to derive a generic rule.
As an example, I have some situations (other than those discussed
here) in fix_small_imbalance() where the task migration decision mainly
depends on a variation of +/-5 in the load of a task. This finally
generates good fairness between tasks, but the root cause of this
fairness is a bit random.

>
> Anyway, I think I'd prefer something like the below; although it might
> be nicer to thread the return value through calculate_imbalance() and
> fix_small_imbalance(), but looking at them that's not going to be
> particularly nicer.
>
> Do you agree with this?, If so, I'll stick your orignal Changelog on it
> and pretend this is what you send me :-)

That's fine with me :-)

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..0596a29f3d2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  force_balance:
>         /* Looks like there is an imbalance. Compute it */
>         calculate_imbalance(env, &sds);
> -       return sds.busiest;
> +       return env->imbalance ? sds.busiest : NULL;
>
>  out_balanced:
>         env->imbalance = 0;


* [tip:sched/core] sched/fair: Fix load_balance redo for !imbalance
  2018-09-07  7:51 [PATCH] sched/fair: fix load_balance redo for null imbalance Vincent Guittot
  2018-09-07 11:37 ` Peter Zijlstra
@ 2018-09-10 10:08 ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 8+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-09-10 10:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, vincent.guittot, tglx, peterz, torvalds, linux-kernel, mingo

Commit-ID:  bb3485c8ace6475c269b1aa2da674490f455f412
Gitweb:     https://git.kernel.org/tip/bb3485c8ace6475c269b1aa2da674490f455f412
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 7 Sep 2018 09:51:04 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 10 Sep 2018 10:13:49 +0200

sched/fair: Fix load_balance redo for !imbalance

It can happen that load_balance() finds a busiest group and then a
busiest rq but the calculated imbalance is in fact 0.

In such a situation, detach_tasks() returns immediately and leaves the
LBF_ALL_PINNED flag set. The busiest CPU is then wrongly assumed to
have pinned tasks and is removed from the load balance mask. We then
redo a load balance without the busiest CPU. This creates a wrong load
balance situation and generates spurious task migrations.

If the calculated imbalance is 0, it's useless to try to find a
busiest rq as no task will be migrated and we can return immediately.

This situation can happen with heterogeneous systems or SMP systems
where RT tasks reduce the available capacity of some CPUs.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: jhugo@codeaurora.org
Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f12d004be6a1..fc9a484ef82b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8275,7 +8275,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
-	return sds.busiest;
+	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
 	env->imbalance = 0;

