* [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
@ 2017-07-13 19:55 Jeffrey Hugo
  2017-07-18 19:48 ` Dietmar Eggemann
  0 siblings, 1 reply; 8+ messages in thread
From: Jeffrey Hugo @ 2017-07-13 19:55 UTC
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi, Jeffrey Hugo

The group_imbalance path in calculate_imbalance() made sense when it was
added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
balance across sched domain") because busiest->load_per_task factored into
the amount of imbalance that was calculated. Beginning with commit
dd5feea14a7d ("sched: Fix SCHED_MC regression caused by change in sched
cpu_power"), busiest->load_per_task is not a factor in the imbalance
calculation, thus the group_imbalance path no longer makes sense.

The group_imbalance path can only affect the outcome of
calculate_imbalance() when the average load of the domain is less than the
original busiest->load_per_task. In this case, busiest->load_per_task is
overwritten with the scheduling domain load average. Thus
busiest->load_per_task no longer represents actual load that can be moved.

At the final comparison between env->imbalance and busiest->load_per_task,
imbalance may be larger than the new busiest->load_per_task causing the
check to fail under the assumption that there is a task that could be
migrated to satisfy the imbalance. However env->imbalance may still be
smaller than the original busiest->load_per_task, thus it is unlikely that
there is a task that can be migrated to satisfy the imbalance.
Calculate_imbalance() would not choose to run fix_small_imbalance() when we
expect it should. In the worst case, this can result in idle cpus.
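
For reference, the final check in calculate_imbalance() that this refers
to is:

	if (env->imbalance < busiest->load_per_task)
		return fix_small_imbalance(env, sds);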

Since the group imbalance path in calculate_imbalance() is at best a NOP
but otherwise harmful, remove it.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---

[v6]
-Added additional history clarification to commit text

 kernel/sched/fair.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84255ab..3600713 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7760,15 +7760,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (busiest->group_type == group_imbalanced) {
-		/*
-		 * In the group_imb case we cannot rely on group-wide averages
-		 * to ensure cpu-load equilibrium, look at wider averages. XXX
-		 */
-		busiest->load_per_task =
-			min(busiest->load_per_task, sds->avg_load);
-	}
-
 	/*
 	 * Avg load of busiest sg can be less and avg load of local sg can
 	 * be greater than avg load across all sgs of sd because avg load
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-07-13 19:55 [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance() Jeffrey Hugo
@ 2017-07-18 19:48 ` Dietmar Eggemann
  2017-07-26 14:54   ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Dietmar Eggemann @ 2017-07-18 19:48 UTC
  To: Jeffrey Hugo, Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Austin Christ, Tyler Baicar, Timur Tabi

Hi Jeffrey,

On 13/07/17 20:55, Jeffrey Hugo wrote:
> The group_imbalance path in calculate_imbalance() made sense when it was
> added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
> balance across sched domain") because busiest->load_per_task factored into
> the amount of imbalance that was calculated. Beginning with commit
> dd5feea14a7d ("sched: Fix SCHED_MC regression caused by change in sched
> cpu_power"), busiest->load_per_task is not a factor in the imbalance
> calculation, thus the group_imbalance path no longer makes sense.

You're referring here to the use of 'sds->max_load -
sds->busiest_load_per_task' in the calculation of max_pull which got
replaced by load_above_capacity with dd5feea14a7d?

I still wonder if the original code (908a7c1b9b80)

if (group_imb)
    busiest_load_per_task = min(busiest_load_per_task, avg_load);

had something to do with the following:

if (max_load <= busiest_load_per_task)
    goto out_balanced;

> The group_imbalance path can only affect the outcome of
> calculate_imbalance() when the average load of the domain is less than the
> original busiest->load_per_task. In this case, busiest->load_per_task is
> overwritten with the scheduling domain load average. Thus
> busiest->load_per_task no longer represents actual load that can be moved.
> 
> At the final comparison between env->imbalance and busiest->load_per_task,
> imbalance may be larger than the new busiest->load_per_task causing the
> check to fail under the assumption that there is a task that could be
> migrated to satisfy the imbalance. However env->imbalance may still be
> smaller than the original busiest->load_per_task, thus it is unlikely that
> there is a task that can be migrated to satisfy the imbalance.
> Calculate_imbalance() would not choose to run fix_small_imbalance() when we
> expect it should. In the worst case, this can result in idle cpus.
> 
> Since the group imbalance path in calculate_imbalance() is at best a NOP
> but otherwise harmful, remove it.
>

IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
level sched groups.

Running the testcase 'taskset 0x05 '2 always running task'' (both tasks
starting on cpu0) on your machine shows the issue since with your
previous patch [1] "sched/fair: Fix load_balance() affinity redo path"
we now propagate 'group imbalance' from MC level to DIE level and since
you have n > 2 you lower busiest->load_per_task in this group_imbalanced
related if condition all the time and env->imbalance stays too small to
let one of these tasks migrate to cpu2.
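
Just to spell the testcase out, roughly (assuming plain busy loops as
the '2 always running tasks'; whether both really start on cpu0 depends
on placement at fork):

# taskset 0x05 sh -c 'while :; do :; done' &
# taskset 0x05 sh -c 'while :; do :; done' &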

Tried to test it on an Intel i5-3320M (2 cores x 2 HT) with rt-app (2
always running cfs task with affinity 0x05 for 2*x ms and one rt task
affine to 0x04 for x ms):

# cat /proc/schedstat | grep ^domain | awk '{ print $1" "$2}'
domain0 03
domain1 0f
domain0 03
domain1 0f
domain0 0c
domain1 0f
domain0 0c
domain1 0f

but here the prefer_sibling handling (group overloaded) eclipses 'group
imbalance' the moment one of the cfs tasks can go to cpu2 so the if
condition you got rid of is a nop.
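
(That's down to how the group type is classified - going from memory of
the fair.c of that time, so treat this as a sketch rather than the exact
code:

enum group_type {
	group_other = 0,
	group_imbalanced,
	group_overloaded,
};

static inline enum group_type
group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
{
	if (sgs->group_no_capacity)
		return group_overloaded;

	if (sg_imbalanced(group))
		return group_imbalanced;

	return group_other;
}

and calculate_imbalance() only takes the branch you remove when
busiest->group_type == group_imbalanced.)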

I wonder if it is fair to say that your fix helps multi-cluster
(especially with n > 2) systems without SMT and with your first patch
[1] for this specific, cpu-affinity-restricted test case.

> Co-authored-by: Austin Christ <austinwc@codeaurora.org>
> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
> ---

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

> 
> [v6]
> -Added additional history clarification to commit text
> 
>  kernel/sched/fair.c | 9 ---------
>  1 file changed, 9 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84255ab..3600713 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7760,15 +7760,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	local = &sds->local_stat;
>  	busiest = &sds->busiest_stat;
>  
> -	if (busiest->group_type == group_imbalanced) {
> -		/*
> -		 * In the group_imb case we cannot rely on group-wide averages
> -		 * to ensure cpu-load equilibrium, look at wider averages. XXX
> -		 */
> -		busiest->load_per_task =
> -			min(busiest->load_per_task, sds->avg_load);
> -	}
> -
>  	/*
>  	 * Avg load of busiest sg can be less and avg load of local sg can
>  	 * be greater than avg load across all sgs of sd because avg load
> 

* Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-07-18 19:48 ` Dietmar Eggemann
@ 2017-07-26 14:54   ` Peter Zijlstra
  2017-07-28 12:16     ` Dietmar Eggemann
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-07-26 14:54 UTC
  To: Dietmar Eggemann
  Cc: Jeffrey Hugo, Ingo Molnar, linux-kernel, Austin Christ,
	Tyler Baicar, Timur Tabi

On Tue, Jul 18, 2017 at 08:48:53PM +0100, Dietmar Eggemann wrote:
> Hi Jeffrey,
> 
> On 13/07/17 20:55, Jeffrey Hugo wrote:
> > The group_imbalance path in calculate_imbalance() made sense when it was
> > added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
> > balance across sched domain") because busiest->load_per_task factored into
> > the amount of imbalance that was calculated. Beginning with commit
> > dd5feea14a7d ("sched: Fix SCHED_MC regression caused by change in sched
> > cpu_power"), busiest->load_per_task is not a factor in the imbalance
> > calculation, thus the group_imbalance path no longer makes sense.

Bit quick that. If it's no longer used, then who cares what value it
is... /me reads on.

> You're referring here to the use of 'sds->max_load -
> sds->busiest_load_per_task' in the calculation of max_pull which got
> replaced by load_above_capacity with dd5feea14a7d?
> 
> I still wonder if the original code (908a7c1b9b80)
> 
> if (group_imb)
>     busiest_load_per_task = min(busiest_load_per_task, avg_load);
> 
> had something to do with the following:
> 
> if (max_load <= busiest_load_per_task)
>     goto out_balanced;

Quite possibly, yes. By lowering busiest_load_per_task it skips that
test. But also, as noted, the lower busiest_load_per_task is then used
in the imbalance calculation to allow moving more load around, so it's
not only that.

> > The group_imbalance path can only affect the outcome of
> > calculate_imbalance() when the average load of the domain is less than the
> > original busiest->load_per_task. In this case, busiest->load_per_task is
> > overwritten with the scheduling domain load average. Thus
> > busiest->load_per_task no longer represents actual load that can be moved.
> > 
> > At the final comparison between env->imbalance and busiest->load_per_task,
> > imbalance may be larger than the new busiest->load_per_task causing the
> > check to fail under the assumption that there is a task that could be
> > migrated to satisfy the imbalance. However env->imbalance may still be
> > smaller than the original busiest->load_per_task, thus it is unlikely that
> > there is a task that can be migrated to satisfy the imbalance.
> > Calculate_imbalance() would not choose to run fix_small_imbalance() when we
> > expect it should. In the worst case, this can result in idle cpus.
> > 
> > Since the group imbalance path in calculate_imbalance() is at best a NOP
> > but otherwise harmful, remove it.

Hurm.. so fix_small_imbalance() itself is a pile of dog poo... it used
to make sense a long time ago, but smp-nice and then cgroups made a
complete joke of things.

> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
> level sched groups.

That'd be a NUMA box?

> Running the testcase 'taskset 0x05 '2 always running task'' (both tasks
> starting on cpu0) on your machine shows the issue since with your
> previous patch [1] "sched/fair: Fix load_balance() affinity redo path"
> we now propagate 'group imbalance' from MC level to DIE level and since
> you have n > 2 you lower busiest->load_per_task in this group_imbalanced
> related if condition all the time and env->imbalance stays too small to
> let one of these tasks migrate to cpu2.
> 
> Tried to test it on an Intel i5-3320M (2 cores x 2 HT) with rt-app (2
> always running cfs task with affinity 0x05 for 2*x ms and one rt task
> affine to 0x04 for x ms):
> 
> # cat /proc/schedstat | grep ^domain | awk '{ print $1" "$2}'
> domain0 03
> domain1 0f
> domain0 03
> domain1 0f
> domain0 0c
> domain1 0f
> domain0 0c
> domain1 0f
> 
> but here the prefer_sibling handling (group overloaded) eclipses 'group
> imbalance' the moment one of the cfs tasks can go to cpu2 so the if
> condition you got rid of is a nop.
> 
> I wonder if it is fair to say that your fix helps multi-cluster
> (especially with n > 2) systems without SMT and with your first patch
> [1] for this specific, cpu-affinity-restricted test case.

I tried on an IVB-EP with all the HT siblings unplugged, could not
reproduce either. Still at n=2 though. Let me fire up an EX, that'll get
me n=4.

So this is 4 * 18 * 2 = 144 cpus:

# for ((i=72; i<144; i++)) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
# taskset -pc 0,18 $$
# while :; do :; done & while :; do :; done &

So I'm taking SMT out, affine to first and second MC group, start 2
loops.

Using another console I see them both using 100%.

If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
Then instantly they balance and I get 2x100% back.

Anything else I need to reproduce? (other than maybe a slightly less
insane machine :-)


Because I have the feeling that while this patch cures things for you,
you're fighting symptoms.

* Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-07-26 14:54   ` Peter Zijlstra
@ 2017-07-28 12:16     ` Dietmar Eggemann
  2017-07-28 12:59       ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Dietmar Eggemann @ 2017-07-28 12:16 UTC
  To: Peter Zijlstra
  Cc: Jeffrey Hugo, Ingo Molnar, linux-kernel, Austin Christ,
	Tyler Baicar, Timur Tabi

On 26/07/17 15:54, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 08:48:53PM +0100, Dietmar Eggemann wrote:
>> Hi Jeffrey,
>>
>> On 13/07/17 20:55, Jeffrey Hugo wrote:

[...]

>>> Since the group imbalance path in calculate_imbalance() is at best a NOP
>>> but otherwise harmful, remove it.
> 
> Hurm.. so fix_small_imbalance() itself is a pile of dog poo... it used
> to make sense a long time ago, but smp-nice and then cgroups made a
> complete joke of things.
> 
>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>> level sched groups.
> 
> That'd be a NUMA box?

I don't think it's NUMA. SD levels are MC, DIE w/ # DIE sg's >> 2.

[...]

>> but here the prefer_sibling handling (group overloaded) eclipses 'group
>> imbalance' the moment one of the cfs tasks can go to cpu2 so the if
>> condition you got rid of is a nop.
>>
>> I wonder if it is fair to say that your fix helps multi-cluster
>> (especially with n > 2) systems without SMT and with your first patch
>> [1] for this specific, cpu-affinity-restricted test case.
> 
> I tried on an IVB-EP with all the HT siblings unplugged, could not
> reproduce either. Still at n=2 though. Let me fire up an EX, that'll get
> me n=4.
> 
> So this is 4 * 18 * 2 = 144 cpus:

Impressive ;-)

> 
> # for ((i=72; i<144; i++)) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
> # taskset -pc 0,18 $$
> # while :; do :; done & while :; do :; done &
> 
> So I'm taking SMT out, affine to first and second MC group, start 2
> loops.
> 
> Using another console I see them both using 100%.
> 
> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> Then instantly they balance and I get 2x100% back.

Yeah, could reproduce on IVB-EP (2x10x2).

> Anything else I need to reproduce? (other than maybe a slightly less
> insane machine :-)

I guess what Jeff is trying to avoid is that 'busiest->load_per_task'
lowered to 'sds->avg_load' in case of an imbalanced busiest sg:

  if (busiest->group_type == group_imbalanced)
    busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);

is so low that later fix_small_imbalance() won't be called and
'env->imbalance' stays so low that load-balance of one 50% task to the
now idle cpu won't happen.

  if (env->imbalance < busiest->load_per_task)
    fix_small_imbalance(env, sds);

Having really a lot of otherwise idle DIE sg's helps to keep
'sds->avg_load' low in comparison to 'busiest->load_per_task'.
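
(sds->avg_load is the domain-wide load-to-capacity ratio, IIRC computed
in find_busiest_group() as something like

  sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) /
                   sds.total_capacity;

so every extra idle sg adds capacity without adding load and drags the
average down.)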

> Because I have the feeling that while this patch cures things for you,
> you're fighting symptoms.

Unfortunately, don't have a machine available with n >> 2 (on DIE or
NUMA) ...

* Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-07-28 12:16     ` Dietmar Eggemann
@ 2017-07-28 12:59       ` Peter Zijlstra
  2017-07-28 13:34         ` Dietmar Eggemann
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-07-28 12:59 UTC
  To: Dietmar Eggemann
  Cc: Jeffrey Hugo, Ingo Molnar, linux-kernel, Austin Christ,
	Tyler Baicar, Timur Tabi

On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
> >> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
> >> level sched groups.
> > 
> > That'd be a NUMA box?
> 
> I don't think it's NUMA. SD levels are MC, DIE w/ # DIE sg's >> 2.

Ah, I can't read. I thought >2 DIEs.

> > So this is 4 * 18 * 2 = 144 cpus:
> 
> Impressive ;-)

Takes forever to boot though :/

> > If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> > Then instantly they balance and I get 2x100% back.
> 
> Yeah, could reproduce on IVB-EP (2x10x2).

OK, I have one of those. What should I do, because I didn't actually see
anything odd.

> > Anything else I need to reproduce? (other than maybe a slightly less
> > insane machine :-)
> 
> I guess what Jeff is trying to avoid is that 'busiest->load_per_task'
> lowered to 'sds->avg_load' in case of an imbalanced busiest sg:
> 
>   if (busiest->group_type == group_imbalanced)
>     busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);
> 
> is so low that later fix_small_imbalance() won't be called and
> 'env->imbalance' stays so low that load-balance of one 50% task to the
> now idle cpu won't happen.
> 
>   if (env->imbalance < busiest->load_per_task)
>     fix_small_imbalance(env, sds);
> 
> Having really a lot of otherwise idle DIE sg's helps to keep
> 'sds->avg_load' low in comparison to 'busiest->load_per_task'.

Right, but the whole load_per_task thing is a bit wonky, and since
that's the basis of fix_small_imbalance() I'm very suspect.

* Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-07-28 12:59       ` Peter Zijlstra
@ 2017-07-28 13:34         ` Dietmar Eggemann
  2017-07-28 18:09           ` Jeffrey Hugo
  0 siblings, 1 reply; 8+ messages in thread
From: Dietmar Eggemann @ 2017-07-28 13:34 UTC
  To: Peter Zijlstra
  Cc: Jeffrey Hugo, Ingo Molnar, linux-kernel, Austin Christ,
	Tyler Baicar, Timur Tabi

On 28/07/17 13:59, Peter Zijlstra wrote:
> On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
>>>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>>>> level sched groups.

[...]

>>> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
>>> Then instantly they balance and I get 2x100% back.
>>
>> Yeah, could reproduce on IVB-EP (2x10x2).
> 
> OK, I have one of those. What should I do, because I didn't actually see
> anything odd.

Me neither. Sorry, I was unclear. I meant I could reproduce your
example, that one of the 50% tasks moves to the idle cpu on this machine.

[...]

* Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-07-28 13:34         ` Dietmar Eggemann
@ 2017-07-28 18:09           ` Jeffrey Hugo
  2017-08-16 16:41             ` Jeffrey Hugo
  0 siblings, 1 reply; 8+ messages in thread
From: Jeffrey Hugo @ 2017-07-28 18:09 UTC
  To: Dietmar Eggemann, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Austin Christ, Tyler Baicar, Timur Tabi

On 7/28/2017 7:34 AM, Dietmar Eggemann wrote:
> On 28/07/17 13:59, Peter Zijlstra wrote:
>> On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
>>>>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>>>>> level sched groups.
> 
> [...]
> 
>>>> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
>>>> Then instantly they balance and I get 2x100% back.
>>>
>>> Yeah, could reproduce on IVB-EP (2x10x2).
>>
>> OK, I have one of those. What should I do, because I didn't actually see
>> anything odd.
> 
> Me neither. Sorry, I was unclear. I meant I could reproduce your
> example, that one of the 50% tasks moves to the idle cpu on this machine.
> 
> [...]
> 

Dietmar's assessment is essentially correct. We are seeing a 
circumstance where fix_small_imbalance() could fix a situation with an 
idle core and an over-worked core, but it is being skipped. Here is a 
detailed step through of the specific case we're seeing.

Consider the following system.

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]

Not NUMA, not SMT.

Imagine we have a workload which spawns 5 CPU intensive workers, which
we want to taskset to 5 cores such that each worker has its own core.
We want to see:

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T T T   T

Where * indicates the taskset, and T indicates each thread/process.

In many cases, workloads are not ideally assigned at fork(), because 
load statistics haven't had a chance to be generated by the new work. 
This results in the following:

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T   T   T
                T

Core 4 is over-assigned (more workers than it can efficiently handle) 
and core 2 is under-assigned (no work to do).  Not great, but we expect 
the system to load_balance() quickly, and move one T from core 4 to core 2.

We do not see this occur, which means the scheduler is not doing its
job; it leaves a core idle when there's work to do.

What happens -

Core 5 is idle, so it triggers load_balance().  At the MC level, it 
identifies core 4 as busiest, and attempts to pull load.  Core 5 fails 
because it is not part of the taskset, and load cannot be moved to 
another available core (6 or 7), so the group_imbalance flag is set.
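
(The flag propagation happens near the bottom of load_balance() -
quoting from memory, so take the exact context with a grain of salt:

	if (sd_parent) {
		int *group_imbalance = &sd_parent->groups->sgc->imbalance;

		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
			*group_imbalance = 1;
	}

LBF_SOME_PINNED is set when affinity prevented moving tasks, and
sg_imbalanced() reads sgc->imbalance when the group is later classified
at the parent level.)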

Core 2 is idle, so it triggers load_balance().  It won't pull any load 
from its siblings in MC (0, 1, or 3) because doing so does not solve an 
imbalance, it just moves the imbalance around.  load_balance() at MC 
fails, so it moves up to DIE.  Since the group containing core 4 is 
group_imbalance, it is selected as busiest.

We go to calculate_imbalance().  Since each T is CPU intensive, their
load is high, let's say 200, thus busiest->load_per_task is 200.
However, we have a lot of idle or lightly loaded cores at the DIE level,
so the sds average (sds->avg_load) gets pushed down significantly; let's
say it is 20.  Since busiest->group_imbalance is true, we take the min
of busiest->load_per_task and the sds average (200 and 20), and make
that the new busiest->load_per_task.
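
Plugging those numbers into the path in question:

	busiest->load_per_task = min(200, 20); /* now 20 */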

We calculate env->imbalance, which we'll say is 48.

Now we hit the end check -
	/*
	 * if *imbalance is less than the average load per runnable task
	 * there is no guarantee that any tasks will be moved so we'll have
	 * a think about bumping its value to force at least one task to be
	 * moved
	 */
	if (env->imbalance < busiest->load_per_task)
		return fix_small_imbalance(env, sds);

We try to see if the calculated imbalance can be solved by moving tasks 
around by comparing the calculated imbalance (env->imbalance) to the 
calculated average load per task in the busiest group 
(busiest->load_per_task).  Since busiest->load_per_task is an average, 
either there are tasks with load less than that, or all tasks are at 
that value, thus if env->imbalance is more than busiest->load_per_task, 
we have candidate tasks we can evaluate to address the imbalance.

env->imbalance is 48, and the modified busiest->load_per_task is 20, so 
the check fails, and we exit out of calculate_imbalance() with an 
imbalance to solve of 48.  We jump all the way up to load_balance(), and 
then into detach_tasks().  We find two tasks that are migration 
candidates because core 2 is in the same taskset as core 4, so we 
evaluate if they can help solve the imbalance, or would just move the 
imbalance around -

		if ((load / 2) > env->imbalance)
			goto next;

The load of each task is 200, so 200 / 2 = 100, and 100 > 48.  We can't
move anything, as doing so would not help solve the imbalance, only move
it around, so why take the hit of actually migrating the process?

We might not resolve the imbalance, but we get better CPU usage if we 
move a task from core 4 to core 2, which is what we expect. Based on a 
comment within the function fix_small_imbalance(), this appears to be 
the correct path for this circumstance.

In fix_small_imbalance() -

	/*
	 * OK, we don't have enough imbalance to justify moving tasks,
	 * however we may be able to increase total CPU capacity used by
	 * moving them.
	 */

As mentioned above, we aren't hitting this function path for the 
following reason:
	if (env->imbalance < busiest->load_per_task)
		return fix_small_imbalance(env, sds);

env->imbalance is 48 and the modified busiest->load_per_task is 20, but
the original busiest->load_per_task was 200.  48 < 200 is true, so we
would have hit fix_small_imbalance().  The group_imbalance path modifies
busiest->load_per_task, so what happens when we remove that path (i.e.
this change)?  We see the behavior we expect.

This -

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T   T   T
                T

Goes to this -

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T T T   T



So, we've identified an issue, and root caused it to a small code 
snippet.  Great for us, but the scheduler is generic code, it needs to 
work for all.  Can this code really be removed?

Here is our current thinking on all uses of this code path. Let us know 
if we missed anything.

SMT avoids this issue because of SD_PREFER_SIBLING - the busiest group
gets classified as group_overloaded rather than group_imbalanced, so the
group_imbalance path is never taken and busiest->load_per_task never
gets modified.
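
(The prefer_sibling handling we mean is, roughly, this block in
update_sd_lb_stats() - quoted from memory, so take it as a sketch rather
than the exact code:

	if (prefer_sibling && sds->local &&
	    group_has_capacity(local, env) &&
	    (sgs->sum_nr_running > local->sum_nr_running + 1)) {
		sgs->group_no_capacity = 1;
		sgs->group_type = group_classify(sg, sgs);
	}

which reclassifies the group as group_overloaded, so the
busiest->group_type == group_imbalanced check never fires.)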

In cases where the path is actually used:

1- busiest->group_imbalance is not set.  The code path is equal to a 
NOP, therefore it has no positive or negative effect.

2- busiest->group_imbalance is set, but sds->avg_load is greater than
busiest->load_per_task.  The min() picks busiest->load_per_task, so it
keeps its original value and nothing changes.  The code path is equal
to a NOP, therefore it has no positive or negative effect.

3- busiest->group_imbalance is set, but sds->avg_load is less than
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is more 
than the original busiest->load_per_task, it will be more than 
sds->avg_load, so we do not hit fix_small_imbalance() as expected.  The 
code path is equal to a NOP, therefore it has no positive or negative 
effect.

4- busiest->group_imbalance is set, but sds->avg_load is less than
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is less 
than the original busiest->load_per_task, but also less than 
sds->avg_load, we hit fix_small_imbalance(), but would have hit that 
anyways.  The code path is equal to a NOP, therefore it has no positive 
or negative effect.

5- busiest->group_imbalance is set, but sds->avg_load is less than
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is less 
than the original busiest->load_per_task, but more than sds->avg_load, 
we do not hit fix_small_imbalance().  If we happen to have tasks we can 
move around based on their load values compared to the current 
env->imbalance, we can make migrations.  The code path is equal to a 
NOP, therefore it has no positive or negative effect.

6- busiest->group_imbalance is set, but sds->avg_load is less than
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is less 
than the original busiest->load_per_task, but more than sds->avg_load, 
we do not hit fix_small_imbalance().  If we do not have tasks which can 
migrate based on their load value compared to the current 
env->imbalance, we hit the issue we've described above.  The code path 
has affected the behavior of the scheduler in a way which is not 
expected, therefore it has a negative effect.

So, in all of these scenarios but one, the group_imbalance code path of 
calculate_imbalance() does nothing.  In the remaining one scenario, the 
code path prevents the scheduler from doing its job (assigning work to 
idle cores), so it harms the system.

Thus we conclude that the code has no benefit, and should be removed.

Considered from a different perspective, busiest->load_per_task, in the 
current algorithm, represents actual load that can be moved around. 
Setting it to a global average breaks this invariant and causes a faulty 
decision at the fix_small_imbalance() path.


In summary, our system is harmed by the code branch, and our analysis
indicates the branch has no benefit to the scheduler as it sits today,
so we feel justified in removing it.

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-07-28 18:09           ` Jeffrey Hugo
@ 2017-08-16 16:41             ` Jeffrey Hugo
  0 siblings, 0 replies; 8+ messages in thread
From: Jeffrey Hugo @ 2017-08-16 16:41 UTC
  To: Dietmar Eggemann, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Austin Christ, Tyler Baicar, Timur Tabi

On 7/28/2017 12:09 PM, Jeffrey Hugo wrote:
> In summary, our system is harmed by the code branch, and our analysis 
> indicates the branch has no benefit to the scheduler as it sits today, 
> we feel justified in removing it.
> 

Ping?

Peter, did you find the additional information we provided useful?  Are 
you still attempting to replicate the issue?  Is there anything you need 
from us?

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
