* [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
@ 2015-03-26 13:02 Preeti U Murthy
  2015-03-26 17:03 ` Jason Low
                   ` (5 more replies)
  0 siblings, 6 replies; 33+ messages in thread
From: Preeti U Murthy @ 2015-03-26 13:02 UTC (permalink / raw)
  To: peterz, mingo
  Cc: riel, daniel.lezcano, vincent.guittot, srikar, pjt, benh, efault,
	linux-kernel, iamjoonsoo.kim, svaidy, tim.c.chen,
	morten.rasmussen, jason.low2

When a CPU is kicked to do nohz idle balancing, it wakes up to do load
balancing on itself, followed by load balancing on behalf of idle CPUs.
But it may end up with load after the load balancing attempt on itself.
This aborts nohz idle balancing. As a result several idle CPUs are left
without tasks till such a time that an ILB CPU finds it unfavorable to
pull tasks upon itself. This delays spreading of load across idle CPUs
and worse, clutters only a few CPUs with tasks.

The effect of the above problem was observed on an SMT8 POWER server
with 2 levels of numa domains. Busy loops equal to the number of cores were
spawned. Since load balancing on fork/exec is discouraged across numa
domains, all busy loops would start on one of the numa domains. However
it was expected that eventually one busy loop would run per core across
all domains due to nohz idle load balancing. But it was observed that it
took as long as 10 seconds to spread the load across numa domains.
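
The workload was essentially of the following shape (a minimal
reconstruction for illustration; the actual test harness is not shown
here, and the /8 divisor is an assumption based on the SMT8 topology):

#include <unistd.h>

int main(void)
{
	/* one busy loop per core, not per hardware thread */
	long i, ncores = sysconf(_SC_NPROCESSORS_ONLN) / 8;

	for (i = 0; i < ncores; i++) {
		if (fork() == 0)
			for (;;)
				;	/* busy loop */
	}
	pause();	/* parent idles while the children load the system */
	return 0;
}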

Further investigation showed that this was a consequence of the
following:

1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
load balancing [Given the experiment, up to 6 CPUs per core could be
potentially idle in this domain.]

2. However the ILB CPU would call load_balance() on itself before
initiating nohz idle load balancing.

3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
tasks from its sibling cores to even out load.

4. Now that the ILB CPU was no longer idle, it would abort nohz idle
load balancing

As a result the opportunities to spread load across numa domains were
lost until such a time that the cores within the first numa domain had
equal number of tasks among themselves.  This is a pretty bad scenario,
since the cores within the first numa domain would have as many as 4
tasks each, while cores in the neighbouring numa domains would all
remain idle.

Fix this by checking if a CPU was woken up to do nohz idle load
balancing, before it does load balancing upon itself. This way we allow
idle CPUs across the system to do load balancing which results in
quicker spread of load, instead of performing load balancing within the
local sched domain hierarchy of the ILB CPU alone under circumstances
such as above.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
Changes from V1:
1. Added relevant comments
2. Wrapped lines to a fixed width in the changelog

 kernel/sched/fair.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcfe320..8b6d0d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
 	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
-	rebalance_domains(this_rq, idle);
-
 	/*
 	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
-	 * stopped.
+	 * stopped. Do nohz_idle_balance *before* rebalance_domains to
+	 * give the idle cpus a chance to load balance. Else we may
+	 * load balance only within the local sched_domain hierarchy
+	 * and abort nohz_idle_balance altogether if we pull some load.
 	 */
 	nohz_idle_balance(this_rq, idle);
+	rebalance_domains(this_rq, idle);
 }
 
 /*



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-26 13:02 [PATCH V2] sched: Improve load balancing in the presence of idle CPUs Preeti U Murthy
@ 2015-03-26 17:03 ` Jason Low
  2015-03-27  2:12 ` Wanpeng Li
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Jason Low @ 2015-03-26 17:03 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, morten.rasmussen, jason.low2

On Thu, 2015-03-26 at 18:32 +0530, Preeti U Murthy wrote:
>  kernel/sched/fair.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcfe320..8b6d0d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
>  	enum cpu_idle_type idle = this_rq->idle_balance ?
>  						CPU_IDLE : CPU_NOT_IDLE;
>  
> -	rebalance_domains(this_rq, idle);
> -
>  	/*
>  	 * If this cpu has a pending nohz_balance_kick, then do the
>  	 * balancing on behalf of the other idle cpus whose ticks are
> -	 * stopped.
> +	 * stopped. Do nohz_idle_balance *before* rebalance_domains to
> +	 * give the idle cpus a chance to load balance. Else we may
> +	 * load balance only within the local sched_domain hierarchy
> +	 * and abort nohz_idle_balance altogether if we pull some load.
>  	 */
>  	nohz_idle_balance(this_rq, idle);
> +	rebalance_domains(this_rq, idle);

Reviewed-by: Jason Low <jason.low2@hp.com>



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-26 13:02 [PATCH V2] sched: Improve load balancing in the presence of idle CPUs Preeti U Murthy
  2015-03-26 17:03 ` Jason Low
@ 2015-03-27  2:12 ` Wanpeng Li
  2015-03-27  4:33   ` Preeti U Murthy
  2015-03-27  5:07   ` Jason Low
  2015-03-27 11:43 ` [tip:sched/core] " tip-bot for Preeti U Murthy
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 33+ messages in thread
From: Wanpeng Li @ 2015-03-27  2:12 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, morten.rasmussen, jason.low2, Wanpeng Li

Hi Preeti,
On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>
>1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>load balancing [Given the experiment, up to 6 CPUs per core could be
>potentially idle in this domain.]
>
>2. However the ILB CPU would call load_balance() on itself before
>initiating nohz idle load balancing.
>
>3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>tasks from its sibling cores to even out load.
>
>4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>load balancing

I don't see nohz idle load balancing being aborted when the ILB CPU is no
longer idle in nohz_idle_balance(); could you explain in more detail?

Regards,
Wanpeng Li 


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  4:33   ` Preeti U Murthy
@ 2015-03-27  4:23     ` Wanpeng Li
  2015-03-27  5:01     ` Jason Low
  1 sibling, 0 replies; 33+ messages in thread
From: Wanpeng Li @ 2015-03-27  4:23 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, morten.rasmussen, jason.low2, Wanpeng Li

Hi Preeti,
On Fri, Mar 27, 2015 at 10:03:21AM +0530, Preeti U Murthy wrote:
>is set to CPU_NOT_IDLE.
>
>""
>idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
>
>And,
>
>When nohz_idle_balance() is called, the idle state of the ILB CPU is
>checked before proceeding with load balancing on idle CPUs.
>
>""
>
>if (idle != CPU_IDLE ||
>             !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
>                 goto end;
>

This 'idle' is the one calculated in run_rebalance_domains() before invoking
rebalance_domains(); rebalance_domains() cannot modify this local
variable.

Regards,
Wanpeng Li 

>You see that nohz idle balancing gets aborted.
>
>Regards
>Preeti U Murthy
>
>
>
>> 
>> Regards,
>> Wanpeng Li 
>> 


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  2:12 ` Wanpeng Li
@ 2015-03-27  4:33   ` Preeti U Murthy
  2015-03-27  4:23     ` Wanpeng Li
  2015-03-27  5:01     ` Jason Low
  2015-03-27  5:07   ` Jason Low
  1 sibling, 2 replies; 33+ messages in thread
From: Preeti U Murthy @ 2015-03-27  4:33 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, morten.rasmussen, jason.low2

Hi Wanpeng

On 03/27/2015 07:42 AM, Wanpeng Li wrote:
> Hi Preeti,
> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>>
>> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>> load balancing [Given the experiment, up to 6 CPUs per core could be
>> potentially idle in this domain.]
>>
>> 2. However the ILB CPU would call load_balance() on itself before
>> initiating nohz idle load balancing.
>>
>> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>> tasks from its sibling cores to even out load.
>>
>> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>> load balancing
> 
> I don't see nohz idle load balancing being aborted when the ILB CPU is no
> longer idle in nohz_idle_balance(); could you explain in more detail?

When the ILB CPU pulls load in rebalance_domains(), its idle state
is set to CPU_NOT_IDLE.

""
idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;

And,

When nohz_idle_balance() is called, the idle state of the ILB CPU is
checked before proceeding with load balancing on idle CPUs.

""

if (idle != CPU_IDLE ||
             !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
                 goto end;

You see that nohz idle balancing gets aborted.

Regards
Preeti U Murthy



> 
> Regards,
> Wanpeng Li 
> 



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  4:33   ` Preeti U Murthy
  2015-03-27  4:23     ` Wanpeng Li
@ 2015-03-27  5:01     ` Jason Low
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Low @ 2015-03-27  5:01 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Wanpeng Li, peterz, mingo, riel, daniel.lezcano, vincent.guittot,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, morten.rasmussen, jason.low2

On Fri, 2015-03-27 at 10:03 +0530, Preeti U Murthy wrote:
> Hi Wanpeng
> 
> On 03/27/2015 07:42 AM, Wanpeng Li wrote:
> > Hi Preeti,
> > On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
> >>
> >> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
> >> load balancing [Given the experiment, up to 6 CPUs per core could be
> >> potentially idle in this domain.]
> >>
> >> 2. However the ILB CPU would call load_balance() on itself before
> >> initiating nohz idle load balancing.
> >>
> >> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
> >> tasks from its sibling cores to even out load.
> >>
> >> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
> >> load balancing
> > 
> > I don't see nohz idle load balancing being aborted when the ILB CPU is no
> > longer idle in nohz_idle_balance(); could you explain in more detail?
> 
> When the ILB CPU pulls load in rebalance_domains(), its idle state
> is set to CPU_NOT_IDLE.
> 
> ""
> idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;

Hi Preeti,

The "idle" variable is local to the rebalance_domains() function, so it
shouldn't have an effect on the idle value that gets passed to
nohz_idle_balance().



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  2:12 ` Wanpeng Li
  2015-03-27  4:33   ` Preeti U Murthy
@ 2015-03-27  5:07   ` Jason Low
  2015-03-27  5:39     ` Srikar Dronamraju
                       ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Jason Low @ 2015-03-27  5:07 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Preeti U Murthy, peterz, mingo, riel, daniel.lezcano,
	vincent.guittot, srikar, pjt, benh, efault, linux-kernel,
	iamjoonsoo.kim, svaidy, tim.c.chen, morten.rasmussen, jason.low2

On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
> Hi Preeti,
> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
> >
> >1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
> >load balancing [Given the experiment, up to 6 CPUs per core could be
> >potentially idle in this domain.]
> >
> >2. However the ILB CPU would call load_balance() on itself before
> >initiating nohz idle load balancing.
> >
> >3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
> >tasks from its sibling cores to even out load.
> >
> >4. Now that the ILB CPU was no longer idle, it would abort nohz idle
> >load balancing
> 
> I don't see nohz idle load balancing being aborted when the ILB CPU is no
> longer idle in nohz_idle_balance(); could you explain in more detail?

Hi Wanpeng,

In nohz_idle_balance(), there is a check for need_resched() so if the
cpu has something to run, it should exit nohz_idle_balance(), which may
cause it to not do the idle balancing on the other CPUs.
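
For reference, the loop in question looks roughly like this (a simplified
sketch of nohz_idle_balance() from kernel/sched/fair.c of that era;
locking, clock updates and stats updates are trimmed):

static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
{
	int this_cpu = this_rq->cpu;
	int balance_cpu;

	if (idle != CPU_IDLE ||
	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
		goto end;

	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
			continue;

		/*
		 * If this cpu gets work to do, stop the load balancing
		 * work being done for other cpus. The next load balancing
		 * owner will pick it up.
		 */
		if (need_resched())
			break;

		/* if the balance is due, rebalance on behalf of balance_cpu */
		if (time_after_eq(jiffies, cpu_rq(balance_cpu)->next_balance))
			rebalance_domains(cpu_rq(balance_cpu), CPU_IDLE);
	}
	nohz.next_balance = this_rq->next_balance;
end:
	clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
}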





* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  5:07   ` Jason Low
@ 2015-03-27  5:39     ` Srikar Dronamraju
  2015-03-27  7:00       ` Wanpeng Li
  2015-03-27  6:43     ` Wanpeng Li
  2015-03-27 16:23     ` Preeti U Murthy
  2 siblings, 1 reply; 33+ messages in thread
From: Srikar Dronamraju @ 2015-03-27  5:39 UTC (permalink / raw)
  To: Jason Low
  Cc: Wanpeng Li, Preeti U Murthy, peterz, mingo, riel, daniel.lezcano,
	vincent.guittot, pjt, benh, efault, linux-kernel, iamjoonsoo.kim,
	svaidy, tim.c.chen, morten.rasmussen

* Jason Low <jason.low2@hp.com> [2015-03-26 22:07:21]:

> On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
> > Hi Preeti,
> > On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
> > >
> > >1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
> > >load balancing [Given the experiment, up to 6 CPUs per core could be
> > >potentially idle in this domain.]
> > >
> > >2. However the ILB CPU would call load_balance() on itself before
> > >initiating nohz idle load balancing.
> > >
> > >3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
> > >tasks from its sibling cores to even out load.
> > >
> > >4. Now that the ILB CPU was no longer idle, it would abort nohz idle
> > >load balancing
> > 
> > I don't see nohz idle load balancing being aborted when the ILB CPU is no
> > longer idle in nohz_idle_balance(); could you explain in more detail?
> 
> Hi Wanpeng,
> 
> In nohz_idle_balance(), there is a check for need_resched() so if the
> cpu has something to run, it should exit nohz_idle_balance(), which may
> cause it to not do the idle balancing on the other CPUs.
> 

Yes, the need_resched() in nohz_idle_balance() would exit
nohz_idle_balance() if the cpu has something to run. However, I wonder if
we should move the need_resched() check out of the for loop, i.e. the
need_resched() check should probably sit with the idle check.

With the current code, when the ilb cpus are not free:
- We would be updating nohz.next_balance even though we haven't done
  any load balancing.
- We might iterate through nohz.idle_cpus_mask to find balance_cpus.
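
Concretely, the suggestion amounts to something like this (a hypothetical,
untested sketch that hoists the bail-out next to the existing idle check
at the top of nohz_idle_balance()):

	/* bail out up front, before nohz.next_balance is updated and
	 * before we start walking nohz.idle_cpus_mask */
	if (idle != CPU_IDLE || need_resched() ||
	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
		goto end;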

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  5:07   ` Jason Low
  2015-03-27  5:39     ` Srikar Dronamraju
@ 2015-03-27  6:43     ` Wanpeng Li
  2015-03-27 16:23     ` Preeti U Murthy
  2 siblings, 0 replies; 33+ messages in thread
From: Wanpeng Li @ 2015-03-27  6:43 UTC (permalink / raw)
  To: Jason Low
  Cc: Preeti U Murthy, peterz, mingo, riel, daniel.lezcano,
	vincent.guittot, srikar, pjt, benh, efault, linux-kernel,
	iamjoonsoo.kim, svaidy, tim.c.chen, morten.rasmussen, Wanpeng Li

On Thu, Mar 26, 2015 at 10:07:21PM -0700, Jason Low wrote:
>On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
>> Hi Preeti,
>> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>> >
>> >1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>> >load balancing [Given the experiment, up to 6 CPUs per core could be
>> >potentially idle in this domain.]
>> >
>> >2. However the ILB CPU would call load_balance() on itself before
>> >initiating nohz idle load balancing.
>> >
>> >3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>> >tasks from its sibling cores to even out load.
>> >
>> >4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>> >load balancing
>> 
>> I don't see nohz idle load balancing being aborted when the ILB CPU is no
>> longer idle in nohz_idle_balance(); could you explain in more detail?
>
>Hi Wanpeng,
>
>In nohz_idle_balance(), there is a check for need_resched() so if the
>cpu has something to run, it should exit nohz_idle_balance(), which may
>cause it to not do the idle balancing on the other CPUs.

Got it, thanks. ;)

Regards,
Wanpeng Li 

>
>


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  5:39     ` Srikar Dronamraju
@ 2015-03-27  7:00       ` Wanpeng Li
  0 siblings, 0 replies; 33+ messages in thread
From: Wanpeng Li @ 2015-03-27  7:00 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Jason Low, Wanpeng Li, Preeti U Murthy, peterz, mingo, riel,
	daniel.lezcano, vincent.guittot, pjt, benh, efault, linux-kernel,
	iamjoonsoo.kim, svaidy, tim.c.chen, morten.rasmussen

Hi Srikar,
On Fri, Mar 27, 2015 at 11:09:07AM +0530, Srikar Dronamraju wrote:
>
>Yes, the need_resched() in nohz_idle_balance() would exit
>nohz_idle_balance() if the cpu has something to run. However, I wonder if
>we should move the need_resched() check out of the for loop, i.e. the
>need_resched() check should probably sit with the idle check.
>
>With the current code, when the ilb cpus are not free:
>- We would be updating nohz.next_balance even though we haven't done
>  any load balancing.
>- We might iterate through nohz.idle_cpus_mask to find balance_cpus.

Good idea. The nohz load balancing will be delayed, since nohz.next_balance
is updated inappropriately when the current ILB CPU is busy. I will send a
patch to fix it. ;)

Regards,
Wanpeng Li 


* [tip:sched/core] sched: Improve load balancing in the presence of idle CPUs
  2015-03-26 13:02 [PATCH V2] sched: Improve load balancing in the presence of idle CPUs Preeti U Murthy
  2015-03-26 17:03 ` Jason Low
  2015-03-27  2:12 ` Wanpeng Li
@ 2015-03-27 11:43 ` tip-bot for Preeti U Murthy
  2015-03-27 13:03 ` [PATCH V2] " Srikar Dronamraju
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: tip-bot for Preeti U Murthy @ 2015-03-27 11:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, peterz, jason.low2, preeti, hpa, linux-kernel

Commit-ID:  d4573c3e1c992668f5dcd57d1c2ced56ae9650b9
Gitweb:     http://git.kernel.org/tip/d4573c3e1c992668f5dcd57d1c2ced56ae9650b9
Author:     Preeti U Murthy <preeti@linux.vnet.ibm.com>
AuthorDate: Thu, 26 Mar 2015 18:32:44 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:09 +0100

sched: Improve load balancing in the presence of idle CPUs

When a CPU is kicked to do nohz idle balancing, it wakes up to do load
balancing on itself, followed by load balancing on behalf of idle CPUs.
But it may end up with load after the load balancing attempt on itself.
This aborts nohz idle balancing. As a result several idle CPUs are left
without tasks till such a time that an ILB CPU finds it unfavorable to
pull tasks upon itself. This delays spreading of load across idle CPUs
and worse, clutters only a few CPUs with tasks.

The effect of the above problem was observed on an SMT8 POWER server
with 2 levels of numa domains. Busy loops equal to the number of cores were
spawned. Since load balancing on fork/exec is discouraged across numa
domains, all busy loops would start on one of the numa domains. However
it was expected that eventually one busy loop would run per core across
all domains due to nohz idle load balancing. But it was observed that it
took as long as 10 seconds to spread the load across numa domains.

Further investigation showed that this was a consequence of the
following:

 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
    load balancing [Given the experiment, up to 6 CPUs per core could be
    potentially idle in this domain.]

 2. However the ILB CPU would call load_balance() on itself before
    initiating nohz idle load balancing.

 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
    tasks from its sibling cores to even out load.

 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
    load balancing

As a result the opportunities to spread load across numa domains were
lost until such a time that the cores within the first numa domain had
equal number of tasks among themselves.  This is a pretty bad scenario,
since the cores within the first numa domain would have as many as 4
tasks each, while cores in the neighbouring numa domains would all
remain idle.

Fix this by checking if a CPU was woken up to do nohz idle load
balancing, before it does load balancing upon itself. This way we allow
idle CPUs across the system to do load balancing which results in
quicker spread of load, instead of performing load balancing within the
local sched domain hierarchy of the ILB CPU alone under circumstances
such as above.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jason Low <jason.low2@hp.com>
Cc: benh@kernel.crashing.org
Cc: daniel.lezcano@linaro.org
Cc: efault@gmx.de
Cc: iamjoonsoo.kim@lge.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: riel@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: svaidy@linux.vnet.ibm.com
Cc: tim.c.chen@linux.intel.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a798ec..46855d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7753,14 +7753,16 @@ static void run_rebalance_domains(struct softirq_action *h)
 	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
-	rebalance_domains(this_rq, idle);
-
 	/*
 	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
-	 * stopped.
+	 * stopped. Do nohz_idle_balance *before* rebalance_domains to
+	 * give the idle cpus a chance to load balance. Else we may
+	 * load balance only within the local sched_domain hierarchy
+	 * and abort nohz_idle_balance altogether if we pull some load.
 	 */
 	nohz_idle_balance(this_rq, idle);
+	rebalance_domains(this_rq, idle);
 }
 
 /*


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-26 13:02 [PATCH V2] sched: Improve load balancing in the presence of idle CPUs Preeti U Murthy
                   ` (2 preceding siblings ...)
  2015-03-27 11:43 ` [tip:sched/core] " tip-bot for Preeti U Murthy
@ 2015-03-27 13:03 ` Srikar Dronamraju
  2015-03-27 14:38 ` Morten Rasmussen
  2015-03-30 13:45 ` Vincent Guittot
  5 siblings, 0 replies; 33+ messages in thread
From: Srikar Dronamraju @ 2015-03-27 13:03 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, pjt, benh,
	efault, linux-kernel, iamjoonsoo.kim, svaidy, tim.c.chen,
	morten.rasmussen, jason.low2

> When a CPU is kicked to do nohz idle balancing, it wakes up to do load
> balancing on itself, followed by load balancing on behalf of idle CPUs.
> But it may end up with load after the load balancing attempt on itself.
> This aborts nohz idle balancing. As a result several idle CPUs are left
> without tasks till such a time that an ILB CPU finds it unfavorable to
> pull tasks upon itself. This delays spreading of load across idle CPUs
> and worse, clutters only a few CPUs with tasks.
> 

[..... snipped .... ]

> Fix this by checking if a CPU was woken up to do nohz idle load
> balancing, before it does load balancing upon itself. This way we allow
> idle CPUs across the system to do load balancing which results in
> quicker spread of load, instead of performing load balancing within the
> local sched domain hierarchy of the ILB CPU alone under circumstances
> such as above.
> 
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---


Looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-26 13:02 [PATCH V2] sched: Improve load balancing in the presence of idle CPUs Preeti U Murthy
                   ` (3 preceding siblings ...)
  2015-03-27 13:03 ` [PATCH V2] " Srikar Dronamraju
@ 2015-03-27 14:38 ` Morten Rasmussen
  2015-03-27 16:46   ` Preeti U Murthy
  2015-03-30 13:45 ` Vincent Guittot
  5 siblings, 1 reply; 33+ messages in thread
From: Morten Rasmussen @ 2015-03-27 14:38 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

Hi Preeti,

On Thu, Mar 26, 2015 at 01:02:44PM +0000, Preeti U Murthy wrote:
> Fix this by checking if a CPU was woken up to do nohz idle load
> balancing, before it does load balancing upon itself. This way we allow
> idle CPUs across the system to do load balancing which results in
> quicker spread of load, instead of performing load balancing within the
> local sched domain hierarchy of the ILB CPU alone under circumstances
> such as above.
> 
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---
> Changes from V1:
> 1. Added relevant comments
> 2. Wrapped lines to a fixed width in the changelog
> 
>  kernel/sched/fair.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcfe320..8b6d0d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
>  	enum cpu_idle_type idle = this_rq->idle_balance ?
>  						CPU_IDLE : CPU_NOT_IDLE;
>  
> -	rebalance_domains(this_rq, idle);
> -
>  	/*
>  	 * If this cpu has a pending nohz_balance_kick, then do the
>  	 * balancing on behalf of the other idle cpus whose ticks are
> -	 * stopped.
> +	 * stopped. Do nohz_idle_balance *before* rebalance_domains to
> +	 * give the idle cpus a chance to load balance. Else we may
> +	 * load balance only within the local sched_domain hierarchy
> +	 * and abort nohz_idle_balance altogether if we pull some load.
>  	 */
>  	nohz_idle_balance(this_rq, idle);
> +	rebalance_domains(this_rq, idle);

IIUC, this change means that you will always wake up one more cpu than
necessary unless you have enough work for all cpus in the system. For
example, cpu0 is busy with two tasks and cpu1+2 are nohz_idle. cpu0
kicks cpu1 to do a nohz_idle_balance(). With the change it will balance
on behalf of cpu2 first and pull one of the tasks from cpu0. When done
with nohz_idle_balance() cpu1 has nothing left to pull when balancing
itself and goes back to sleep.

My concern is that this will increase the number of cpu wakeups quite
significantly. Am I missing something?

Morten


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27  5:07   ` Jason Low
  2015-03-27  5:39     ` Srikar Dronamraju
  2015-03-27  6:43     ` Wanpeng Li
@ 2015-03-27 16:23     ` Preeti U Murthy
  2 siblings, 0 replies; 33+ messages in thread
From: Preeti U Murthy @ 2015-03-27 16:23 UTC (permalink / raw)
  To: Jason Low, Wanpeng Li
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, morten.rasmussen, Gautham R. Shenoy

Hi Wanpeng, Jason,

On 03/27/2015 10:37 AM, Jason Low wrote:
> On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
>> Hi Preeti,
>> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>>>
>>> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>>> load balancing [Given the experiment, up to 6 CPUs per core could be
>>> potentially idle in this domain.]
>>>
>>> 2. However the ILB CPU would call load_balance() on itself before
>>> initiating nohz idle load balancing.
>>>
>>> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>>> tasks from its sibling cores to even out load.
>>>
>>> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>>> load balancing
>>
>> I don't see nohz idle load balancing being aborted when the ILB CPU is no
>> longer idle in nohz_idle_balance(); could you explain in more detail?
> 
> Hi Wanpeng,
> 
> In nohz_idle_balance(), there is a check for need_resched() so if the
> cpu has something to run, it should exit nohz_idle_balance(), which may
> cause it to not do the idle balancing on the other CPUs.

You are right, it is need_resched() causing the issue. I am sorry I
overlooked that 'idle' was not a pointer. Thanks a lot for pointing out
the issue :)

But the patch is still useful, as you will agree, and the changelog makes
no specific mention of the 'idle' parameter but describes the issue in
general. Hence, we are good.

Regards
Preeti U Murthy

> 
> 
> 



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27 14:38 ` Morten Rasmussen
@ 2015-03-27 16:46   ` Preeti U Murthy
  2015-03-27 17:56     ` Morten Rasmussen
  0 siblings, 1 reply; 33+ messages in thread
From: Preeti U Murthy @ 2015-03-27 16:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

Hi Morten,

On 03/27/2015 08:08 PM, Morten Rasmussen wrote:
> Hi Preeti,
> 
> On Thu, Mar 26, 2015 at 01:02:44PM +0000, Preeti U Murthy wrote:
>> Fix this by checking if a CPU was woken up to do nohz idle load
>> balancing, before it does load balancing upon itself. This way we allow
>> idle CPUs across the system to do load balancing which results in
>> quicker spread of load, instead of performing load balancing within the
>> local sched domain hierarchy of the ILB CPU alone under circumstances
>> such as above.
>>
>> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>> ---
>> Changes from V1:
>> 1. Added relevant comments
>> 2. Wrapped lines to a fixed width in the changelog
>>
>>  kernel/sched/fair.c |    8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bcfe320..8b6d0d5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
>>  	enum cpu_idle_type idle = this_rq->idle_balance ?
>>  						CPU_IDLE : CPU_NOT_IDLE;
>>  
>> -	rebalance_domains(this_rq, idle);
>> -
>>  	/*
>>  	 * If this cpu has a pending nohz_balance_kick, then do the
>>  	 * balancing on behalf of the other idle cpus whose ticks are
>> -	 * stopped.
>> +	 * stopped. Do nohz_idle_balance *before* rebalance_domains to
>> +	 * give the idle cpus a chance to load balance. Else we may
>> +	 * load balance only within the local sched_domain hierarchy
>> +	 * and abort nohz_idle_balance altogether if we pull some load.
>>  	 */
>>  	nohz_idle_balance(this_rq, idle);
>> +	rebalance_domains(this_rq, idle);
> 
> IIUC, this change means that you will always wake up one more cpu than
> necessary unless you have enough work for all cpus in the system. For
> example, cpu0 is busy with two tasks and cpu1+2 are nohz_idle. cpu0
> kicks cpu1 to do a nohz_idle_balance(). With the change it will balance
> on behalf of cpu2 first and pull one of the tasks from cpu0. When done
> with nohz_idle_balance() cpu1 has nothing left to pull when balancing
> itself and goes back to sleep.
> 
> My concern is that this will increase the number of cpu wakeups quite
> significantly. Am I missing something?

It's true that we wake up all idle CPUs. But I think we are justified in
doing so, given that nohz_idle_balance() was deemed necessary. The logic
behind nohz_idle_balance() as I see it is that all idle CPUs should be
brought into action when some scheduling group is found to be busy. With
the current code that does not happen if one of them happens to pull
tasks. This logic does not make sense to me.

With the current code, I think it is hard to estimate how many idle CPU
wakeups would be sufficient to balance out the system load. But I
certainly feel that waking up all of them to perform the load balancing
that was asked for is far better than waking up none of them. This is
in favor of performance. I agree the extra wakeups would hurt power
savings, but I would still be fine with it, given the undesirable
scenario that occurs as a consequence, as described in the changelog.

Regards
Preeti U Murthy
> 
> Morten
> 



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27 16:46   ` Preeti U Murthy
@ 2015-03-27 17:56     ` Morten Rasmussen
  2015-03-30  7:26       ` Preeti U Murthy
  2015-03-30 11:06       ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Morten Rasmussen @ 2015-03-27 17:56 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Fri, Mar 27, 2015 at 04:46:30PM +0000, Preeti U Murthy wrote:
> Hi Morten,
> 
> On 03/27/2015 08:08 PM, Morten Rasmussen wrote:
> > Hi Preeti,
> > 
> > On Thu, Mar 26, 2015 at 01:02:44PM +0000, Preeti U Murthy wrote:
> >> Fix this by checking if a CPU was woken up to do nohz idle load
> >> balancing, before it does load balancing upon itself. This way we allow
> >> idle CPUs across the system to do load balancing which results in
> >> quicker spread of load, instead of performing load balancing within the
> >> local sched domain hierarchy of the ILB CPU alone under circumstances
> >> such as above.
> >>
> >> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> >> ---
> >> Changes from V1:
> >> 1. Added relevant comments
> >> 2. Wrapped lines to a fixed width in the changelog
> >>
> >>  kernel/sched/fair.c |    8 +++++---
> >>  1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index bcfe320..8b6d0d5 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
> >>  	enum cpu_idle_type idle = this_rq->idle_balance ?
> >>  						CPU_IDLE : CPU_NOT_IDLE;
> >>  
> >> -	rebalance_domains(this_rq, idle);
> >> -
> >>  	/*
> >>  	 * If this cpu has a pending nohz_balance_kick, then do the
> >>  	 * balancing on behalf of the other idle cpus whose ticks are
> >> -	 * stopped.
> >> +	 * stopped. Do nohz_idle_balance *before* rebalance_domains to
> >> +	 * give the idle cpus a chance to load balance. Else we may
> >> +	 * load balance only within the local sched_domain hierarchy
> >> +	 * and abort nohz_idle_balance altogether if we pull some load.
> >>  	 */
> >>  	nohz_idle_balance(this_rq, idle);
> >> +	rebalance_domains(this_rq, idle);
> > 
> > IIUC, this change means that you will always wake up one more cpu than
> > necessary unless you have enough work for all cpus in the system. For
> > example, cpu0 is busy with two tasks and cpu1+2 are nohz_idle. cpu0
> > kicks cpu1 to do a nohz_idle_balance(). With the change it will balance
> > on behalf of cpu2 first and pull one of the tasks from cpu0. When done
> > with nohz_idle_balance() cpu1 has nothing left to pull when balancing
> > itself and goes back to sleep.
> > 
> > My concern is that this will increase the number of cpu wakeups quite
> > significantly. Am I missing something?
> 
> It's true that we wake up all idle CPUs. But I think we are justified in
> doing so, given that nohz_idle_balance() was deemed necessary. The logic
> behind nohz_idle_balance() as I see it is that all idle CPUs should be
> brought into action when some scheduling group is found to be busy. With
> the current code that does not happen if one of them happens to pull
> tasks. This logic does not make sense to me.
> 
> With the current code, I think it is hard to estimate how many idle CPU
> wakeups would be sufficient to balance out the system load. But I
> certainly feel that waking up all of them to perform the load balancing
> that was asked for is far better than waking up none of them. This is
> in favor of performance. I agree the extra wakeups would hurt power
> savings, but I would still be fine with it, given the undesirable
> scenario that occurs as a consequence, as described in the changelog.

I agree that the current behaviour is undesirable and should be fixed,
but IMHO waking up all idle cpus can not be justified. It is only one
additional cpu though with your patch so it isn't quite that bad.

I agree that it is hard to predict how many additional cpus you need,
but I don't think you necessarily need that information as long as you
start by filling up the cpu that was kicked to do the
nohz_idle_balance() first.

You would also solve your problem if you removed the ability for the cpu
to bail out after balancing itself and forced it to finish the job. It
would mean harming tasks that were pulled to the balancing cpu, as they
would have to wait to be scheduled until the nohz_idle_balance() has
completed. It could be a price worth paying.

An alternative could be to let the balancing cpu balance itself first
and bail out as it currently does, but let it kick the next nohz_idle
cpu to continue the job if it thinks there is more work to be done. So
you would get a chain of kicks that would stop when there is nothing
more to be done. It isn't quite as fast as your solution, as it would
require an IPI plus wakeup for each cpu to continue the work. But it
should be much faster than the current code I think.
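
A hypothetical sketch of such a kick chain (the placement and the choice
of the next cpu are invented for illustration, and none of this is
tested):

	/* in nohz_idle_balance(), where the balancer currently just bails: */
	if (need_resched()) {
		/* hand the remaining work to the next nohz-idle cpu */
		int next = cpumask_next(this_cpu, nohz.idle_cpus_mask);

		if (next < nr_cpu_ids &&
		    !test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(next)))
			smp_send_reschedule(next); /* as nohz_balancer_kick() does */
		break;
	}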

IMHO it makes more sense to stay with the current scheme of ensuring
that the kicked cpu is actually used before waking up more cpus and
instead improve how additional cpus are kicked if they are needed.

Reducing unnecessary wakeups is quite important for energy consumption
and something a lot of effort is put into. You really don't want to wake
up another cluster/package unnecessarily just because there was only one
nohz-idle cpu left in the previous one which could have handled the
additional load. It gets even worse if the other cluster is less
energy-efficient (big.LITTLE).

Morten


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27 17:56     ` Morten Rasmussen
@ 2015-03-30  7:26       ` Preeti U Murthy
  2015-03-30 11:30         ` Morten Rasmussen
  2015-03-30 11:06       ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Preeti U Murthy @ 2015-03-30  7:26 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

Hi Morten,

On 03/27/2015 11:26 PM, Morten Rasmussen wrote:
> 
> I agree that the current behaviour is undesirable and should be fixed,
> but IMHO waking up all idle cpus can not be justified. It is only one
> additional cpu though with your patch so it isn't quite that bad.
> 
> I agree that it is hard to predict how many additional cpus you need,
> but I don't think you necessarily need that information as long as you
> start by filling up the cpu that was kicked to do the
> nohz_idle_balance() first.
> 
> > You would also solve your problem if you removed the ability for the cpu
> > to bail out after balancing itself and forced it to finish the job. It
> > would mean harming tasks that were pulled to the balancing cpu, as they
> > would have to wait to be scheduled until the nohz_idle_balance() has
> > completed. It could be a price worth paying.

But how would this prevent waking up idle CPUs? You still end up waking
up all idle CPUs, wouldn't you?

> 
> An alternative could be to let the balancing cpu balance itself first
> and bail out as it currently does, but let it kick the next nohz_idle
> cpu to continue the job if it thinks there is more work to be done. So
> you would get a chain of kicks that would stop when there is nothing
> more to be done. It isn't quite as fast as your solution, as it would

I am afraid there is more to this. If a given CPU is unable to pull
tasks, it could mean that it is an unworthy destination CPU. But it does
not mean that the other idle CPUs are unworthy of balancing too.

So if the ILB CPU stops waking up idle CPUs when it has nothing to pull,
we will end up hurting load balancing. Take for example the scenario
described in the changelog. The idle CPUs within a numa node may find the
load balanced among themselves and hence refrain from pulling any load.
If these ILB CPUs stop nohz idle load balancing at this point, the load
will never get spread across nodes.

If, on the other hand, we keep kicking idle CPUs to carry on idle load
balancing, the wakeup scenario will be no better than it is with this patch.

> require an IPI plus wakeup for each cpu to continue the work. But it
> should be much faster than the current code I think.
> 
> IMHO it makes more sense to stay with the current scheme of ensuring
> that the kicked cpu is actually used before waking up more cpus and
> instead improve how additional cpus are kicked if they are needed.

It looks more sensible to do this in parallel. The scenario on POWER is
that tasks don't spread out across nodes until 10s after fork. This is
unforgivable and we cannot afford the code to be the way it is today.

Regards
Preeti U Murthy



* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-27 17:56     ` Morten Rasmussen
  2015-03-30  7:26       ` Preeti U Murthy
@ 2015-03-30 11:06       ` Peter Zijlstra
  2015-03-30 12:03         ` Morten Rasmussen
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2015-03-30 11:06 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Preeti U Murthy, mingo, riel, daniel.lezcano, vincent.guittot,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:

> I agree that it is hard to predict how many additional cpus you need,
> but I don't think you necessarily need that information as long as you
> start by filling up the cpu that was kicked to do the
> nohz_idle_balance() first.

> Reducing unnecessary wakeups is quite important for energy consumption
> and something a lot of effort is put into. You really don't want to wake
> up another cluster/package unnecessarily just because there was only one
> nohz-idle cpu left in the previous one which could have handled the
> additional load. It gets even worse if the other cluster is less
> energy-efficient (big.LITTLE).

So the only way to get tasks to cross your cluster is by balancing that
domain. At this point we'll compute sg stats for either group
(=cluster).

The only thing we need to ensure is that it doesn't view the small
cluster as overloaded (as long as it really isn't, of course); as long as
it's not viewed as overloaded, it will not pull tasks from it into the big
cluster, no matter how many ILBs we run before the ILB duty cpu's
rebalance_domains() call.

I'm really not seeing the problem here.


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30  7:26       ` Preeti U Murthy
@ 2015-03-30 11:30         ` Morten Rasmussen
  0 siblings, 0 replies; 33+ messages in thread
From: Morten Rasmussen @ 2015-03-30 11:30 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz, mingo, riel, daniel.lezcano, vincent.guittot, srikar,
	pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Mon, Mar 30, 2015 at 08:26:19AM +0100, Preeti U Murthy wrote:
> Hi Morten,
> 
> On 03/27/2015 11:26 PM, Morten Rasmussen wrote:
> > 
> > I agree that the current behaviour is undesirable and should be fixed,
> > but IMHO waking up all idle cpus can not be justified. It is only one
> > additional cpu though with your patch so it isn't quite that bad.
> > 
> > I agree that it is hard to predict how many additional cpus you need,
> > but I don't think you necessarily need that information as long as you
> > start by filling up the cpu that was kicked to do the
> > nohz_idle_balance() first.
> > 
> > You would also solve your problem if you removed the ability for the cpu
> > to bail out after balancing itself and forced it to finish the job. It
> > would mean harming tasks that were pulled to the balancing cpu, as they
> > would have to wait to be scheduled until the nohz_idle_balance() has
> > completed. It could be a price worth paying.
> 
> But how would this prevent waking up idle CPUs? You still end up waking
> up all idle CPUs, wouldn't you?

That depends on the scenario. In your example from the changelog you
would. You have enough work for all the nohz-idle cpus and you will keep
iterating through all of them and pull work on their behalf and hence
wake them up. But in a scenario where you don't have enough work for all
nohz-idle cpus you are guaranteed that the balancer cpu has taken its
share and doesn't go back to sleep immediately after finishing
nohz_idle_balance(). So all cpus woken up will have a task to run,
including the balancer cpu.

> 
> > 
> > An alternative could be to let the balancing cpu balance itself first
> > and bail out as it currently does, but let it kick the next nohz_idle
> > cpu to continue the job if it thinks there is more work to be done. So
> > you would get a chain of kicks that would stop when there is nothing
> > more to be done. It isn't quite as fast as your solution, as it would
> 
> I am afraid there is more to this. If a given CPU is unable to pull
> tasks, it could mean that it is an unworthy destination CPU. But it does
> not mean that the other idle CPUs are unworthy of balancing too.
> 
> So if the ILB CPU stops waking up idle CPUs when it has nothing to pull,
> we will end up hurting load balancing. Take for example the scenario
> described in the changelog. The idle CPUs within a numa node may find the
> load balanced among themselves and hence refrain from pulling any load.
> If these ILB CPUs stop nohz idle load balancing at this point, the load
> will never get spread across nodes.
> 
> If, on the other hand, we keep kicking idle CPUs to carry on idle load
> balancing, the wakeup scenario will be no better than it is with this patch.

By more work to be done I didn't mean stopping when the balancer cpu
gives up; I meant stopping the kick chain when there is nothing more to be
balanced/pulled (or reasonably close). For example, use something like
the nohz_kick_needed() checks on the source cpu/group and stopping if
all cpus only have one runnable task. At least try to stop waking an
extra cpu when there is clearly no point in doing so.

> > require an IPI plus wakeup for each cpu to continue the work. But it
> > should be much faster than the current code I think.
> > 
> > IMHO it makes more sense to stay with the current scheme of ensuring
> > that the kicked cpu is actually used before waking up more cpus and
> > instead improve how additional cpus are kicked if they are needed.
> 
> It looks more sensible to do this in parallel. The scenario on POWER is
> that tasks don't spread out across nodes until 10s after fork. This is
> unforgivable and we cannot afford the code to be the way it is today.

You propose having multiple balancing cpus running in parallel?

I fully agree that the nohz-balancing behaviour should be improved. It
could use a few changes to improve energy-awareness as well. IMHO,
taking a double hit every time we need to wake up an additional cpu is
inefficient and goes against all the effort put into reducing wake-ups,
which is essential for saving energy.

One thing that would help reduce energy consumption and that vendors
carry out-of-tree is improving find_new_ilb() to pick the cheapest cpu
to be kicked for nohz-idle balancing. However, this improvement is
pointless if we are going to wake up an additional cpu to receive the
task(s) that need to be balanced.
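
A sketch of what such an out-of-tree pick might look like (purely
illustrative; cpu_wakeup_cost() is an invented helper standing in for
whatever platform-specific cost metric a vendor would use):

static inline int find_new_ilb(void)
{
	unsigned int best_cost = UINT_MAX;
	int cpu, ilb = nr_cpu_ids;

	/* prefer the idle cpu that is cheapest to wake, e.g. one in an
	 * already-active package, over a blind cpumask_first() */
	for_each_cpu(cpu, nohz.idle_cpus_mask) {
		unsigned int cost = cpu_wakeup_cost(cpu); /* hypothetical */

		if (idle_cpu(cpu) && cost < best_cost) {
			best_cost = cost;
			ilb = cpu;
		}
	}
	return ilb;
}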

I think a solution where we at least try to avoid waking up additional
cpus and vastly improves your 10s latency is possible.

Morten


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 11:06       ` Peter Zijlstra
@ 2015-03-30 12:03         ` Morten Rasmussen
  2015-03-30 12:24           ` Peter Zijlstra
  2015-03-31  8:58           ` Preeti U Murthy
  0 siblings, 2 replies; 33+ messages in thread
From: Morten Rasmussen @ 2015-03-30 12:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Preeti U Murthy, mingo, riel, daniel.lezcano, vincent.guittot,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:
> 
> > I agree that it is hard to predict how many additional cpus you need,
> > but I don't think you necessarily need that information as long as you
> > start by filling up the cpu that was kicked to do the
> > nohz_idle_balance() first.
> 
> > Reducing unnecessary wakeups is quite important for energy consumption
> > and something a lot of effort is put into. You really don't want to wake
> > up another cluster/package unnecessarily just because there was only one
> > nohz-idle cpu left in the previous one which could have handled the
> > additional load. It gets even worse if the other cluster is less
> > energy-efficient (big.LITTLE).
> 
> So the only way to get tasks to cross your cluster is by balancing that
> domain. At this point we'll compute sg stats for either group
> (=cluster).
> 
> The only thing we need to ensure is that it doesn't view the small
> cluster as overloaded (as long as it really isn't, of course); as long as
> it's not viewed as overloaded, it will not pull tasks from it into the big
> cluster, no matter how many ILBs we run before the ILB duty cpu's
> rebalance_domains() call.
> 
> I'm really not seeing the problem here.

I see. The group_classify() should take care of it in all cases of
balancing across clusters. You would be iterating over all cpus in the
other cluster running rebalance_domains() if the balancer cpu happens to
be the last one in the little cluster though. However, within the
cluster (in case you have 2 or more nohz-idle cpus) you still take a
double hit. No?
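
For reference, group_classify() around that time looked roughly like this
(a sketch; the exact conditions were in flux due to the concurrent
capacity/utilization rework, so field names may differ by tree):

static enum group_type group_classify(struct sched_group *group,
				      struct sg_lb_stats *sgs)
{
	if (sgs->group_no_capacity)
		return group_overloaded;

	if (sg_imbalanced(group))
		return group_imbalanced;

	return group_other;
}

As long as the small cluster's group does not classify as group_overloaded,
the big cluster will not pull from it, which is the point made above.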


* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 12:03         ` Morten Rasmussen
@ 2015-03-30 12:24           ` Peter Zijlstra
  2015-03-30 12:54             ` Peter Zijlstra
  2015-03-30 13:29             ` Vincent Guittot
  2015-03-31  8:58           ` Preeti U Murthy
  1 sibling, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2015-03-30 12:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Preeti U Murthy, mingo, riel, daniel.lezcano, vincent.guittot,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Mon, Mar 30, 2015 at 01:03:03PM +0100, Morten Rasmussen wrote:
> On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:
> > 
> > > I agree that it is hard to predict how many additional cpus you need,
> > > but I don't think you necessarily need that information as long as you
> > > start by filling up the cpu that was kicked to do the
> > > nohz_idle_balance() first.
> > 
> > > Reducing unnecessary wakeups is quite important for energy consumption
> > > and something a lot of effort is put into. You really don't want to wake
> > > up another cluster/package unnecessarily just because there was only one
> > > nohz-idle cpu left in the previous one which could have handled the
> > > additional load. It gets even worse if the other cluster is less
> > > energy-efficient (big.LITTLE).
> > 
> > So the only way to get tasks to cross your cluster is by balancing that
> > domain. At this point we'll compute sg stats for either group
> > (=cluster).
> > 
> > The only thing we need to ensure is that it doesn't view the small
> > cluster as overloaded (as long as it really isn't, of course); as long as
> > it's not viewed as overloaded, it will not pull tasks from it into the big
> > cluster, no matter how many ILBs we run before the ILB duty cpu's
> > rebalance_domains() call.
> > 
> > I'm really not seeing the problem here.
> 
> I see. The group_classify() should take care of it in all cases of
> balancing across clusters. You would be iterating over all cpus in the
> other cluster running rebalance_domains() if the balancer cpu happens to
> be the last one in the little cluster though. However, within the
> cluster (in case you have 2 or more nohz-idle cpus) you still take a
> double hit. No?

It can, yes, but typically not I think. This all could use some 'help'
for sure.

So the thing is, find_new_ilb() simply selects the first idle_cpus_mask
cpu, while at the same time, nohz_idle_balance() will iterate the
idle_cpus_mask with the first cpu first (obviously).

So it is very likely that if we migrate on the ILB, it is in fact to the
current CPU.

In case we cannot, we have no choice but to wake up a second idle,
nothing really to be done about that.

To put it another way, for ILB purposes the rebalance_domains() call is
mostly superfluous. The only other case is if the selected ILB target
became non-idle between being selected and getting to run the softirq
handler. At which point we should wake another anyhow too.

Maybe something like the below helps -- albeit it could use a comment
too of course.

---
 kernel/sched/fair.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..b879d4b3b599 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7624,11 +7624,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
 	int this_cpu = this_rq->cpu;
-	struct rq *rq;
 	int balance_cpu;
+	struct rq *rq;
+	bool done = false;
 
 	if (idle != CPU_IDLE ||
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
@@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 			break;
 
 		rq = cpu_rq(balance_cpu);
+		if (rq == this_rq)
+			done = true;
 
 		/*
 		 * If time for next balance is due,
@@ -7666,6 +7669,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	nohz.next_balance = this_rq->next_balance;
 end:
 	clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+
+	return done;
 }
 
 /*
@@ -7744,7 +7749,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	return kick;
 }
 #else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { return false; }
 #endif
 
 /*
@@ -7765,8 +7770,8 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 * load balance only within the local sched_domain hierarchy
 	 * and abort nohz_idle_balance altogether if we pull some load.
 	 */
-	nohz_idle_balance(this_rq, idle);
-	rebalance_domains(this_rq, idle);
+	if (!nohz_idle_balance(this_rq, idle))
+		rebalance_domains(this_rq, idle);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 12:24           ` Peter Zijlstra
@ 2015-03-30 12:54             ` Peter Zijlstra
  2015-03-30 13:29             ` Vincent Guittot
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2015-03-30 12:54 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Preeti U Murthy, mingo, riel, daniel.lezcano, vincent.guittot,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Mon, Mar 30, 2015 at 02:24:49PM +0200, Peter Zijlstra wrote:

> @@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  			break;
>  
>  		rq = cpu_rq(balance_cpu);
> +		if (rq == this_rq)
> +			done = true;
>  
>  		/*
>  		 * If time for next balance is due,

Equally note that if you change find_new_ilb() to not pick the very
first cpu in the mask you already get the bouncy bounce, because this
routine will in fact try and move things to the first cpu first.

So if you go change find_new_ilb() you should also amend the iteration
order here.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 12:24           ` Peter Zijlstra
  2015-03-30 12:54             ` Peter Zijlstra
@ 2015-03-30 13:29             ` Vincent Guittot
  2015-03-30 14:01               ` Peter Zijlstra
  2015-03-30 15:27               ` Morten Rasmussen
  1 sibling, 2 replies; 33+ messages in thread
From: Vincent Guittot @ 2015-03-30 13:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, Preeti U Murthy, mingo, riel, daniel.lezcano,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On 30 March 2015 at 14:24, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 30, 2015 at 01:03:03PM +0100, Morten Rasmussen wrote:
>> On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote:
>> > On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:
>> >
>> > > I agree that it is hard to predict how many additional cpus you need,
>> > > but I don't think you necessarily need that information as long as you
>> > > start by filling up the cpu that was kicked to do the
>> > > nohz_idle_balance() first.
>> >
>> > > Reducing unnecessary wakeups is quite important for energy consumption
>> > > and something a lot of effort is put into. You really don't want to wake
>> > > up another cluster/package unnecessarily just because there was only one
>> > > nohz-idle cpu left in the previous one which could have handled the
>> > > additional load. It gets even worse if the other cluster is less
>> > > energy-efficient (big.LITTLE).
>> >
>> > So the only way to get tasks to cross your cluster is by balancing that
>> > domain. At this point we'll compute sg stats for either group
>> > (=cluster).
>> >
>> > The only thing we need to ensure is that it doesn't view the small
>> > cluster as overloaded (as long as it really isn't of course), as long as
>> > it's not viewed as overloaded it will not pull tasks from it into the big
>> > cluster, no matter how many ILBs we run before the ILB duty cpu's
>> > rebalance_domains() call.
>> >
>> > I'm really not seeing the problem here.
>>
>> I see. The group_classify() should take care of it in all cases of
>> balancing across clusters. You would be iterating over all cpus in the
>> other cluster running rebalance_domains() if the balancer cpu happens to
>> be the last one in the little cluster though. However, within the
>> cluster (in case you have 2 or more nohz-idle cpus) you still take a
>> double hit. No?
>
> It can yes, but typically not I think. This all could use some 'help'
> for sure.
>
> So the thing is, find_new_ilb() simply selects the first cpu in
> idle_cpus_mask, while at the same time nohz_idle_balance() iterates
> idle_cpus_mask starting with that same first cpu (obviously).
>
> So it is very likely that if we migrate anything on the ILB, it is in
> fact to the current CPU.
>
> In case we cannot, we have no choice but to wake up a second idle,
> nothing really to be done about that.
>
> To put it another way, for ILB purposes the rebalance_domains() call is
> mostly superfluous. The only other case is if the selected ILB target
> became non-idle between being selected and getting to run the softirq
> handler. At which point we should wake another anyhow too.
>
> Maybe something like the below helps -- albeit it could use a comment
> too of course.
>
> ---
>  kernel/sched/fair.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26eb7218..b879d4b3b599 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7624,11 +7624,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>   * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
>   * rebalancing for all the cpus for whom scheduler ticks are stopped.
>   */
> -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  {
>         int this_cpu = this_rq->cpu;
> -       struct rq *rq;
>         int balance_cpu;
> +       struct rq *rq;
> +       bool done = false;
>
>         if (idle != CPU_IDLE ||
>             !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
> @@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>                         break;
>
>                 rq = cpu_rq(balance_cpu);
> +               if (rq == this_rq)
> +                       done = true;

AFAICT, this can't happen because we start the for_each_cpu() loop with:

	if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
		continue;

>
>                 /*
>                  * If time for next balance is due,
> @@ -7666,6 +7669,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>         nohz.next_balance = this_rq->next_balance;
>  end:
>         clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> +
> +       return done;
>  }
>
>  /*
> @@ -7744,7 +7749,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
>         return kick;
>  }
>  #else
> -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { return false; }
>  #endif
>
>  /*
> @@ -7765,8 +7770,8 @@ static void run_rebalance_domains(struct softirq_action *h)
>          * load balance only within the local sched_domain hierarchy
>          * and abort nohz_idle_balance altogether if we pull some load.
>          */
> -       nohz_idle_balance(this_rq, idle);
> -       rebalance_domains(this_rq, idle);
> +       if (!nohz_idle_balance(this_rq, idle))
> +               rebalance_domains(this_rq, idle);

nohz_idle_balance() runs rebalance_domains() for all CPUs except this CPU

>  }
>
>  /*

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-26 13:02 [PATCH V2] sched: Improve load balancing in the presence of idle CPUs Preeti U Murthy
                   ` (4 preceding siblings ...)
  2015-03-27 14:38 ` Morten Rasmussen
@ 2015-03-30 13:45 ` Vincent Guittot
  2015-03-31  8:06   ` Preeti U Murthy
  5 siblings, 1 reply; 33+ messages in thread
From: Vincent Guittot @ 2015-03-30 13:45 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, Daniel Lezcano,
	Srikar Dronamraju, Paul Turner, Benjamin Herrenschmidt,
	Mike Galbraith, linux-kernel, Joonsoo Kim,
	Vaidyanathan Srinivasan, Tim Chen, Morten Rasmussen, Jason Low

On 26 March 2015 at 14:02, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> When a CPU is kicked to do nohz idle balancing, it wakes up to do load
> balancing on itself, followed by load balancing on behalf of idle CPUs.
> But it may end up with load after the load balancing attempt on itself.
> This aborts nohz idle balancing. As a result several idle CPUs are left
> without tasks till such a time that an ILB CPU finds it unfavorable to
> pull tasks upon itself. This delays spreading of load across idle CPUs
> and worse, clutters only a few CPUs with tasks.
>
> The effect of the above problem was observed on an SMT8 POWER server
> with 2 levels of numa domains. Busy loops equal to number of cores were
> spawned. Since load balancing on fork/exec is discouraged across numa
> domains, all busy loops would start on one of the numa domains. However
> it was expected that eventually one busy loop would run per core across
> all domains due to nohz idle load balancing. But it was observed that it
> took as long as 10 seconds to spread the load across numa domains.

10 sec is quite long. Have you checked how many load balances are
needed to spread the load on the system? Are you using the default
min_interval and max_interval?
The default period range is [sd_weight : 2*sd_weight]. I don't know how
many CPUs you have but, as an example, a system made of 128 CPUs will
do a load balance across the whole system every 128 jiffies if the CPU
is idle and every 4096 jiffies on a busy CPU. This could explain why
you need so much time to spread tasks across the system.
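
For what it's worth, the 4096 figure follows from the default
busy_factor of 32 (128 * 32 = 4096). A simplified sketch of the
interval computation as it looked around then (see
get_sd_balance_interval() in kernel/sched/fair.c for the real code):

	static unsigned long get_sd_balance_interval(struct sched_domain *sd,
						     int cpu_busy)
	{
		unsigned long interval = sd->balance_interval;

		/* A busy CPU balances far less often than an idle one. */
		if (cpu_busy)
			interval *= sd->busy_factor;

		/* scale ms to jiffies and clamp to a sane range */
		interval = msecs_to_jiffies(interval);
		interval = clamp(interval, 1UL, max_load_balance_interval);

		return interval;
	}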

Vincent

>
> Further investigation showed that this was a consequence of the
> following:
>
> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
> load balancing [Given the experiment, upto 6 CPUs per core could be
> potentially idle in this domain.]
>
> 2. However the ILB CPU would call load_balance() on itself before
> initiating nohz idle load balancing.
>
> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
> tasks from its sibling cores to even out load.
>
> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
> load balancing
>
> As a result the opportunities to spread load across numa domains were
> lost until such a time that the cores within the first numa domain had
> equal number of tasks among themselves.  This is a pretty bad scenario,
> since the cores within the first numa domain would have as many as 4
> tasks each, while cores in the neighbouring numa domains would all
> remain idle.
>
> Fix this, by checking if a CPU was woken up to do nohz idle load
> balancing, before it does load balancing upon itself. This way we allow
> idle CPUs across the system to do load balancing which results in
> quicker spread of load, instead of performing load balancing within the
> local sched domain hierarchy of the ILB CPU alone under circumstances
> such as above.
>
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---
> Changes from V1:
> 1. Added relevant comments
> 2. Wrapped lines to a fixed width in the changelog
>
>  kernel/sched/fair.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcfe320..8b6d0d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
>         enum cpu_idle_type idle = this_rq->idle_balance ?
>                                                 CPU_IDLE : CPU_NOT_IDLE;
>
> -       rebalance_domains(this_rq, idle);
> -
>         /*
>          * If this cpu has a pending nohz_balance_kick, then do the
>          * balancing on behalf of the other idle cpus whose ticks are
> -        * stopped.
> +        * stopped. Do nohz_idle_balance *before* rebalance_domains to
> +        * give the idle cpus a chance to load balance. Else we may
> +        * load balance only within the local sched_domain hierarchy
> +        * and abort nohz_idle_balance altogether if we pull some load.
>          */
>         nohz_idle_balance(this_rq, idle);
> +       rebalance_domains(this_rq, idle);
>  }
>
>  /*
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 13:29             ` Vincent Guittot
@ 2015-03-30 14:01               ` Peter Zijlstra
  2015-03-30 15:27               ` Morten Rasmussen
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2015-03-30 14:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, Preeti U Murthy, mingo, riel, daniel.lezcano,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Mon, Mar 30, 2015 at 03:29:09PM +0200, Vincent Guittot wrote:
> On 30 March 2015 at 14:24, Peter Zijlstra <peterz@infradead.org> wrote:
> > @@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> >                         break;
> >
> >                 rq = cpu_rq(balance_cpu);
> > +               if (rq == this_rq)
> > +                       done = true;
> 
> AFAICT, this can't happen because we start the for_each_cpu() loop with:
>
> 	if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> 		continue;

Oh hey, look at me being blind ;-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 13:29             ` Vincent Guittot
  2015-03-30 14:01               ` Peter Zijlstra
@ 2015-03-30 15:27               ` Morten Rasmussen
  1 sibling, 0 replies; 33+ messages in thread
From: Morten Rasmussen @ 2015-03-30 15:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Preeti U Murthy, mingo, riel, daniel.lezcano,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen, jason.low2

On Mon, Mar 30, 2015 at 02:29:09PM +0100, Vincent Guittot wrote:
> On 30 March 2015 at 14:24, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Mar 30, 2015 at 01:03:03PM +0100, Morten Rasmussen wrote:
> >> On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote:
> >> > On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:
> >> >
> >> > > I agree that it is hard to predict how many additional cpus you need,
> >> > > but I don't think you necessarily need that information as long as you
> >> > > start by filling up the cpu that was kicked to do the
> >> > > nohz_idle_balance() first.
> >> >
> >> > > Reducing unnecessary wakeups is quite important for energy consumption
> >> > > and something a lot of effort is put into. You really don't want to wake
> >> > > up another cluster/package unnecessarily just because there was only one
> >> > > nohz-idle cpu left in the previous one which could have handled the
> >> > > additional load. It gets even worse if the other cluster is less
> >> > > energy-efficient (big.LITTLE).
> >> >
> >> > So the only way to get tasks to cross your cluster is by balancing that
> >> > domain. At this point we'll compute sg stats for either group
> >> > (=cluster).
> >> >
> >> > The only thing we need to ensure is that it doesn't view the small
> >> > cluster as overloaded (as long as it really isn't of course), as long as
> >> > it's not viewed as overloaded it will not pull tasks from it into the big
> >> > cluster, no matter how many ILBs we run before the ILB duty cpu's
> >> > rebalance_domains() call.
> >> >
> >> > I'm really not seeing the problem here.
> >>
> >> I see. The group_classify() should take care of it in all cases of
> >> balancing across clusters. You would be iterating over all cpus in the
> >> other cluster running rebalance_domains() if the balancer cpu happens to
> >> be the last one in the little cluster though. However, within the
> >> cluster (in case you have 2 or more nohz-idle cpus) you still take a
> >> double hit. No?
> >
> > It can yes, but typically not I think. This all could use some 'help'
> > for sure.
> >
> > So the thing is, find_new_ilb() simply selects the first cpu in
> > idle_cpus_mask, while at the same time nohz_idle_balance() iterates
> > idle_cpus_mask starting with that same first cpu (obviously).
> >
> > So it is very likely that if we migrate anything on the ILB, it is in
> > fact to the current CPU.
> >
> > In case we cannot, we have no choice but to wake up a second idle,
> > nothing really to be done about that.
> >
> > To put it another way, for ILB purposes the rebalance_domains() call is
> > mostly superfluous. The only other case is if the selected ILB target
> > became non-idle between being selected and getting to run the softirq
> > handler. At which point we should wake another anyhow too.

Apart from the issues that Vincent has already pointed out, including
the balancing of this_cpu in nohz_idle_balance() also means that we are
no longer ignoring rq->next_balance for the cpu being kicked. So we risk
kicking a cpu to do nohz_idle_balance which might ignore the fact that
we need an ILB (the kicker should have determined that already) due to
having done a load-balance recently.

Also, if we include this_cpu in nohz_idle_balance() and pick the first
nohz-idle cpu (as we currently do) we are back to where we were before
Preeti's patch. The balancer cpu will bail out if it pulls anything for
itself and not kick anybody to take over.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 13:45 ` Vincent Guittot
@ 2015-03-31  8:06   ` Preeti U Murthy
  0 siblings, 0 replies; 33+ messages in thread
From: Preeti U Murthy @ 2015-03-31  8:06 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, Daniel Lezcano,
	Srikar Dronamraju, Paul Turner, Benjamin Herrenschmidt,
	Mike Galbraith, linux-kernel, Joonsoo Kim,
	Vaidyanathan Srinivasan, Tim Chen, Morten Rasmussen, Jason Low

On 03/30/2015 07:15 PM, Vincent Guittot wrote:
> On 26 March 2015 at 14:02, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> When a CPU is kicked to do nohz idle balancing, it wakes up to do load
>> balancing on itself, followed by load balancing on behalf of idle CPUs.
>> But it may end up with load after the load balancing attempt on itself.
>> This aborts nohz idle balancing. As a result several idle CPUs are left
>> without tasks till such a time that an ILB CPU finds it unfavorable to
>> pull tasks upon itself. This delays spreading of load across idle CPUs
>> and worse, clutters only a few CPUs with tasks.
>>
>> The effect of the above problem was observed on an SMT8 POWER server
>> with 2 levels of numa domains. Busy loops equal to number of cores were
>> spawned. Since load balancing on fork/exec is discouraged across numa
>> domains, all busy loops would start on one of the numa domains. However
>> it was expected that eventually one busy loop would run per core across
>> all domains due to nohz idle load balancing. But it was observed that it
>> took as long as 10 seconds to spread the load across numa domains.
> 
> 10 sec is quite long. Have you checked how many load balances are
> needed to spread the load on the system? Are you using the default

The issue was that load balancing was not even being *attempted* due to
the above-mentioned reason. The ILB CPU would pull load and abort
nohz_idle_balance(), which was the only way load balancing could be
triggered on the idle CPUs. So it took a long time before load balancing
was invoked on the idle CPUs, after which the load spread quickly. There
was, after all, a stark imbalance in the load across the nodes.

> min_interval and max_interval?
> The default period range is [sd_weight : 2*sd_weight]. I don't know how
> many CPUs you have but, as an example, a system made of 128 CPUs will
> do a load balance across the whole system every 128 jiffies if the CPU
> is idle and every 4096 jiffies on a busy CPU. This could explain why
> you need so much time to spread tasks across the system.

Yes I did suspect this in the beginning but I could reproduce the
problem even on a machine with few cores.

Regards
Preeti U Murthy
> 
> Vincent
> 
>>
>> Further investigation showed that this was a consequence of the
>> following:
>>
>> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>> load balancing [Given the experiment, upto 6 CPUs per core could be
>> potentially idle in this domain.]
>>
>> 2. However the ILB CPU would call load_balance() on itself before
>> initiating nohz idle load balancing.
>>
>> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>> tasks from its sibling cores to even out load.
>>
>> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>> load balancing
>>
>> As a result the opportunities to spread load across numa domains were
>> lost until such a time that the cores within the first numa domain had
>> equal number of tasks among themselves.  This is a pretty bad scenario,
>> since the cores within the first numa domain would have as many as 4
>> tasks each, while cores in the neighbouring numa domains would all
>> remain idle.
>>
>> Fix this, by checking if a CPU was woken up to do nohz idle load
>> balancing, before it does load balancing upon itself. This way we allow
>> idle CPUs across the system to do load balancing which results in
>> quicker spread of load, instead of performing load balancing within the
>> local sched domain hierarchy of the ILB CPU alone under circumstances
>> such as above.
>>
>> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>> ---
>> Changes from V1:
>> 1. Added relevant comments
>> 2. Wrapped lines to a fixed width in the changelog
>>
>>  kernel/sched/fair.c |    8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bcfe320..8b6d0d5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
>>         enum cpu_idle_type idle = this_rq->idle_balance ?
>>                                                 CPU_IDLE : CPU_NOT_IDLE;
>>
>> -       rebalance_domains(this_rq, idle);
>> -
>>         /*
>>          * If this cpu has a pending nohz_balance_kick, then do the
>>          * balancing on behalf of the other idle cpus whose ticks are
>> -        * stopped.
>> +        * stopped. Do nohz_idle_balance *before* rebalance_domains to
>> +        * give the idle cpus a chance to load balance. Else we may
>> +        * load balance only within the local sched_domain hierarchy
>> +        * and abort nohz_idle_balance altogether if we pull some load.
>>          */
>>         nohz_idle_balance(this_rq, idle);
>> +       rebalance_domains(this_rq, idle);
>>  }
>>
>>  /*
>>
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-30 12:03         ` Morten Rasmussen
  2015-03-30 12:24           ` Peter Zijlstra
@ 2015-03-31  8:58           ` Preeti U Murthy
  2015-03-31 17:30             ` Jason Low
  1 sibling, 1 reply; 33+ messages in thread
From: Preeti U Murthy @ 2015-03-31  8:58 UTC (permalink / raw)
  To: Morten Rasmussen, Peter Zijlstra
  Cc: mingo, riel, daniel.lezcano, vincent.guittot, srikar, pjt, benh,
	efault, linux-kernel, iamjoonsoo.kim, svaidy, tim.c.chen,
	jason.low2

On 03/30/2015 05:33 PM, Morten Rasmussen wrote:
> On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote:
>> On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:
>>
>>> I agree that it is hard to predict how many additional cpus you need,
>>> but I don't think you necessarily need that information as long as you
>>> start by filling up the cpu that was kicked to do the
>>> nohz_idle_balance() first.
>>
>>> Reducing unnecessary wakeups is quite important for energy consumption
>>> and something a lot of effort is put into. You really don't want to wake
>>> up another cluster/package unnecessarily just because there was only one
>>> nohz-idle cpu left in the previous one which could have handled the
>>> additional load. It gets even worse if the other cluster is less
>>> energy-efficient (big.LITTLE).
>>
>> So the only way to get tasks to cross your cluster is by balancing that
>> domain. At this point we'll compute sg stats for either group
>> (=cluster).
>>
>> The only thing we need to ensure is that it doesn't view the small
>> cluster as overloaded (as long as it really isn't of course), as long as
>> it's not viewed as overloaded it will not pull tasks from it into the big
>> cluster, no matter how many ILBs we run before the ILB duty cpu's
>> rebalance_domains() call.
>>
>> I'm really not seeing the problem here.
> 
> I see. The group_classify() should take care of it in all cases of
> balancing across clusters. You would be iterating over all cpus in the
> other cluster running rebalance_domains() if the balancer cpu happens to
> be the last one in the little cluster though. However, within the
> cluster (in case you have 2 or more nohz-idle cpus) you still take a
> double hit. No?

Morten,

I am a bit confused about the problem you are pointing to.
nohz_idle_balance() does not kick the idle CPUs into action unless there
is work to be done. So there are no redundant wakeups. Hence I see no
problem here.

The ILB CPU is woken up to do the nohz idle balancing, but with this
patch, may end up with no work for itself at the end of
nohz_idle_balance() and return to sleep. That is one wakeup for merely
doing idle load balancing, but this wakeup is needed and worthwhile for
spreading the load. So I find no problem here either.

I am unable to see the issue. What is it that I am missing ?

Regards
Preeti U Murthy


> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-31  8:58           ` Preeti U Murthy
@ 2015-03-31 17:30             ` Jason Low
  2015-04-01  6:28               ` Preeti U Murthy
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Low @ 2015-03-31 17:30 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Morten Rasmussen, Peter Zijlstra, mingo, riel, daniel.lezcano,
	vincent.guittot, srikar, pjt, benh, efault, linux-kernel,
	iamjoonsoo.kim, svaidy, tim.c.chen, jason.low2

On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:

> Morten,

> I am a bit confused about the problem you are pointing to.

> I am unable to see the issue. What is it that I am missing ?

Hi Preeti,

Here is one of the potential issues that have been described from my
understanding.

In situations where there are just a few tasks to pull (for example,
there's 1 task to move).

Before, if CPU 1 calls run_rebalance_domains(), we'll pull the tasks to
this CPU 1 (which is already awake) and run the task on CPU 1.

Now, we'll pull the task to some idle CPU 2 and wake up CPU 2 in order
for the task to run. Meanwhile, CPU 1 may go idle, instead of running
the task on CPU 1 which was already awake.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-03-31 17:30             ` Jason Low
@ 2015-04-01  6:28               ` Preeti U Murthy
  2015-04-01 13:03                 ` Morten Rasmussen
  0 siblings, 1 reply; 33+ messages in thread
From: Preeti U Murthy @ 2015-04-01  6:28 UTC (permalink / raw)
  To: Jason Low, Morten Rasmussen
  Cc: Peter Zijlstra, mingo, riel, daniel.lezcano, vincent.guittot,
	srikar, pjt, benh, efault, linux-kernel, iamjoonsoo.kim, svaidy,
	tim.c.chen

On 03/31/2015 11:00 PM, Jason Low wrote:
> On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:
> 
>> Morten,
> 
>> I am a bit confused about the problem you are pointing to.
> 
>> I am unable to see the issue. What is it that I am missing ?
> 
> Hi Preeti,
> 
> Here is one of the potential issues that have been described from my
> understanding.
> 
> In situations where there are just a few tasks to pull (for example,
> there's 1 task to move).
> 
> Before, if CPU 1 calls run_rebalance_domains(), we'll pull the tasks to
> this CPU 1 (which is already awake) and run the task on CPU 1.
> 
> Now, we'll pull the task to some idle CPU 2 and wake up CPU 2 in order
> for the task to run. Meanwhile, CPU 1 may go idle, instead of running
> the task on CPU 1 which was already awake.
> 

Alright I see. But it is one additional wake up. And the wake up will be
within the cluster. We will not wake up any CPU in the neighboring
cluster unless there are tasks to be pulled. So, we can wake up a core
out of a deep idle state and never a cluster in the problem described.
In terms of energy efficiency, this is not so bad a scenario, is it?

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-04-01  6:28               ` Preeti U Murthy
@ 2015-04-01 13:03                 ` Morten Rasmussen
  2015-04-02  0:55                   ` Jason Low
  2015-04-02  3:22                   ` Preeti U Murthy
  0 siblings, 2 replies; 33+ messages in thread
From: Morten Rasmussen @ 2015-04-01 13:03 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Jason Low, Peter Zijlstra, mingo, riel, daniel.lezcano,
	vincent.guittot, srikar, pjt, benh, efault, linux-kernel,
	iamjoonsoo.kim, svaidy, tim.c.chen

Hi Preeti and Jason,

On Wed, Apr 01, 2015 at 07:28:03AM +0100, Preeti U Murthy wrote:
> On 03/31/2015 11:00 PM, Jason Low wrote:
> > On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:
> > 
> >> Morten,
> > 
> >> I am a bit confused about the problem you are pointing to.
> > 
> >> I am unable to see the issue. What is it that I am missing ?
> > 
> > Hi Preeti,
> > 
> > Here is one of the potential issues that have been described from my
> > understanding.
> > 
> > In situations where there are just a few tasks to pull (for example,
> > there's 1 task to move).
> > 
> > Before, if CPU 1 calls run_rebalance_domains(), we'll pull the tasks to
> > this CPU 1 (which is already awake) and run the task on CPU 1.
> > 
> > Now, we'll pull the task to some idle CPU 2 and wake up CPU 2 in order
> > for the task to run. Meanwhile, CPU 1 may go idle, instead of running
> > the task on CPU 1 which was already awake.

Yes. This is the scenario I had in mind although I might have failed to
make it crystal clear in my earlier replies.

> Alright I see. But it is one additional wake up. And the wake up will be
> within the cluster. We will not wake up any CPU in the neighboring
> cluster unless there are tasks to be pulled. So, we can wake up a core
> out of a deep idle state and never a cluster in the problem described.
> In terms of energy efficiency, this is not so bad a scenario, is it?

After Peter pointed out that it shouldn't happen across clusters due to
group_classify()/sg_capacity_factor() it isn't as bad as I initially
thought. It is still not an ideal solution I think. Wake-ups aren't nice
for battery-powered devices. Waking up a cpu in an already active
cluster may still imply powering up the core and bringing the L1 cache
into a usable state, but it isn't as bad as waking up a cluster. I would
prefer to avoid it if we can.
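
For reference, the classification in question looked roughly like this
at the time (simplified; see kernel/sched/fair.c for the real code):

	/*
	 * A group is only classified overloaded once the number of
	 * running tasks reaches its capacity factor, so a cluster that
	 * isn't genuinely overloaded is not a pull target for another
	 * cluster's idle balancer.
	 */
	static enum group_type
	group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
	{
		if (sgs->group_capacity_factor <= sgs->sum_nr_running)
			return group_overloaded;

		if (sg_imbalanced(group))
			return group_imbalanced;

		return group_other;
	}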

Thinking more about it, don't we also risk doing a lot of iterations in
nohz_idle_balance() leading to nothing (pure overhead) in certain corner
cases? Say find_new_ilb() picks the last cpu in the cluster and we have
one task for each cpu in the cluster but one cpu currently has two.
Don't we end up trying all nohz-idle cpus before giving up and balancing
the balancer cpu itself? On big machines, going through everyone could
take a while, I think. No?

Morten

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-04-01 13:03                 ` Morten Rasmussen
@ 2015-04-02  0:55                   ` Jason Low
  2015-04-02  3:22                   ` Preeti U Murthy
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Low @ 2015-04-02  0:55 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Preeti U Murthy, Peter Zijlstra, mingo, riel, daniel.lezcano,
	vincent.guittot, srikar, pjt, benh, efault, linux-kernel,
	iamjoonsoo.kim, svaidy, tim.c.chen, jason.low2

On Wed, 2015-04-01 at 14:03 +0100, Morten Rasmussen wrote:

Hi Morten,

> > Alright I see. But it is one additional wake up. And the wake up will be
> > within the cluster. We will not wake up any CPU in the neighboring
> > cluster unless there are tasks to be pulled. So, we can wake up a core
> > out of a deep idle state and never a cluster in the problem described.
> > In terms of energy efficiency, this is not so bad a scenario, is it?
> 
> After Peter pointed out that it shouldn't happen across clusters due to
> group_classify()/sg_capacity_factor() it isn't as bad as I initially
> thought. It is still not an ideal solution I think. Wake-ups aren't nice
> for battery-powered devices. Waking up a cpu in an already active
> cluster may still imply powering up the core and bringing the L1 cache
> into a usable state, but it isn't as bad as waking up a cluster. I would
> prefer to avoid it if we can.

Right. I still think that the patch is justified if it addresses the 10
second latency issue, but if we could find a better solution, that would
be great :)

> Thinking more about it, don't we also risk doing a lot of iterations in
> nohz_idle_balance() leading to nothing (pure overhead) in certain corner
> cases? Say find_new_ilb() picks the last cpu in the cluster and we have
> one task for each cpu in the cluster but one cpu currently has two.
> Don't we end up trying all nohz-idle cpus before giving up and balancing
> the balancer cpu itself? On big machines, going through everyone could
> take a while, I think. No?

Iterating through many CPUs could take a while, but since we only do
nohz_idle_balance() when the CPU is idle, and exit if need_resched() is
set, we're only doing so when there is nothing else that needs to run.

Also, we only attempt balancing a cpu when time_after_eq(jiffies,
rq->next_balance), so much of the time we don't actually traverse all
the CPUs.

So this may not be too big of an issue.
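
Both bail-outs are visible in the shape the nohz_idle_balance() loop
had at the time; a simplified sketch, not the exact tree state:

	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
			continue;

		/*
		 * If this cpu gets work to do, stop the load balancing
		 * work being done for other cpus.
		 */
		if (need_resched())
			break;

		rq = cpu_rq(balance_cpu);

		/* Only balance a remote cpu whose interval has expired. */
		if (time_after_eq(jiffies, rq->next_balance)) {
			raw_spin_lock_irq(&rq->lock);
			update_rq_clock(rq);
			update_idle_cpu_load(rq);
			raw_spin_unlock_irq(&rq->lock);
			rebalance_domains(rq, CPU_IDLE);
		}
	}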


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
  2015-04-01 13:03                 ` Morten Rasmussen
  2015-04-02  0:55                   ` Jason Low
@ 2015-04-02  3:22                   ` Preeti U Murthy
  1 sibling, 0 replies; 33+ messages in thread
From: Preeti U Murthy @ 2015-04-02  3:22 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Jason Low, Peter Zijlstra, mingo, riel, daniel.lezcano,
	vincent.guittot, srikar, pjt, benh, efault, linux-kernel,
	iamjoonsoo.kim, svaidy, tim.c.chen

Hi Morten,

On 04/01/2015 06:33 PM, Morten Rasmussen wrote:
>> Alright I see. But it is one additional wake up. And the wake up will be
>> within the cluster. We will not wake up any CPU in the neighboring
>> cluster unless there are tasks to be pulled. So, we can wake up a core
>> out of a deep idle state and never a cluster in the problem described.
>> In terms of energy efficiency, this is not so bad a scenario, is it?
> 
> After Peter pointed out that it shouldn't happen across clusters due to
> group_classify()/sg_capacity_factor() it isn't as bad as I initially
> thought. It is still not an ideal solution I think. Wake-ups aren't nice
> for battery-powered devices. Waking up a cpu in an already active
> cluster may still imply powering up the core and bringing the L1 cache
> into a usable state, but it isn't as bad as waking up a cluster. I would
> prefer to avoid it if we can.
> 
> Thinking more about it, don't we also risk doing a lot of iterations in
> nohz_idle_balance() leading to nothing (pure overhead) in certain corner
> cases? Say find_new_ilb() picks the last cpu in the cluster and we have
> one task for each cpu in the cluster but one cpu currently has two.
> Don't we end up trying all nohz-idle cpus before giving up and balancing
> the balancer cpu itself? On big machines, going through everyone could
> take a while, I think. No?

The balancer CPU will iterate as long as need_resched() is not set on
it. It may take a while, but if the balancer CPU has nothing to do, the
iteration does not come at a cost to performance.

Besides this, load balancing itself is optimized in terms of who does it
and how often. The candidate CPUs for load balancing are the first idle
CPUs in a given group. So nohz idle load balancing may abort on some of
the idle CPUs: if the CPUs before us in the group have not managed to
pull tasks, we abort load balancing too. We save time here (see the
sketch below).
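
The bail-out for non-first idle CPUs comes from should_we_balance();
roughly, as it looked at the time (simplified -- the real code also
filters on the group's balance mask):

	static int should_we_balance(struct lb_env *env)
	{
		struct sched_group *sg = env->sd->groups;
		int cpu, balance_cpu = -1;

		/* Newly idle cpus are always allowed to balance. */
		if (env->idle == CPU_NEWLY_IDLE)
			return 1;

		/* Try to find the first idle cpu in this group. */
		for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
			if (!idle_cpu(cpu))
				continue;

			balance_cpu = cpu;
			break;
		}

		if (balance_cpu == -1)
			balance_cpu = group_balance_cpu(sg);

		/*
		 * Only the first idle cpu (or, failing that, the group's
		 * designated balance cpu) balances at this and above
		 * domains; everyone else bails out.
		 */
		return balance_cpu == env->dst_cpu;
	}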

Load balancing on bigger sched domains is spaced out in time too. The
min_interval is equal to sd_weight and the balance_interval can be as
large as 2*sd_weight. This should ensure that load balancing across
large scheduling domains is not carried out too often. Nohz idle load
balancing may therefore not go through the entire scheduling domain
hierarchy for each CPU. This will cut down on the time too.

Regards
Preeti U Murthy
> 
> Morten
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2015-04-02  3:22 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-26 13:02 [PATCH V2] sched: Improve load balancing in the presence of idle CPUs Preeti U Murthy
2015-03-26 17:03 ` Jason Low
2015-03-27  2:12 ` Wanpeng Li
2015-03-27  4:33   ` Preeti U Murthy
2015-03-27  4:23     ` Wanpeng Li
2015-03-27  5:01     ` Jason Low
2015-03-27  5:07   ` Jason Low
2015-03-27  5:39     ` Srikar Dronamraju
2015-03-27  7:00       ` Wanpeng Li
2015-03-27  6:43     ` Wanpeng Li
2015-03-27 16:23     ` Preeti U Murthy
2015-03-27 11:43 ` [tip:sched/core] " tip-bot for Preeti U Murthy
2015-03-27 13:03 ` [PATCH V2] " Srikar Dronamraju
2015-03-27 14:38 ` Morten Rasmussen
2015-03-27 16:46   ` Preeti U Murthy
2015-03-27 17:56     ` Morten Rasmussen
2015-03-30  7:26       ` Preeti U Murthy
2015-03-30 11:30         ` Morten Rasmussen
2015-03-30 11:06       ` Peter Zijlstra
2015-03-30 12:03         ` Morten Rasmussen
2015-03-30 12:24           ` Peter Zijlstra
2015-03-30 12:54             ` Peter Zijlstra
2015-03-30 13:29             ` Vincent Guittot
2015-03-30 14:01               ` Peter Zijlstra
2015-03-30 15:27               ` Morten Rasmussen
2015-03-31  8:58           ` Preeti U Murthy
2015-03-31 17:30             ` Jason Low
2015-04-01  6:28               ` Preeti U Murthy
2015-04-01 13:03                 ` Morten Rasmussen
2015-04-02  0:55                   ` Jason Low
2015-04-02  3:22                   ` Preeti U Murthy
2015-03-30 13:45 ` Vincent Guittot
2015-03-31  8:06   ` Preeti U Murthy
