linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched: fix sched-domain avg_load calculation.
@ 2011-04-08  0:23 Ken Chen
  2011-04-08 11:15 ` Peter Zijlstra
  2011-04-11 10:46 ` [tip:sched/urgent] sched: Fix " tip-bot for Ken Chen
  0 siblings, 2 replies; 5+ messages in thread
From: Ken Chen @ 2011-04-08  0:23 UTC (permalink / raw)
  To: a.p.zijlstra, mingo; +Cc: linux-kernel

In function find_busiest_group(), the sched-domain avg_load isn't
calculated at all if there is a group imbalance within the domain.
This causes an erroneous imbalance calculation.  The reason is that
calculate_imbalance() sees sds->avg_load = 0 and dumps the entire
sds->max_load into the imbalance variable, which is later used to
migrate the entire load from the busiest CPU to the puller CPU.  This
has two really bad effects:

1. A stampede of task migrations, which cannot break out of the bad
   state because of a positive feedback loop: large load delta ->
   heavier load migration -> larger imbalance, and the cycle goes on.

2. Severe imbalance in CPU queue depth.  This causes very long
   scheduling latency blips, which badly affect applications with
   tight latency requirements.

The fix is to have the kernel calculate the domain avg_load in both
cases.  This ensures that the imbalance calculation is always sensible
and that the target is usually halfway between the busiest and the
puller CPU.
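
For illustration, here is a reduced userspace sketch of the effect.  It
is not the kernel code: the struct name, the numbers and the one-line
imbalance math are simplified stand-ins, and the real
calculate_imbalance() does considerably more.

/*
 * Reduced, illustrative sketch -- not the kernel implementation.
 * Field names mirror struct sd_lb_stats, but the imbalance math is
 * trimmed down to the part relevant to this bug.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL	/* same scale the kernel uses */

struct sd_lb_stats_sketch {
	unsigned long total_load;	/* sum of weighted load in the domain */
	unsigned long total_pwr;	/* sum of cpu_power in the domain */
	unsigned long avg_load;		/* scaled domain average */
	unsigned long max_load;		/* avg_load of the busiest group */
};

/* Roughly what the imbalance is based on: distance to the domain average. */
static unsigned long imbalance_sketch(const struct sd_lb_stats_sketch *sds)
{
	return sds->max_load - sds->avg_load;
}

int main(void)
{
	struct sd_lb_stats_sketch sds = {
		.total_load = 8192,
		.total_pwr  = 4096,
		.max_load   = 3072,
		.avg_load   = 0,	/* what the group_imb path left behind */
	};

	/* Broken: avg_load never computed, the whole max_load becomes the imbalance. */
	printf("broken: imbalance = %lu\n", imbalance_sketch(&sds));

	/* Fixed: compute the domain average unconditionally first. */
	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
	printf("fixed:  avg_load = %lu, imbalance = %lu\n",
	       sds.avg_load, imbalance_sketch(&sds));
	return 0;
}

With these made-up numbers, the broken path reports an imbalance of
3072 (the whole busiest-group load), while the fixed path reports 1024
(the distance to the domain average of 2048).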

Signed-off-by: Ken Chen <kenchen@google.com>
---
 kernel/sched_fair.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c7ec5c8..c46568a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3127,6 +3127,8 @@ find_busiest_group(
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;
 
+	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
+
 	/*
 	 * If the busiest group is imbalanced the below checks don't
 	 * work because they assumes all things are equal, which typically
@@ -3151,7 +3153,6 @@ find_busiest_group(
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
-	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
 	if (sds.this_load >= sds.avg_load)
 		goto out_balanced;
 
-- 
1.7.3.1



* Re: [PATCH] sched: fix sched-domain avg_load calculation.
  2011-04-08  0:23 [PATCH] sched: fix sched-domain avg_load calculation Ken Chen
@ 2011-04-08 11:15 ` Peter Zijlstra
  2011-04-08 19:29   ` Ken Chen
  2011-04-11 10:46 ` [tip:sched/urgent] sched: Fix " tip-bot for Ken Chen
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2011-04-08 11:15 UTC (permalink / raw)
  To: Ken Chen; +Cc: mingo, linux-kernel

On Thu, 2011-04-07 at 17:23 -0700, Ken Chen wrote:
> In function find_busiest_group(), the sched-domain avg_load isn't
> calculated at all if there is a group imbalance within the domain.
> This causes an erroneous imbalance calculation.  The reason is that
> calculate_imbalance() sees sds->avg_load = 0 and dumps the entire
> sds->max_load into the imbalance variable, which is later used to
> migrate the entire load from the busiest CPU to the puller CPU.  This
> has two really bad effects:
> 
> 1. A stampede of task migrations, which cannot break out of the bad
>    state because of a positive feedback loop: large load delta ->
>    heavier load migration -> larger imbalance, and the cycle goes on.
> 
> 2. Severe imbalance in CPU queue depth.  This causes very long
>    scheduling latency blips, which badly affect applications with
>    tight latency requirements.
> 
> The fix is to have the kernel calculate the domain avg_load in both
> cases.  This ensures that the imbalance calculation is always sensible
> and that the target is usually halfway between the busiest and the
> puller CPU.

Indeed so, it looks like I broke that in 866ab43efd32. Out of curiosity,
what kind of workload did you observe this on?



* Re: [PATCH] sched: fix sched-domain avg_load calculation.
  2011-04-08 11:15 ` Peter Zijlstra
@ 2011-04-08 19:29   ` Ken Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Ken Chen @ 2011-04-08 19:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel

On Fri, Apr 8, 2011 at 4:15 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-04-07 at 17:23 -0700, Ken Chen wrote:
>> In function find_busiest_group(), the sched-domain avg_load isn't
>> calculated at all if there is a group imbalance within the domain.
>> This causes an erroneous imbalance calculation.  The reason is that
>> calculate_imbalance() sees sds->avg_load = 0 and dumps the entire
>> sds->max_load into the imbalance variable, which is later used to
>> migrate the entire load from the busiest CPU to the puller CPU.  This
>> has two really bad effects:
>>
>> 1. A stampede of task migrations, which cannot break out of the bad
>>    state because of a positive feedback loop: large load delta ->
>>    heavier load migration -> larger imbalance, and the cycle goes on.
>>
>> 2. Severe imbalance in CPU queue depth.  This causes very long
>>    scheduling latency blips, which badly affect applications with
>>    tight latency requirements.
>>
>> The fix is to have the kernel calculate the domain avg_load in both
>> cases.  This ensures that the imbalance calculation is always sensible
>> and that the target is usually halfway between the busiest and the
>> puller CPU.
>
> Indeed so, it looks like I broke that in 866ab43efd32. Out of curiosity,
> what kind of workload did you observe this on?

This was observed on an application that serves websearch queries.
There was uneven CPU queue depth across the system, which led to a
long query latency tail.  The latency tail was both frequent in
occurrence and stretched out in time.

With this fix, both server throughput and latency response were improved.

- Ken


* [tip:sched/urgent] sched: Fix sched-domain avg_load calculation
  2011-04-08  0:23 [PATCH] sched: fix sched-domain avg_load calculation Ken Chen
  2011-04-08 11:15 ` Peter Zijlstra
@ 2011-04-11 10:46 ` tip-bot for Ken Chen
  2011-04-11 11:00   ` Peter Zijlstra
  1 sibling, 1 reply; 5+ messages in thread
From: tip-bot for Ken Chen @ 2011-04-11 10:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, stable, kenchen, tglx, mingo

Commit-ID:  b0432d8f162c7d5d9537b4cb749d44076b76a783
Gitweb:     http://git.kernel.org/tip/b0432d8f162c7d5d9537b4cb749d44076b76a783
Author:     Ken Chen <kenchen@google.com>
AuthorDate: Thu, 7 Apr 2011 17:23:22 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 11 Apr 2011 11:08:54 +0200

sched: Fix sched-domain avg_load calculation

In function find_busiest_group(), the sched-domain avg_load isn't
calculated at all if there is a group imbalance within the domain. This
causes an erroneous imbalance calculation.

The reason is that calculate_imbalance() sees sds->avg_load = 0 and
dumps the entire sds->max_load into the imbalance variable, which is
later used to migrate the entire load from the busiest CPU to the
puller CPU.

This has two really bad effects:

1. A stampede of task migrations, which cannot break out of the bad
   state because of a positive feedback loop: large load delta ->
   heavier load migration -> larger imbalance, and the cycle goes on.

2. Severe imbalance in CPU queue depth.  This causes very long
   scheduling latency blips, which badly affect applications with
   tight latency requirements.

The fix is to have the kernel calculate the domain avg_load in both
cases. This ensures that the imbalance calculation is always sensible
and that the target is usually halfway between the busiest and the
puller CPU.

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7f00772..60f9d40 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3127,6 +3127,8 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;
 
+	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
+
 	/*
 	 * If the busiest group is imbalanced the below checks don't
 	 * work because they assumes all things are equal, which typically
@@ -3151,7 +3153,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
-	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
 	if (sds.this_load >= sds.avg_load)
 		goto out_balanced;
 


* Re: [tip:sched/urgent] sched: Fix sched-domain avg_load calculation
  2011-04-11 10:46 ` [tip:sched/urgent] sched: Fix " tip-bot for Ken Chen
@ 2011-04-11 11:00   ` Peter Zijlstra
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2011-04-11 11:00 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, kenchen, stable, tglx, mingo; +Cc: linux-tip-commits

On Mon, 2011-04-11 at 10:46 +0000, tip-bot for Ken Chen wrote:
> Commit-ID:  b0432d8f162c7d5d9537b4cb749d44076b76a783
> Gitweb:     http://git.kernel.org/tip/b0432d8f162c7d5d9537b4cb749d44076b76a783
> Author:     Ken Chen <kenchen@google.com>
> AuthorDate: Thu, 7 Apr 2011 17:23:22 -0700
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Mon, 11 Apr 2011 11:08:54 +0200
> 
> sched: Fix sched-domain avg_load calculation
> 
> In function find_busiest_group(), the sched-domain avg_load isn't
> calculated at all if there is a group imbalance within the domain. This
> causes an erroneous imbalance calculation.
> 
> The reason is that calculate_imbalance() sees sds->avg_load = 0 and
> dumps the entire sds->max_load into the imbalance variable, which is
> later used to migrate the entire load from the busiest CPU to the
> puller CPU.
> 
> This has two really bad effects:
> 
> 1. A stampede of task migrations, which cannot break out of the bad
>    state because of a positive feedback loop: large load delta ->
>    heavier load migration -> larger imbalance, and the cycle goes on.
> 
> 2. Severe imbalance in CPU queue depth.  This causes very long
>    scheduling latency blips, which badly affect applications with
>    tight latency requirements.
> 
> The fix is to have the kernel calculate the domain avg_load in both
> cases. This ensures that the imbalance calculation is always sensible
> and that the target is usually halfway between the busiest and the
> puller CPU.
> 
> Signed-off-by: Ken Chen <kenchen@google.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: <stable@kernel.org>
> Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
> Signed-off-by: Ingo Molnar <mingo@elte.hu> 

This was caused by 866ab43ef ("sched: Fix the group_imb logic"), which
is only in .39-rc.


Thread overview: 5+ messages
2011-04-08  0:23 [PATCH] sched: fix sched-domain avg_load calculation Ken Chen
2011-04-08 11:15 ` Peter Zijlstra
2011-04-08 19:29   ` Ken Chen
2011-04-11 10:46 ` [tip:sched/urgent] sched: Fix " tip-bot for Ken Chen
2011-04-11 11:00   ` Peter Zijlstra
