* [PATCH 0/4][RFC v2] Improve load balancing when tasks have large weight differential
@ 2010-10-13 19:09 Nikhil Rao
  2010-10-13 19:09 ` [PATCH 1/4] sched: do not consider SCHED_IDLE tasks to be cache hot Nikhil Rao
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Nikhil Rao @ 2010-10-13 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
	Venkatesh Pallipadi
  Cc: linux-kernel, Nikhil Rao

Hi all,

Please find below a series of patches that improve load balancing when
there is a large weight differential between tasks (such as when nicing a task
or when using SCHED_IDLE). These patches are based on feedback from Peter
Zijlstra and Mike Galbraith on earlier posts.

Previous versions:
-v0: http://thread.gmane.org/gmane.linux.kernel/1015966
     Large weight differential leads to inefficient load balancing

-v1: http://thread.gmane.org/gmane.linux.kernel/1041721
     Improve load balancing when tasks have large weight differential

These patches can be applied to v2.6.36-rc7 or -tip without conflicts. Below
are some tests that highlight the improvements with this patchset.

1. 16 SCHED_IDLE soakers and 1 SCHED_NORMAL task on a 16-cpu machine.
Tested on a quad-core, quad-socket machine. Steps to reproduce (a minimal
soaker sketch follows the steps):
- spawn 16 SCHED_IDLE soaker tasks
- spawn one nice 0 task
- system utilization immediately drops to about 80% on v2.6.36-rc7
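
For reference, a minimal soaker along these lines can be used. This is only
an illustrative sketch (the file name, argument handling and direct use of
sched_setscheduler() are mine, not the exact harness behind the numbers below):

  /* soaker.c - burn cpu, optionally as SCHED_IDLE (illustrative sketch) */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
  	/* "./soaker idle" runs as SCHED_IDLE; "./soaker" stays SCHED_NORMAL */
  	if (argc > 1 && !strcmp(argv[1], "idle")) {
  		struct sched_param sp = { .sched_priority = 0 };

  		if (sched_setscheduler(0, SCHED_IDLE, &sp))
  			return 1;
  	}

  	for (;;)
  		;	/* soak */
  }

Spawning 16 instances with the "idle" argument and one without gives the mix
described above.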

v2.6.36-rc7

10:38:46 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:38:47 AM  all   80.69    0.00    0.50    0.00    0.00    0.00    0.00   18.82  14008.00
10:38:48 AM  all   85.09    0.06    0.50    0.00    0.00    0.00    0.00   14.35  14690.00
10:38:49 AM  all   86.83    0.06    0.44    0.00    0.00    0.00    0.00   12.67  14314.85
10:38:50 AM  all   79.89    0.00    0.37    0.00    0.00    0.00    0.00   19.74  14035.35
10:38:51 AM  all   87.94    0.06    0.44    0.00    0.00    0.00    0.00   11.56  14991.00
10:38:52 AM  all   83.27    0.06    0.37    0.00    0.00    0.00    0.00   16.29  14319.00
10:38:53 AM  all   94.37    0.13    0.50    0.00    0.00    0.00    0.00    5.00  15930.00
10:38:54 AM  all   87.06    0.06    0.62    0.00    0.00    0.06    0.00   12.19  14946.00
10:38:55 AM  all   88.68    0.06    0.38    0.00    0.00    0.00    0.00   10.88  14767.00
10:38:56 AM  all   80.16    0.00    1.06    0.00    0.00    0.00    0.00   18.78  13892.08
Average:     all   85.38    0.05    0.52    0.00    0.00    0.01    0.00   14.05  14588.91


v2.6.36-rc7 + patchset:

10:40:31 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:40:32 AM  all   99.25    0.00    0.75    0.00    0.00    0.00    0.00    0.00  16998.00
10:40:33 AM  all   99.75    0.00    0.19    0.00    0.00    0.06    0.00    0.00  16337.00
10:40:34 AM  all   98.75    0.00    1.25    0.00    0.00    0.00    0.00    0.00  17127.27
10:40:35 AM  all   99.06    0.00    0.94    0.00    0.00    0.00    0.00    0.00  16741.58
10:40:36 AM  all   99.50    0.06    0.44    0.00    0.00    0.00    0.00    0.00  16477.00
10:40:37 AM  all   99.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00  16868.69
10:40:38 AM  all   99.13    0.00    0.81    0.00    0.00    0.06    0.00    0.00  16761.39
10:40:39 AM  all   99.19    0.00    0.81    0.00    0.00    0.00    0.00    0.00  17501.00
10:40:40 AM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  18209.00
10:40:41 AM  all   99.19    0.00    0.81    0.00    0.00    0.00    0.00    0.00  16862.00
Average:     all   99.32    0.01    0.66    0.00    0.00    0.01    0.00    0.00  16987.80


2. Sub-optimal utilization in the presence of a niced task.
Tested on a dual-socket/quad-core machine w/ two cores on each socket disabled.
Steps to reproduce (an affinity-pinning sketch follows the steps):
- spawn 4 nice 0 soakers and one nice -15 soaker
- force all tasks onto one cpu by setting affinities
- reset the affinity masks
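
For reference, the affinity forcing/reset can be done along these lines, or
with taskset. This is only an illustrative sketch (the file name and the
choice of cpu0 are mine, not the exact harness used here):

  /* pin.c - pin a task to cpu0, or reset it to all cpus (illustrative sketch) */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>

  int main(int argc, char **argv)
  {
  	cpu_set_t mask;
  	pid_t pid;
  	int cpu;

  	if (argc < 2) {
  		fprintf(stderr, "usage: %s <pid> [reset]\n", argv[0]);
  		return 1;
  	}
  	pid = (pid_t)atoi(argv[1]);

  	CPU_ZERO(&mask);
  	if (argc > 2) {
  		/* reset: allow every cpu again (bits for absent cpus are ignored) */
  		for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
  			CPU_SET(cpu, &mask);
  	} else {
  		/* force: confine the task to cpu0 */
  		CPU_SET(0, &mask);
  	}

  	return sched_setaffinity(pid, sizeof(mask), &mask) ? 1 : 0;
  }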

v2.6.36-rc7:

Cpu(s): 34.3% us,  0.2% sy,  0.0% ni, 65.1% id,  0.4% wa,  0.0% hi,  0.0% si
Mem:  16463308k total,   996368k used, 15466940k free,    12304k buffers
Swap:        0k total,        0k used,        0k free,   756244k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7651 root       5 -15  5876   84    0 R   98  0.0  37:35.97 soaker
 7652 root      20   0  5876   84    0 R   49  0.0  19:49.02 soaker
 7654 root      20   0  5876   84    0 R   49  0.0  20:48.93 soaker
 7655 root      20   0  5876   84    0 R   49  0.0  19:25.74 soaker
 7653 root      20   0  5876   84    0 R   47  0.0  20:02.16 soaker

v2.6.36-rc7 + patchset:

Cpu(s): 49.7% us,  0.2% sy,  0.0% ni, 50.2% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  16463308k total,  1011248k used, 15452060k free,    10076k buffers
Swap:        0k total,        0k used,        0k free,   766388k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7645 root       5 -15  5876   88    0 R  100  0.0  43:38.76 soaker
 7646 root      20   0  5876   88    0 R   99  0.0  33:15.25 soaker
 7648 root      20   0  5876   88    0 R   75  0.0  36:57.02 soaker
 7647 root      20   0  5876   88    0 R   67  0.0  29:12.97 soaker
 7649 root      20   0  5876   88    0 R   54  0.0  29:28.35 soaker

Comments, feedback welcome.

-Thanks,
Nikhil

Nikhil Rao (4):
  sched: do not consider SCHED_IDLE tasks to be cache hot
  sched: set group_imb only a task can be pulled from the busiest cpu
  sched: drop group_capacity to 1 only if local group has extra
    capacity
  sched: force balancing on newidle balance if local group has capacity

 kernel/sched.c      |    3 +++
 kernel/sched_fair.c |   48 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 43 insertions(+), 8 deletions(-)


* [PATCH 1/4] sched: do not consider SCHED_IDLE tasks to be cache hot
  2010-10-13 19:09 [PATCH 0/4][RFC v2] Improve load balancing when tasks have large weight differential Nikhil Rao
@ 2010-10-13 19:09 ` Nikhil Rao
  2010-10-13 19:09 ` [PATCH 2/4] sched: set group_imb only a task can be pulled from the busiest cpu Nikhil Rao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Nikhil Rao @ 2010-10-13 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
	Venkatesh Pallipadi
  Cc: linux-kernel, Nikhil Rao

This patch adds a check in task_hot() to return 0 (not cache hot) if the task
has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run with
regular workloads, are typically scheduled many milliseconds apart. There is no
need to consider these tasks hot for load balancing.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..6a9bdeb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2003,6 +2003,9 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	if (p->sched_class != &fair_sched_class)
 		return 0;
 
+	if (p->policy == SCHED_IDLE)
+		return 0;
+
 	/*
 	 * Buddy candidates are cache hot:
 	 */
-- 
1.7.1


* [PATCH 2/4] sched: set group_imb only a task can be pulled from the busiest cpu
  2010-10-13 19:09 [PATCH 0/4][RFC v2] Improve load balancing when tasks have large weight differential Nikhil Rao
  2010-10-13 19:09 ` [PATCH 1/4] sched: do not consider SCHED_IDLE tasks to be cache hot Nikhil Rao
@ 2010-10-13 19:09 ` Nikhil Rao
  2010-10-18 19:23   ` [tip:sched/core] sched: Set " tip-bot for Nikhil Rao
  2010-10-13 19:09 ` [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity Nikhil Rao
  2010-10-13 19:09 ` [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity Nikhil Rao
  3 siblings, 1 reply; 21+ messages in thread
From: Nikhil Rao @ 2010-10-13 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
	Venkatesh Pallipadi
  Cc: linux-kernel, Nikhil Rao

When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than one runnable task. This patch
fixes the case where two cpus in a group each have one runnable task, but there
is a large weight differential between these two tasks. The load balancer is
unable to migrate any task from this group, and hence should not consider the
group to be imbalanced.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..0dd1021 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2378,7 +2378,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			int local_group, const struct cpumask *cpus,
 			int *balance, struct sg_lb_stats *sgs)
 {
-	unsigned long load, max_cpu_load, min_cpu_load;
+	unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
 	int i;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
@@ -2389,6 +2389,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
+	max_nr_running = 0;
 
 	for_each_cpu_and(i, sched_group_cpus(group), cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -2406,8 +2407,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
-			if (load > max_cpu_load)
+			if (load > max_cpu_load) {
 				max_cpu_load = load;
+				max_nr_running = rq->nr_running;
+			}
 			if (min_cpu_load > load)
 				min_cpu_load = load;
 		}
@@ -2447,7 +2450,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if (sgs->sum_nr_running)
 		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
+	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task &&
+			max_nr_running > 1)
 		sgs->group_imb = 1;
 
 	sgs->group_capacity =
-- 
1.7.1


* [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-13 19:09 [PATCH 0/4][RFC v2] Improve load balancing when tasks have large weight differential Nikhil Rao
  2010-10-13 19:09 ` [PATCH 1/4] sched: do not consider SCHED_IDLE tasks to be cache hot Nikhil Rao
  2010-10-13 19:09 ` [PATCH 2/4] sched: set group_imb only a task can be pulled from the busiest cpu Nikhil Rao
@ 2010-10-13 19:09 ` Nikhil Rao
  2010-10-14  5:48   ` Nikhil Rao
  2010-10-13 19:09 ` [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity Nikhil Rao
  3 siblings, 1 reply; 21+ messages in thread
From: Nikhil Rao @ 2010-10-13 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
	Venkatesh Pallipadi
  Cc: linux-kernel, Nikhil Rao

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. For niced task balancing, we pull
low weight tasks away from a sched group as long as there is capacity in other
groups. When all other groups are saturated, we do not drop the capacity of the
niced group down to 1. This prevents active balance from kicking out the low
weight threads, which would hurt system utilization.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0dd1021..2f38b8a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2030,6 +2030,7 @@ struct sd_lb_stats {
 	unsigned long this_load;
 	unsigned long this_load_per_task;
 	unsigned long this_nr_running;
+	unsigned long this_group_capacity;
 
 	/* Statistics of the busiest group */
 	unsigned long max_load;
@@ -2546,15 +2547,18 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 		/*
 		 * In case the child domain prefers tasks go to siblings
 		 * first, lower the sg capacity to one so that we'll try
-		 * and move all the excess tasks away.
+		 * and move all the excess tasks away. We lower capacity only
+		 * if the local group can handle the extra capacity.
 		 */
-		if (prefer_sibling)
+		if (prefer_sibling && !local_group &&
++                   sds->this_nr_running < sds->this_group_capacity)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
 			sds->this = sg;
 			sds->this_nr_running = sgs.sum_nr_running;
+			sds->this_group_capacity = sgs.group_capacity;
 			sds->this_load_per_task = sgs.sum_weighted_load;
 		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
 			sds->max_load = sgs.avg_load;
-- 
1.7.1


* [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-13 19:09 [PATCH 0/4][RFC v2] Improve load balancing when tasks have large weight differential Nikhil Rao
                   ` (2 preceding siblings ...)
  2010-10-13 19:09 ` [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity Nikhil Rao
@ 2010-10-13 19:09 ` Nikhil Rao
  2010-10-15 12:06   ` Peter Zijlstra
  2010-10-15 12:08   ` Peter Zijlstra
  3 siblings, 2 replies; 21+ messages in thread
From: Nikhil Rao @ 2010-10-13 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
	Venkatesh Pallipadi
  Cc: linux-kernel, Nikhil Rao

This patch forces a load balance on a newly idle cpu when the local group has
extra capacity and the busiest group does not have any. It improves system
utilization when balancing tasks with a large weight differential.

Under certain situations, such as a niced-down task (e.g. nice -15) in the
presence of nr_cpus nice 0 tasks, the niced task lands on a sched group and
kicks away other tasks because of its large weight. This leads to sub-optimal
utilization of the machine. Even though the sched group has capacity, it does
not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.

With this patch, if the local group has extra capacity, we shortcut the checks
in f_b_g() and try to pull a task over. A sched group has extra capacity if the
group capacity is greater than the number of running tasks in that group.

Thanks to Mike Galbraith for discussions leading to this patch and for the
insight to reuse SD_BALANCE_NEWIDLE.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   30 +++++++++++++++++++++++++++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 2f38b8a..202fa25 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1764,6 +1764,10 @@ static void pull_task(struct rq *src_rq, struct task_struct *p,
 	set_task_cpu(p, this_cpu);
 	activate_task(this_rq, p, 0);
 	check_preempt_curr(this_rq, p, 0);
+
+	/* re-arm NEWIDLE balancing when moving tasks */
+	src_rq->avg_idle = this_rq->avg_idle = 2*sysctl_sched_migration_cost;
+	this_rq->idle_stamp = 0;
 }
 
 /*
@@ -2031,12 +2035,14 @@ struct sd_lb_stats {
 	unsigned long this_load_per_task;
 	unsigned long this_nr_running;
 	unsigned long this_group_capacity;
+	unsigned long this_has_capacity;
 
 	/* Statistics of the busiest group */
 	unsigned long max_load;
 	unsigned long busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
+	unsigned long busiest_has_capacity;
 
 	int group_imb; /* Is there imbalance in this sd */
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2059,6 +2065,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long group_capacity;
 	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
 };
 
 /**
@@ -2459,6 +2466,9 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
+
+	if (sgs->group_capacity > sgs->sum_nr_running)
+		sgs->group_has_capacity = 1;
 }
 
 /**
@@ -2560,12 +2570,14 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 			sds->this_nr_running = sgs.sum_nr_running;
 			sds->this_group_capacity = sgs.group_capacity;
 			sds->this_load_per_task = sgs.sum_weighted_load;
+			sds->this_has_capacity = sgs.group_has_capacity;
 		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
 			sds->max_load = sgs.avg_load;
 			sds->busiest = sg;
 			sds->busiest_nr_running = sgs.sum_nr_running;
 			sds->busiest_group_capacity = sgs.group_capacity;
 			sds->busiest_load_per_task = sgs.sum_weighted_load;
+			sds->busiest_has_capacity = sgs.group_has_capacity;
 			sds->group_imb = sgs.group_imb;
 		}
 
@@ -2762,6 +2774,15 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
 		return fix_small_imbalance(sds, this_cpu, imbalance);
 
 }
+
+bool check_utilization(struct sd_lb_stats *sds)
+{
+	if (!sds->this_has_capacity || sds->busiest_has_capacity)
+		return false;
+
+	return true;
+}
+
 /******* find_busiest_group() helpers end here *********************/
 
 /**
@@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;
 
+	/*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
+	if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
+		goto force_balance;
+
 	if (sds.this_load >= sds.max_load)
 		goto out_balanced;
 
@@ -2835,6 +2860,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
 		goto out_balanced;
 
+force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(&sds, this_cpu, imbalance);
 	return sds.busiest;
@@ -3161,10 +3187,8 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
-			this_rq->idle_stamp = 0;
+		if (pulled_task)
 			break;
-		}
 	}
 
 	raw_spin_lock(&this_rq->lock);
-- 
1.7.1


* [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-13 19:09 ` [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity Nikhil Rao
@ 2010-10-14  5:48   ` Nikhil Rao
  2010-10-14 23:42     ` Suresh Siddha
  2010-10-15 11:50     ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Nikhil Rao @ 2010-10-14  5:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
	Venkatesh Pallipadi
  Cc: linux-kernel, Nikhil Rao

Resending this patch since the original patch was munged. Thanks to Mike
Galbraith for pointing this out.

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. For niced task balancing, we pull
low weight tasks away from a sched group as long as there is capacity in other
groups. When all other groups are saturated, we do not drop the capacity of the
niced group down to 1. This prevents active balance from kicking out the low
weight threads, which would hurt system utilization.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0dd1021..da0c688 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2030,6 +2030,7 @@ struct sd_lb_stats {
 	unsigned long this_load;
 	unsigned long this_load_per_task;
 	unsigned long this_nr_running;
+	unsigned long this_group_capacity;
 
 	/* Statistics of the busiest group */
 	unsigned long max_load;
@@ -2546,15 +2547,18 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 		/*
 		 * In case the child domain prefers tasks go to siblings
 		 * first, lower the sg capacity to one so that we'll try
-		 * and move all the excess tasks away.
+		 * and move all the excess tasks away. We lower capacity only
+		 * if the local group can handle the extra capacity.
 		 */
-		if (prefer_sibling)
+		if (prefer_sibling && !local_group &&
+		    sds->this_nr_running < sds->this_group_capacity)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
 			sds->this = sg;
 			sds->this_nr_running = sgs.sum_nr_running;
+			sds->this_group_capacity = sgs.group_capacity;
 			sds->this_load_per_task = sgs.sum_weighted_load;
 		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
 			sds->max_load = sgs.avg_load;
-- 
1.7.1


* Re: [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-14  5:48   ` Nikhil Rao
@ 2010-10-14 23:42     ` Suresh Siddha
  2010-10-15 11:50     ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Suresh Siddha @ 2010-10-14 23:42 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Venkatesh Pallipadi,
	linux-kernel

On Wed, 2010-10-13 at 22:48 -0700, Nikhil Rao wrote:
> Resending this patch since the original patch was munged. Thanks to Mike
> Galbraith for pointing this out.
> 
> When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
> only if the local group has extra capacity. For niced task balancing, we pull
> low weight tasks away from a sched group as long as there is capacity in other
> groups. When all other groups are saturated, we do not drop capacity of the
> niced group down to 1. This prevents active balance from kicking out the low
> weight threads and which hurts system utilization.
> 
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> ---
>  kernel/sched_fair.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 0dd1021..da0c688 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2030,6 +2030,7 @@ struct sd_lb_stats {
>  	unsigned long this_load;
>  	unsigned long this_load_per_task;
>  	unsigned long this_nr_running;
> +	unsigned long this_group_capacity;
>  
>  	/* Statistics of the busiest group */
>  	unsigned long max_load;
> @@ -2546,15 +2547,18 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
>  		/*
>  		 * In case the child domain prefers tasks go to siblings
>  		 * first, lower the sg capacity to one so that we'll try
> -		 * and move all the excess tasks away.
> +		 * and move all the excess tasks away. We lower capacity only
> +		 * if the local group can handle the extra capacity.
>  		 */
> -		if (prefer_sibling)
> +		if (prefer_sibling && !local_group &&
> +		    sds->this_nr_running < sds->this_group_capacity)
>  			sgs.group_capacity = min(sgs.group_capacity, 1UL);

Yes Nikhil. This should solve my earlier concern of SMT/MC idle
balancing case.

Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>

>  
>  		if (local_group) {
>  			sds->this_load = sgs.avg_load;
>  			sds->this = sg;
>  			sds->this_nr_running = sgs.sum_nr_running;
> +			sds->this_group_capacity = sgs.group_capacity;
>  			sds->this_load_per_task = sgs.sum_weighted_load;
>  		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
>  			sds->max_load = sgs.avg_load;


* Re: [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-14  5:48   ` Nikhil Rao
  2010-10-14 23:42     ` Suresh Siddha
@ 2010-10-15 11:50     ` Peter Zijlstra
  2010-10-15 16:13       ` Nikhil Rao
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-10-15 11:50 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Wed, 2010-10-13 at 22:48 -0700, Nikhil Rao wrote:
> Resending this patch since the original patch was munged. Thanks to Mike
> Galbraith for pointing this out.
> 
> When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
> only if the local group has extra capacity. For niced task balancing, we pull
> low weight tasks away from a sched group as long as there is capacity in other
> groups. When all other groups are saturated, we do not drop capacity of the
> niced group down to 1. This prevents active balance from kicking out the low
> weight threads and which hurts system utilization.
> 
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> ---
>  kernel/sched_fair.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 0dd1021..da0c688 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2030,6 +2030,7 @@ struct sd_lb_stats {
>  	unsigned long this_load;
>  	unsigned long this_load_per_task;
>  	unsigned long this_nr_running;
> +	unsigned long this_group_capacity;
>  
>  	/* Statistics of the busiest group */
>  	unsigned long max_load;
> @@ -2546,15 +2547,18 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
>  		/*
>  		 * In case the child domain prefers tasks go to siblings
>  		 * first, lower the sg capacity to one so that we'll try
> -		 * and move all the excess tasks away.
> +		 * and move all the excess tasks away. We lower capacity only
> +		 * if the local group can handle the extra capacity.
>  		 */
> -		if (prefer_sibling)
> +		if (prefer_sibling && !local_group &&
> +		    sds->this_nr_running < sds->this_group_capacity)
>  			sgs.group_capacity = min(sgs.group_capacity, 1UL);
>  
>  		if (local_group) {
>  			sds->this_load = sgs.avg_load;
>  			sds->this = sg;
>  			sds->this_nr_running = sgs.sum_nr_running;
> +			sds->this_group_capacity = sgs.group_capacity;
>  			sds->this_load_per_task = sgs.sum_weighted_load;
>  		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
>  			sds->max_load = sgs.avg_load;


OK, this thing confuses me; neither the changelog nor the comment actually
seems to help with understanding this..



* Re: [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-13 19:09 ` [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity Nikhil Rao
@ 2010-10-15 12:06   ` Peter Zijlstra
  2010-10-15 12:18     ` Mike Galbraith
  2010-10-15 12:08   ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-10-15 12:06 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Wed, 2010-10-13 at 12:09 -0700, Nikhil Rao wrote:
> +bool check_utilization(struct sd_lb_stats *sds)
> +{
> +       if (!sds->this_has_capacity || sds->busiest_has_capacity)
> +               return false;
> +
> +       return true;
> +}
> +
>  /******* find_busiest_group() helpers end here *********************/
>  
>  /**
> @@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
>         if (!sds.busiest || sds.busiest_nr_running == 0)
>                 goto out_balanced;
>  
> +       /*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> +       if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
> +               goto force_balance; 


Is that really worth an extra function? Also, the name isn't really
helpful; the comment suggests it should be called something like
is_under_utilized().

Hmm?

* Re: [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-13 19:09 ` [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity Nikhil Rao
  2010-10-15 12:06   ` Peter Zijlstra
@ 2010-10-15 12:08   ` Peter Zijlstra
  2010-10-15 16:20     ` Nikhil Rao
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-10-15 12:08 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Wed, 2010-10-13 at 12:09 -0700, Nikhil Rao wrote:
> @@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int
> this_cpu,
>         if (!sds.busiest || sds.busiest_nr_running == 0)
>                 goto out_balanced;
>  
> +       /*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> +       if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
> +               goto force_balance;


There's a large comment a few lines up from here that tries to explain
all these funny reasons for not balancing, that wants updating too.



* Re: [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-15 12:06   ` Peter Zijlstra
@ 2010-10-15 12:18     ` Mike Galbraith
  2010-10-15 12:20       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Galbraith @ 2010-10-15 12:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikhil Rao, Ingo Molnar, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, 2010-10-15 at 14:06 +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-13 at 12:09 -0700, Nikhil Rao wrote:
> > +bool check_utilization(struct sd_lb_stats *sds)
> > +{
> > +       if (!sds->this_has_capacity || sds->busiest_has_capacity)
> > +               return false;
> > +
> > +       return true;
> > +}
> > +
> >  /******* find_busiest_group() helpers end here *********************/
> >  
> >  /**
> > @@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> >         if (!sds.busiest || sds.busiest_nr_running == 0)
> >                 goto out_balanced;
> >  
> > +       /*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> > +       if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
> > +               goto force_balance; 
> 
> 
> Is that really worth an extra function?

(I did that)

No, it just made it look prettier to me.  I figured the compiler would
nuke it at zero cost.

	-Mike


* Re: [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-15 12:18     ` Mike Galbraith
@ 2010-10-15 12:20       ` Peter Zijlstra
  2010-10-15 12:35         ` Mike Galbraith
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-10-15 12:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Nikhil Rao, Ingo Molnar, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, 2010-10-15 at 14:18 +0200, Mike Galbraith wrote:
> On Fri, 2010-10-15 at 14:06 +0200, Peter Zijlstra wrote:
> > On Wed, 2010-10-13 at 12:09 -0700, Nikhil Rao wrote:
> > > +bool check_utilization(struct sd_lb_stats *sds)
> > > +{
> > > +       if (!sds->this_has_capacity || sds->busiest_has_capacity)
> > > +               return false;
> > > +
> > > +       return true;
> > > +}
> > > +
> > >  /******* find_busiest_group() helpers end here *********************/
> > >  
> > >  /**
> > > @@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> > >         if (!sds.busiest || sds.busiest_nr_running == 0)
> > >                 goto out_balanced;
> > >  
> > > +       /*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> > > +       if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
> > > +               goto force_balance; 
> > 
> > 
> > Is that really worth an extra function?
> 
> (I did that)
> 
> No, just it made it look prettier to me.  I figured the compiler will
> nuke it at zero cost.

Sure.. but it does raise the whole naming/confusion angle ;-)

* Re: [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-15 12:20       ` Peter Zijlstra
@ 2010-10-15 12:35         ` Mike Galbraith
  2010-10-15 16:19           ` Nikhil Rao
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Galbraith @ 2010-10-15 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikhil Rao, Ingo Molnar, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, 2010-10-15 at 14:20 +0200, Peter Zijlstra wrote:
> On Fri, 2010-10-15 at 14:18 +0200, Mike Galbraith wrote:
> > On Fri, 2010-10-15 at 14:06 +0200, Peter Zijlstra wrote:
> > > On Wed, 2010-10-13 at 12:09 -0700, Nikhil Rao wrote:
> > > > +bool check_utilization(struct sd_lb_stats *sds)
> > > > +{
> > > > +       if (!sds->this_has_capacity || sds->busiest_has_capacity)
> > > > +               return false;
> > > > +
> > > > +       return true;
> > > > +}
> > > > +
> > > >  /******* find_busiest_group() helpers end here *********************/
> > > >  
> > > >  /**
> > > > @@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> > > >         if (!sds.busiest || sds.busiest_nr_running == 0)
> > > >                 goto out_balanced;
> > > >  
> > > > +       /*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> > > > +       if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
> > > > +               goto force_balance; 
> > > 
> > > 
> > > Is that really worth an extra function?
> > 
> > (I did that)
> > 
> > No, just it made it look prettier to me.  I figured the compiler will
> > nuke it at zero cost.
> 
> Sure.. but it does raise the whole naming/confusion angle ;-)

is_under_utilized() works for me.

(as does && this && that or cpu_should_get_off_lazy_butt():)

	-Mike


* Re: [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-15 11:50     ` Peter Zijlstra
@ 2010-10-15 16:13       ` Nikhil Rao
  2010-10-15 17:05         ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Nikhil Rao @ 2010-10-15 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, Oct 15, 2010 at 4:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-10-13 at 22:48 -0700, Nikhil Rao wrote:
>> Resending this patch since the original patch was munged. Thanks to Mike
>> Galbraith for pointing this out.
>>
>> When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
>> only if the local group has extra capacity. For niced task balancing, we pull
>> low weight tasks away from a sched group as long as there is capacity in other
>> groups. When all other groups are saturated, we do not drop capacity of the
>> niced group down to 1. This prevents active balance from kicking out the low
>> weight threads and which hurts system utilization.
>>
>> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> ---
>>  kernel/sched_fair.c |    8 ++++++--
>>  1 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index 0dd1021..da0c688 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -2030,6 +2030,7 @@ struct sd_lb_stats {
>>       unsigned long this_load;
>>       unsigned long this_load_per_task;
>>       unsigned long this_nr_running;
>> +     unsigned long this_group_capacity;
>>
>>       /* Statistics of the busiest group */
>>       unsigned long max_load;
>> @@ -2546,15 +2547,18 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
>>               /*
>>                * In case the child domain prefers tasks go to siblings
>>                * first, lower the sg capacity to one so that we'll try
>> -              * and move all the excess tasks away.
>> +              * and move all the excess tasks away. We lower capacity only
>> +              * if the local group can handle the extra capacity.
>>                */
>> -             if (prefer_sibling)
>> +             if (prefer_sibling && !local_group &&
>> +                 sds->this_nr_running < sds->this_group_capacity)
>>                       sgs.group_capacity = min(sgs.group_capacity, 1UL);
>>
>>               if (local_group) {
>>                       sds->this_load = sgs.avg_load;
>>                       sds->this = sg;
>>                       sds->this_nr_running = sgs.sum_nr_running;
>> +                     sds->this_group_capacity = sgs.group_capacity;
>>                       sds->this_load_per_task = sgs.sum_weighted_load;
>>               } else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
>>                       sds->max_load = sgs.avg_load;
>
>
> OK, this thing confuses me, that changelog nor the comment actually seem
> to help with understanding this..
>

Here's some context for this patch. Consider a 16-cpu quad-core
quad-socket machine with MC and NUMA scheduling domains. Let's say we
spawn 15 nice0 tasks and one nice-15 task, and each task is running on
one core. In this case, we observe the following events when balancing
at the NUMA domain:
- find_busiest_group() will always pick the sched group containing the
niced task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running
the nice0 task (never picks the cpu with the nice -15 task since
weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running
task and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until
sd->nr_balance_failed > 5, at which point it kicks off the active load
balancer, wakes up the migration thread and kicks the nice 0 task off
the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.

The problem is that, when balancing at the NUMA domain, we always drop
sgs.group_capacity to 1 if the child domain (in this case MC) has
SD_PREFER_SIBLING set.  Once we drop sgs.group_capacity, the
subsequent load comparisons are kinda irrelevant because the niced
task has so much weight. In this patch, we add an extra condition to
the "if (prefer_sibling)" check in update_sd_lb_stats(). We drop the
capacity of a group only if the local group has extra capacity (i.e.
if local_group->sum_nr_running < local_group->group_capacity). If I
understand correctly, the original intent of the prefer_sibling check
was to spread tasks across the system in low utilization scenarios.
This patch preserves that intent but also fixes the case above.
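
To put rough numbers on "so much weight" (using the stock prio_to_weight[]
values from kernel/sched.c, so this is only an illustration): nice -15 maps
to a load weight of 29154 versus 1024 for nice 0, roughly 28:1. In the
scenario above, the quad-core group holding the niced task carries about
29154 + 3*1024 = 32226 of weight, while a fully busy group of four nice 0
tasks carries only 4096. So once its capacity is clamped to 1, the niced
group always looks overloaded regardless of how few runnable tasks it has.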

It helps in the following ways:
- In low utilization cases (where nr_tasks << nr_cpus), we still drop
group_capacity down to 1 if we prefer siblings -- nothing changes.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running
will most likely be > sgs.group_capacity, so again -- nothing changes.
- When balancing large weight tasks, if the local group does not have
extra capacity, we do not pick the group with the niced task as the
busiest group. This prevents failed balances, active migration and the
under-utilization described above.

Hope that clarifies the intent of this patch. I will refresh this
patch with a better changelog and comment.

-Thanks,
Nikhil

* Re: [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-15 12:35         ` Mike Galbraith
@ 2010-10-15 16:19           ` Nikhil Rao
  0 siblings, 0 replies; 21+ messages in thread
From: Nikhil Rao @ 2010-10-15 16:19 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, Oct 15, 2010 at 5:35 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Fri, 2010-10-15 at 14:20 +0200, Peter Zijlstra wrote:
>> On Fri, 2010-10-15 at 14:18 +0200, Mike Galbraith wrote:
>> > On Fri, 2010-10-15 at 14:06 +0200, Peter Zijlstra wrote:
>> > > On Wed, 2010-10-13 at 12:09 -0700, Nikhil Rao wrote:
>> > > > +bool check_utilization(struct sd_lb_stats *sds)
>> > > > +{
>> > > > +       if (!sds->this_has_capacity || sds->busiest_has_capacity)
>> > > > +               return false;
>> > > > +
>> > > > +       return true;
>> > > > +}
>> > > > +
>> > > >  /******* find_busiest_group() helpers end here *********************/
>> > > >
>> > > >  /**
>> > > > @@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
>> > > >         if (!sds.busiest || sds.busiest_nr_running == 0)
>> > > >                 goto out_balanced;
>> > > >
>> > > > +       /*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
>> > > > +       if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
>> > > > +               goto force_balance;
>> > >
>> > >
>> > > Is that really worth an extra function?
>> >
>> > (I did that)
>> >
>> > No, just it made it look prettier to me.  I figured the compiler will
>> > nuke it at zero cost.
>>
>> Sure.. but it does raise the whole naming/confusion angle ;-)
>
> is_under_utilized() works for me.
>
> (as does && this && that or cpu_should_get_off_lazy_butt():)
>

Let's go with the latter for now. If that condition gets more
complicated, then we can factor it out into a different function, like
is_under_utilized().

I'll update the patch and send it out.

-Thanks
Nikhil

* Re: [PATCH 4/4] sched: force balancing on newidle balance if local group has capacity
  2010-10-15 12:08   ` Peter Zijlstra
@ 2010-10-15 16:20     ` Nikhil Rao
  0 siblings, 0 replies; 21+ messages in thread
From: Nikhil Rao @ 2010-10-15 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, Oct 15, 2010 at 5:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-10-13 at 12:09 -0700, Nikhil Rao wrote:
>> @@ -2824,6 +2845,10 @@ find_busiest_group(struct sched_domain *sd, int
>> this_cpu,
>>         if (!sds.busiest || sds.busiest_nr_running == 0)
>>                 goto out_balanced;
>>
>> +       /*  SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
>> +       if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
>> +               goto force_balance;
>
>
> There's a large comment a few lines up from here that tries to explain
> all these funny reasons for not balancing, that wants updating too.
>

Will do.

* Re: [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-15 16:13       ` Nikhil Rao
@ 2010-10-15 17:05         ` Peter Zijlstra
  2010-10-15 17:13           ` Suresh Siddha
  2010-10-15 17:27           ` Nikhil Rao
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-10-15 17:05 UTC (permalink / raw)
  To: Nikhil Rao
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, 2010-10-15 at 09:13 -0700, Nikhil Rao wrote:

> >> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> >> index 0dd1021..da0c688 100644
> >> --- a/kernel/sched_fair.c
> >> +++ b/kernel/sched_fair.c
> >> @@ -2030,6 +2030,7 @@ struct sd_lb_stats {
> >>       unsigned long this_load;
> >>       unsigned long this_load_per_task;
> >>       unsigned long this_nr_running;
> >> +     unsigned long this_group_capacity;
> >>
> >>       /* Statistics of the busiest group */
> >>       unsigned long max_load;
> >> @@ -2546,15 +2547,18 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> >>               /*
> >>                * In case the child domain prefers tasks go to siblings
> >>                * first, lower the sg capacity to one so that we'll try
> >> -              * and move all the excess tasks away.
> >> +              * and move all the excess tasks away. We lower capacity only
> >> +              * if the local group can handle the extra capacity.
> >>                */
> >> -             if (prefer_sibling)
> >> +             if (prefer_sibling && !local_group &&
> >> +                 sds->this_nr_running < sds->this_group_capacity)
> >>                       sgs.group_capacity = min(sgs.group_capacity, 1UL);
> >>
> >>               if (local_group) {
> >>                       sds->this_load = sgs.avg_load;
> >>                       sds->this = sg;
> >>                       sds->this_nr_running = sgs.sum_nr_running;
> >> +                     sds->this_group_capacity = sgs.group_capacity;
> >>                       sds->this_load_per_task = sgs.sum_weighted_load;
> >>               } else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
> >>                       sds->max_load = sgs.avg_load;

OK, but then you assume that local_group will always be the first group
served. There is also no real need for adding sds->this_group_capacity;
you could keep that local to this function.

For regular balancing local_group will be the first, since we only
ascend the domain tree on the local groups. But it's not true for no_hz
balancing afaict.



* Re: [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-15 17:05         ` Peter Zijlstra
@ 2010-10-15 17:13           ` Suresh Siddha
  2010-10-15 17:24             ` Peter Zijlstra
  2010-10-15 17:27           ` Nikhil Rao
  1 sibling, 1 reply; 21+ messages in thread
From: Suresh Siddha @ 2010-10-15 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikhil Rao, Ingo Molnar, Mike Galbraith, Venkatesh Pallipadi,
	linux-kernel

On Fri, 2010-10-15 at 10:05 -0700, Peter Zijlstra wrote:
> For regular balancing local_group will be the first, since we only
> ascend the domain tree on the local groups. But its not true for no_hz
> balancing afaikt.

Even for NOHZ, we always ascend each cpu's sched domain and the local
group is the first one always. But yes, we are depending on the local
group being the first group.

thanks,
suresh


* Re: [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-15 17:13           ` Suresh Siddha
@ 2010-10-15 17:24             ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-10-15 17:24 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Nikhil Rao, Ingo Molnar, Mike Galbraith, Venkatesh Pallipadi,
	linux-kernel

On Fri, 2010-10-15 at 10:13 -0700, Suresh Siddha wrote:
> On Fri, 2010-10-15 at 10:05 -0700, Peter Zijlstra wrote:
> > For regular balancing local_group will be the first, since we only
> > ascend the domain tree on the local groups. But its not true for no_hz
> > balancing afaikt.
> 
> Even for NOHZ, we always ascend each cpu's sched domain and the local
> group is the first one always. But yes, we are depending on the local
> group being the first group.

Ah, yes, we take the balance_cpu's domain tree, not the local cpu's
domain tree.

Hrm,.. ok feels slightly tricky though.

* Re: [PATCH 3/4] sched: drop group_capacity to 1 only if local group has extra capacity
  2010-10-15 17:05         ` Peter Zijlstra
  2010-10-15 17:13           ` Suresh Siddha
@ 2010-10-15 17:27           ` Nikhil Rao
  1 sibling, 0 replies; 21+ messages in thread
From: Nikhil Rao @ 2010-10-15 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
	linux-kernel

On Fri, Oct 15, 2010 at 10:05 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-10-15 at 09:13 -0700, Nikhil Rao wrote:
>
>> >> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> >> index 0dd1021..da0c688 100644
>> >> --- a/kernel/sched_fair.c
>> >> +++ b/kernel/sched_fair.c
>> >> @@ -2030,6 +2030,7 @@ struct sd_lb_stats {
>> >>       unsigned long this_load;
>> >>       unsigned long this_load_per_task;
>> >>       unsigned long this_nr_running;
>> >> +     unsigned long this_group_capacity;
>> >>
>> >>       /* Statistics of the busiest group */
>> >>       unsigned long max_load;
>> >> @@ -2546,15 +2547,18 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
>> >>               /*
>> >>                * In case the child domain prefers tasks go to siblings
>> >>                * first, lower the sg capacity to one so that we'll try
>> >> -              * and move all the excess tasks away.
>> >> +              * and move all the excess tasks away. We lower capacity only
>> >> +              * if the local group can handle the extra capacity.
>> >>                */
>> >> -             if (prefer_sibling)
>> >> +             if (prefer_sibling && !local_group &&
>> >> +                 sds->this_nr_running < sds->this_group_capacity)
>> >>                       sgs.group_capacity = min(sgs.group_capacity, 1UL);
>> >>
>> >>               if (local_group) {
>> >>                       sds->this_load = sgs.avg_load;
>> >>                       sds->this = sg;
>> >>                       sds->this_nr_running = sgs.sum_nr_running;
>> >> +                     sds->this_group_capacity = sgs.group_capacity;
>> >>                       sds->this_load_per_task = sgs.sum_weighted_load;
>> >>               } else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
>> >>                       sds->max_load = sgs.avg_load;
>
> OK, but then you assume that local_group will always be the first group
> served, nor is there any purpose for adding sds->this_group_capacity,
> you could keep that local to this function.
>

Yes, this patch makes the assumption that local_group is the first.

About this_group_capacity, yes -- we don't need the additional field in
sd_lb_stats. We can make it local to this function. I just realized that if
we re-order the patches, we can reuse sgs.group_has_capacity from the next patch.

> For regular balancing local_group will be the first, since we only
> ascend the domain tree on the local groups. But its not true for no_hz
> balancing afaikt.
>

As Suresh points out, even with NOHZ, the local_group is the first group
since we ascend the per-cpu sched domain. I can add this into the comments
to make it clear.

* [tip:sched/core] sched: Set group_imb only a task can be pulled from the busiest cpu
  2010-10-13 19:09 ` [PATCH 2/4] sched: set group_imb only a task can be pulled from the busiest cpu Nikhil Rao
@ 2010-10-18 19:23   ` tip-bot for Nikhil Rao
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Nikhil Rao @ 2010-10-18 19:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, ncrao, tglx, mingo

Commit-ID:  2582f0eba54066b5e98ff2b27ef0cfa833b59f54
Gitweb:     http://git.kernel.org/tip/2582f0eba54066b5e98ff2b27ef0cfa833b59f54
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Wed, 13 Oct 2010 12:09:36 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:17 +0200

sched: Set group_imb only a task can be pulled from the busiest cpu

When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than 1 runnable task. This patch
fixes the case where two cpus in a group have one runnable task each, but there
is a large weight differential between these two tasks. The load balancer is
unable to migrate any task from this group, and hence do not consider this
group to be imbalanced.

Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
[ small code readability edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bf87192..3656480 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2378,7 +2378,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			int local_group, const struct cpumask *cpus,
 			int *balance, struct sg_lb_stats *sgs)
 {
-	unsigned long load, max_cpu_load, min_cpu_load;
+	unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
 	int i;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
@@ -2389,6 +2389,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
+	max_nr_running = 0;
 
 	for_each_cpu_and(i, sched_group_cpus(group), cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -2406,8 +2407,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
-			if (load > max_cpu_load)
+			if (load > max_cpu_load) {
 				max_cpu_load = load;
+				max_nr_running = rq->nr_running;
+			}
 			if (min_cpu_load > load)
 				min_cpu_load = load;
 		}
@@ -2447,11 +2450,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if (sgs->sum_nr_running)
 		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
+	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity =
-		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
 }
