linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] Various load-balance cleanups/optimizations
@ 2013-08-16 10:12 Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 1/6] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:12 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Peter Zijlstra

Hi,

The first three are reworks of the patches Joonsoo posted, with fixes and
changes; the second three are further cleanups that I spotted while working
on Joonsoo's patches.

There was enough wrong with Joonsoo's patches that I'm somewhat hesitant to
merge them right away, so I'd ask people to stare at them in detail to make
sure there's no other weird stuff I've missed.

Thanks.




* [PATCH 1/6] sched: Remove one division operation in find_busiest_queue()
  2013-08-16 10:12 [PATCH 0/6] Various load-balance cleanups/optimizations Peter Zijlstra
@ 2013-08-16 10:12 ` Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 2/6] sched: Factor out code to should_we_balance() Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:12 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: joonsoo_kim-sched-remove_one_division_operation_in_find_buiest_queue.patch --]
[-- Type: text/plain, Size: 1165 bytes --]

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Remove one division operation in find_busiest_queue() by using
crosswise multiplication:

	wl_i / power_i > wl_j / power_j  <=>
	wl_i * power_j > wl_j * power_i

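As a standalone illustration (sketch only, not part of the patch; the
helper below is hypothetical), the rewritten comparison amounts to:

	/*
	 * True iff wl_i / power_i > wl_j / power_j, assuming both powers are
	 * positive and the products do not overflow unsigned long.
	 */
	static inline int load_ratio_greater(unsigned long wl_i, unsigned long power_i,
					     unsigned long wl_j, unsigned long power_j)
	{
		return wl_i * power_j > wl_j * power_i;
	}
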
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[peterz: expanded changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5018,7 +5018,7 @@ static struct rq *find_busiest_queue(str
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long max_load = 0;
+	unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
 	int i;
 
 	for_each_cpu(i, sched_group_cpus(group)) {
@@ -5049,10 +5049,9 @@ static struct rq *find_busiest_queue(str
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_POWER_SCALE) / power;
-
-		if (wl > max_load) {
-			max_load = wl;
+		if (wl * busiest_power > busiest_load * power) {
+			busiest_load = wl;
+			busiest_power = power;
 			busiest = rq;
 		}
 	}




* [PATCH 2/6] sched: Factor out code to should_we_balance()
  2013-08-16 10:12 [PATCH 0/6] Various load-balance cleanups/optimizations Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 1/6] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
@ 2013-08-16 10:12 ` Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 3/6] sched: Clean-up struct sd_lb_stat Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:12 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: joonsoo_kim-sched-factor_out_code_to_should_we_balance.patch --]
[-- Type: text/plain, Size: 8966 bytes --]

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Checking whether this cpu is appropriate to balance is currently embedded
in update_sg_lb_stats(), even though it has no direct relationship to that
function. There is not enough reason to place this check in
update_sg_lb_stats(), other than saving one iteration over
sched_group_cpus.

In this patch, I factor this check out into a should_we_balance() function,
which is called before doing the actual load-balancing work. If this cpu is
not a candidate for balancing, we quit immediately.

With this change, we save the cost of two memsets and can expect better
compiler optimization.

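For illustration, the resulting control flow in load_balance() is roughly
(simplified sketch; see the actual hunks in the diff below):

	/* bail out early if this cpu should not balance at this level */
	if (!(*should_balance = should_we_balance(&env)))
		goto out_balanced;

	/* only then pay for the statistics pass */
	group = find_busiest_group(&env);
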
Below is the result of this patch:

* Vanilla *
   text	   data	    bss	    dec	    hex	filename
  34499	   1136	    116	  35751	   8ba7	kernel/sched/fair.o

* Patched *
   text	   data	    bss	    dec	    hex	filename
  34243	   1136	    116	  35495	   8aa7	kernel/sched/fair.o

In addition, rename @balance to @should_balance in order to represent
its purpose more clearly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[peterz: style changes and a fix in should_we_balance() ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |  111 +++++++++++++++++++++++++---------------------------
 1 file changed, 54 insertions(+), 57 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4510,22 +4510,17 @@ fix_small_capacity(struct sched_domain *
  * @group: sched_group whose statistics are to be updated.
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
- * @balance: Should we balance.
  * @sgs: variable to hold the statistics for this group.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
-			int local_group, int *balance, struct sg_lb_stats *sgs)
+			int local_group, struct sg_lb_stats *sgs)
 {
 	unsigned long nr_running, max_nr_running, min_nr_running;
 	unsigned long load, max_cpu_load, min_cpu_load;
-	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
 	int i;
 
-	if (local_group)
-		balance_cpu = group_balance_cpu(group);
-
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
@@ -4538,15 +4533,9 @@ static inline void update_sg_lb_stats(st
 		nr_running = rq->nr_running;
 
 		/* Bias balancing toward cpus of our domain */
-		if (local_group) {
-			if (idle_cpu(i) && !first_idle_cpu &&
-					cpumask_test_cpu(i, sched_group_mask(group))) {
-				first_idle_cpu = 1;
-				balance_cpu = i;
-			}
-
+		if (local_group)
 			load = target_load(i, load_idx);
-		} else {
+		else {
 			load = source_load(i, load_idx);
 			if (load > max_cpu_load)
 				max_cpu_load = load;
@@ -4566,22 +4555,9 @@ static inline void update_sg_lb_stats(st
 			sgs->idle_cpus++;
 	}
 
-	/*
-	 * First idle cpu or the first cpu(busiest) in this sched group
-	 * is eligible for doing load balancing at this and above
-	 * domains. In the newly idle case, we will allow all the cpu's
-	 * to do the newly idle load balance.
-	 */
-	if (local_group) {
-		if (env->idle != CPU_NEWLY_IDLE) {
-			if (balance_cpu != env->dst_cpu) {
-				*balance = 0;
-				return;
-			}
-			update_group_power(env->sd, env->dst_cpu);
-		} else if (time_after_eq(jiffies, group->sgp->next_update))
-			update_group_power(env->sd, env->dst_cpu);
-	}
+	if (local_group && (env->idle != CPU_NEWLY_IDLE ||
+			time_after_eq(jiffies, group->sgp->next_update)))
+		update_group_power(env->sd, env->dst_cpu);
 
 	/* Adjust by relative CPU power of the group */
 	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
@@ -4663,7 +4639,7 @@ static bool update_sd_pick_busiest(struc
  * @sds: variable to hold the statistics for this sched_domain.
  */
 static inline void update_sd_lb_stats(struct lb_env *env,
-					int *balance, struct sd_lb_stats *sds)
+					struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
@@ -4680,10 +4656,7 @@ static inline void update_sd_lb_stats(st
 
 		local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
 		memset(&sgs, 0, sizeof(sgs));
-		update_sg_lb_stats(env, sg, load_idx, local_group, balance, &sgs);
-
-		if (local_group && !(*balance))
-			return;
+		update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);
 
 		sds->total_load += sgs.group_load;
 		sds->total_pwr += sg->sgp->power;
@@ -4916,8 +4889,6 @@ static inline void calculate_imbalance(s
  * to restore balance.
  *
  * @env: The load balancing environment.
- * @balance: Pointer to a variable indicating if this_cpu
- *	is the appropriate cpu to perform load balancing at this_level.
  *
  * Return:	- The busiest group if imbalance exists.
  *		- If no imbalance and user has opted for power-savings balance,
@@ -4925,7 +4896,7 @@ static inline void calculate_imbalance(s
  *		   put to idle by rebalancing its tasks onto our group.
  */
 static struct sched_group *
-find_busiest_group(struct lb_env *env, int *balance)
+find_busiest_group(struct lb_env *env)
 {
 	struct sd_lb_stats sds;
 
@@ -4935,14 +4906,7 @@ find_busiest_group(struct lb_env *env, i
 	 * Compute the various statistics relavent for load balancing at
 	 * this level.
 	 */
-	update_sd_lb_stats(env, balance, &sds);
-
-	/*
-	 * this_cpu is not the appropriate cpu to perform load balancing at
-	 * this level.
-	 */
-	if (!(*balance))
-		goto ret;
+	update_sd_lb_stats(env, &sds);
 
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
@@ -5006,7 +4970,6 @@ find_busiest_group(struct lb_env *env, i
 	return sds.busiest;
 
 out_balanced:
-ret:
 	env->imbalance = 0;
 	return NULL;
 }
@@ -5088,13 +5051,47 @@ static int need_active_balance(struct lb
 
 static int active_load_balance_cpu_stop(void *data);
 
+static int should_we_balance(struct lb_env *env)
+{
+	struct sched_group *sg = env->sd->groups;
+	struct cpumask *sg_cpus, *sg_mask;
+	int cpu, balance_cpu = -1;
+
+	/*
+	 * In the newly idle case, we will allow all the cpu's
+	 * to do the newly idle load balance.
+	 */
+	if (env->idle == CPU_NEWLY_IDLE)
+		return 1;
+
+	sg_cpus = sched_group_cpus(sg);
+	sg_mask = sched_group_mask(sg);
+	/* Try to find first idle cpu */
+	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
+		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
+			continue;
+
+		balance_cpu = cpu;
+		break;
+	}
+
+	if (balance_cpu == -1)
+		balance_cpu = group_balance_cpu(sg);
+
+	/*
+	 * First idle cpu or the first cpu(busiest) in this sched group
+	 * is eligible for doing load balancing at this and above domains.
+	 */
+	return balance_cpu != env->dst_cpu;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
  */
 static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
-			int *balance)
+			int *should_balance)
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
 	struct sched_group *group;
@@ -5123,12 +5120,11 @@ static int load_balance(int this_cpu, st
 
 	schedstat_inc(sd, lb_count[idle]);
 
-redo:
-	group = find_busiest_group(&env, balance);
-
-	if (*balance == 0)
+	if (!(*should_balance = should_we_balance(&env)))
 		goto out_balanced;
 
+redo:
+	group = find_busiest_group(&env);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
 		goto out_balanced;
@@ -5340,7 +5336,7 @@ void idle_balance(int this_cpu, struct r
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
-		int balance = 1;
+		int should_balance;
 
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5348,7 +5344,8 @@ void idle_balance(int this_cpu, struct r
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 			/* If we've pulled tasks over stop searching: */
 			pulled_task = load_balance(this_cpu, this_rq,
-						   sd, CPU_NEWLY_IDLE, &balance);
+						   sd, CPU_NEWLY_IDLE,
+						   &should_balance);
 		}
 
 		interval = msecs_to_jiffies(sd->balance_interval);
@@ -5586,7 +5583,7 @@ void update_max_interval(void)
  */
 static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 {
-	int balance = 1;
+	int should_balance = 1;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long interval;
 	struct sched_domain *sd;
@@ -5618,7 +5615,7 @@ static void rebalance_domains(int cpu, e
 		}
 
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (load_balance(cpu, rq, sd, idle, &balance)) {
+			if (load_balance(cpu, rq, sd, idle, &should_balance)) {
 				/*
 				 * The LBF_SOME_PINNED logic could have changed
 				 * env->dst_cpu, so we can't know our idle
@@ -5641,7 +5638,7 @@ static void rebalance_domains(int cpu, e
 		 * CPU in our sched group which is doing load balancing more
 		 * actively.
 		 */
-		if (!balance)
+		if (!should_balance)
 			break;
 	}
 	rcu_read_unlock();




* [PATCH 3/6] sched: Clean-up struct sd_lb_stat
  2013-08-16 10:12 [PATCH 0/6] Various load-balance cleanups/optimizations Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 1/6] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 2/6] sched: Factor out code to should_we_balance() Peter Zijlstra
@ 2013-08-16 10:12 ` Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 4/6] sched, fair: Remove duplicate load_per_task computations Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:12 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: joonsoo_kim-sched-clean-up_struct_sd_lb_stat.patch --]
[-- Type: text/plain, Size: 15761 bytes --]

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

There is no reason to maintain separate variables for this_group
and busiest_group in sd_lb_stats, except to save some space.
But this structure is always allocated on the stack, so the saving
isn't really beneficial [peterz: reducing stack space is good; in this
case readability increases enough that I think it's still beneficial].

This patch unifies these variables, so IMO readability is improved.

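For illustration, the flat fields turn into accesses through the embedded
per-group stats, roughly (sketch only; see the patch below):

	sds->this_load           becomes  sds->this_stat.avg_load
	sds->max_load            becomes  sds->busiest_stat.avg_load
	sds->busiest_nr_running  becomes  sds->busiest_stat.sum_nr_running
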
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[peterz: lots of style edits, a few fixes, a rename and
         a further memset optimization ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |  229 +++++++++++++++++++++++++---------------------------
 1 file changed, 114 insertions(+), 115 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4277,36 +4277,6 @@ static unsigned long task_h_load(struct
 
 /********** Helpers for find_busiest_group ************************/
 /*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
 struct sg_lb_stats {
@@ -4314,6 +4284,7 @@ struct sg_lb_stats {
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long load_per_task;
 	unsigned long group_capacity;
 	unsigned long idle_cpus;
 	unsigned long group_weight;
@@ -4321,6 +4292,21 @@ struct sg_lb_stats {
 	int group_has_capacity; /* Is there extra capacity in the group? */
 };
 
+/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		 during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest;	/* Busiest group in this sd */
+	struct sched_group *this;	/* Local group in this sd */
+	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_pwr;	/* Total power of all groups in sd */
+	unsigned long avg_load;	/* Average load across all groups in sd */
+
+	struct sg_lb_stats this_stat;	/* Statistics of this group */
+	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+};
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4533,10 +4519,11 @@ static inline void update_sg_lb_stats(st
 		nr_running = rq->nr_running;
 
 		/* Bias balancing toward cpus of our domain */
-		if (local_group)
+		if (local_group) {
 			load = target_load(i, load_idx);
-		else {
+		} else {
 			load = source_load(i, load_idx);
+
 			if (load > max_cpu_load)
 				max_cpu_load = load;
 			if (min_cpu_load > load)
@@ -4578,10 +4565,12 @@ static inline void update_sg_lb_stats(st
 	    (max_nr_running - min_nr_running) > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
-						SCHED_POWER_SCALE);
+	sgs->group_capacity =
+		DIV_ROUND_CLOSEST(group->sgp->power, SCHED_POWER_SCALE);
+
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(env->sd, group);
+
 	sgs->group_weight = group->group_weight;
 
 	if (sgs->group_capacity > sgs->sum_nr_running)
@@ -4606,7 +4595,7 @@ static bool update_sd_pick_busiest(struc
 				   struct sched_group *sg,
 				   struct sg_lb_stats *sgs)
 {
-	if (sgs->avg_load <= sds->max_load)
+	if (sgs->avg_load <= sds->busiest_stat.avg_load)
 		return false;
 
 	if (sgs->sum_nr_running > sgs->group_capacity)
@@ -4643,7 +4632,7 @@ static inline void update_sd_lb_stats(st
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
-	struct sg_lb_stats sgs;
+	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
@@ -4652,14 +4641,17 @@ static inline void update_sd_lb_stats(st
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
+		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
 
 		local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
-		memset(&sgs, 0, sizeof(sgs));
-		update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);
+		if (local_group) {
+			sds->this = sg;
+			sgs = &sds->this_stat;
+		}
 
-		sds->total_load += sgs.group_load;
-		sds->total_pwr += sg->sgp->power;
+		memset(sgs, 0, sizeof(*sgs));
+		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -4671,26 +4663,17 @@ static inline void update_sd_lb_stats(st
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
-			sgs.group_capacity = min(sgs.group_capacity, 1UL);
+		if (prefer_sibling && !local_group &&
+				sds->this && sds->this_stat.group_has_capacity)
+			sgs->group_capacity = min(sgs->group_capacity, 1UL);
 
-		if (local_group) {
-			sds->this_load = sgs.avg_load;
-			sds->this = sg;
-			sds->this_nr_running = sgs.sum_nr_running;
-			sds->this_load_per_task = sgs.sum_weighted_load;
-			sds->this_has_capacity = sgs.group_has_capacity;
-			sds->this_idle_cpus = sgs.idle_cpus;
-		} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
-			sds->max_load = sgs.avg_load;
+		/* Now, start updating sd_lb_stats */
+		sds->total_load += sgs->group_load;
+		sds->total_pwr += sg->sgp->power;
+
+		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
-			sds->busiest_nr_running = sgs.sum_nr_running;
-			sds->busiest_idle_cpus = sgs.idle_cpus;
-			sds->busiest_group_capacity = sgs.group_capacity;
-			sds->busiest_load_per_task = sgs.sum_weighted_load;
-			sds->busiest_has_capacity = sgs.group_has_capacity;
-			sds->busiest_group_weight = sgs.group_weight;
-			sds->group_imb = sgs.group_imb;
+			sds->busiest_stat = *sgs;
 		}
 
 		sg = sg->next;
@@ -4734,8 +4717,8 @@ static int check_asym_packing(struct lb_
 	if (env->dst_cpu > busiest_cpu)
 		return 0;
 
-	env->imbalance = DIV_ROUND_CLOSEST(
-		sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
+	env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
+				sds->busiest->sgp->power, SCHED_POWER_SCALE);
 
 	return 1;
 }
@@ -4753,24 +4736,23 @@ void fix_small_imbalance(struct lb_env *
 	unsigned long tmp, pwr_now = 0, pwr_move = 0;
 	unsigned int imbn = 2;
 	unsigned long scaled_busy_load_per_task;
+	struct sg_lb_stats *this, *busiest;
 
-	if (sds->this_nr_running) {
-		sds->this_load_per_task /= sds->this_nr_running;
-		if (sds->busiest_load_per_task >
-				sds->this_load_per_task)
-			imbn = 1;
-	} else {
-		sds->this_load_per_task =
-			cpu_avg_load_per_task(env->dst_cpu);
-	}
+	this = &sds->this_stat;
+	busiest = &sds->busiest_stat;
+
+	if (!this->sum_nr_running)
+		this->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
+	else if (busiest->load_per_task > this->load_per_task)
+		imbn = 1;
 
-	scaled_busy_load_per_task = sds->busiest_load_per_task
-					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->sgp->power;
+	scaled_busy_load_per_task =
+		(busiest->load_per_task * SCHED_POWER_SCALE) /
+		sds->busiest->sgp->power;
 
-	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
-			(scaled_busy_load_per_task * imbn)) {
-		env->imbalance = sds->busiest_load_per_task;
+	if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
+	    (scaled_busy_load_per_task * imbn)) {
+		env->imbalance = busiest->load_per_task;
 		return;
 	}
 
@@ -4781,33 +4763,36 @@ void fix_small_imbalance(struct lb_env *
 	 */
 
 	pwr_now += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load);
+			min(busiest->load_per_task, busiest->avg_load);
 	pwr_now += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load);
+			min(this->load_per_task, this->avg_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+	tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
 		sds->busiest->sgp->power;
-	if (sds->max_load > tmp)
+	if (busiest->avg_load > tmp) {
 		pwr_move += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load - tmp);
+			    min(busiest->load_per_task,
+				busiest->avg_load - tmp);
+	}
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->sgp->power <
-		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->sgp->power) /
+	if (busiest->avg_load * sds->busiest->sgp->power <
+	    busiest->load_per_task * SCHED_POWER_SCALE) {
+		tmp = (busiest->avg_load * sds->busiest->sgp->power) /
 			sds->this->sgp->power;
-	else
-		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+	} else {
+		tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
 			sds->this->sgp->power;
+	}
 	pwr_move += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load + tmp);
+			min(this->load_per_task, this->avg_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
 	if (pwr_move > pwr_now)
-		env->imbalance = sds->busiest_load_per_task;
+		env->imbalance = busiest->load_per_task;
 }
 
 /**
@@ -4819,11 +4804,22 @@ void fix_small_imbalance(struct lb_env *
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	unsigned long max_pull, load_above_capacity = ~0UL;
+	struct sg_lb_stats *this, *busiest;
 
-	sds->busiest_load_per_task /= sds->busiest_nr_running;
-	if (sds->group_imb) {
-		sds->busiest_load_per_task =
-			min(sds->busiest_load_per_task, sds->avg_load);
+	this = &sds->this_stat;
+	if (this->sum_nr_running) {
+		this->load_per_task =
+			this->sum_weighted_load / this->sum_nr_running;
+	}
+
+	busiest = &sds->busiest_stat;
+	/* busiest must have some tasks */
+	busiest->load_per_task =
+		busiest->sum_weighted_load / busiest->sum_nr_running;
+
+	if (busiest->group_imb) {
+		busiest->load_per_task =
+			min(busiest->load_per_task, sds->avg_load);
 	}
 
 	/*
@@ -4831,20 +4827,19 @@ static inline void calculate_imbalance(s
 	 * max load less than avg load(as we skip the groups at or below
 	 * its cpu_power, while calculating max_load..)
 	 */
-	if (sds->max_load < sds->avg_load) {
+	if (busiest->avg_load < sds->avg_load) {
 		env->imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
 
-	if (!sds->group_imb) {
+	if (!busiest->group_imb) {
 		/*
 		 * Don't want to pull so many tasks that a group would go idle.
 		 */
-		load_above_capacity = (sds->busiest_nr_running -
-						sds->busiest_group_capacity);
+		load_above_capacity =
+			(busiest->sum_nr_running - busiest->group_capacity);
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-
 		load_above_capacity /= sds->busiest->sgp->power;
 	}
 
@@ -4858,12 +4853,14 @@ static inline void calculate_imbalance(s
 	 * Be careful of negative numbers as they'll appear as very large values
 	 * with unsigned longs.
 	 */
-	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+	max_pull = min(busiest->avg_load - sds->avg_load,
+		       load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
-	env->imbalance = min(max_pull * sds->busiest->sgp->power,
-		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
-			/ SCHED_POWER_SCALE;
+	env->imbalance = min(
+		max_pull * sds->busiest->sgp->power,
+		(sds->avg_load - this->avg_load) * sds->this->sgp->power
+	) / SCHED_POWER_SCALE;
 
 	/*
 	 * if *imbalance is less than the average load per runnable task
@@ -4871,9 +4868,8 @@ static inline void calculate_imbalance(s
 	 * a think about bumping its value to force at least one task to be
 	 * moved
 	 */
-	if (env->imbalance < sds->busiest_load_per_task)
+	if (env->imbalance < busiest->load_per_task)
 		return fix_small_imbalance(env, sds);
-
 }
 
 /******* find_busiest_group() helpers end here *********************/
@@ -4895,54 +4891,56 @@ static inline void calculate_imbalance(s
  *		   return the least loaded group whose CPUs can be
  *		   put to idle by rebalancing its tasks onto our group.
  */
-static struct sched_group *
-find_busiest_group(struct lb_env *env)
+static struct sched_group *find_busiest_group(struct lb_env *env)
 {
+	struct sg_lb_stats *this, *busiest;
 	struct sd_lb_stats sds;
 
-	memset(&sds, 0, sizeof(sds));
+	memset(&sds, 0, sizeof(sds) - 2*sizeof(struct sg_lb_stats));
 
 	/*
 	 * Compute the various statistics relavent for load balancing at
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
+	this = &sds.this_stat;
+	busiest = &sds.busiest_stat;
 
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
 
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || sds.busiest_nr_running == 0)
+	if (!sds.busiest || busiest->sum_nr_running == 0)
 		goto out_balanced;
 
-	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
+	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
 	 * work because they assumes all things are equal, which typically
 	 * isn't true due to cpus_allowed constraints and the like.
 	 */
-	if (sds.group_imb)
+	if (busiest->group_imb)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
-			!sds.busiest_has_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && this->group_has_capacity &&
+	    !busiest->group_has_capacity)
 		goto force_balance;
 
 	/*
 	 * If the local group is more busy than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (sds.this_load >= sds.max_load)
+	if (this->avg_load >= busiest->avg_load)
 		goto out_balanced;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
-	if (sds.this_load >= sds.avg_load)
+	if (this->avg_load >= sds.avg_load)
 		goto out_balanced;
 
 	if (env->idle == CPU_IDLE) {
@@ -4952,15 +4950,16 @@ find_busiest_group(struct lb_env *env)
 		 * there is no imbalance between this and busiest group
 		 * wrt to idle cpu's, it is balanced.
 		 */
-		if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
-		    sds.busiest_nr_running <= sds.busiest_group_weight)
+		if ((this->idle_cpus <= busiest->idle_cpus + 1) &&
+		    busiest->sum_nr_running <= busiest->group_weight)
 			goto out_balanced;
 	} else {
 		/*
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
 		 * imbalance_pct to be conservative.
 		 */
-		if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load)
+		if (100 * busiest->avg_load <=
+				env->sd->imbalance_pct * this->avg_load)
 			goto out_balanced;
 	}
 




* [PATCH 4/6] sched, fair: Remove duplicate load_per_task computations
  2013-08-16 10:12 [PATCH 0/6] Various load-balance cleanups/optimizations Peter Zijlstra
                   ` (2 preceding siblings ...)
  2013-08-16 10:12 ` [PATCH 3/6] sched: Clean-up struct sd_lb_stat Peter Zijlstra
@ 2013-08-16 10:12 ` Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 5/6] sched, fair: Make group power more consistent Peter Zijlstra
  2013-08-16 10:12 ` [PATCH 6/6] sched, fair: Rework and comment the group_imb code Peter Zijlstra
  5 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:12 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: peterz-more-fbg-cleanups.patch --]
[-- Type: text/plain, Size: 1542 bytes --]

Since we already compute (but don't store) the sgs load_per_task value
in update_sg_lb_stats(), we might as well store it and not re-compute
it later on.

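A minimal sketch of what that amounts to (illustration only; the real
change is in the diff below):

	/* computed once while gathering the group statistics ... */
	if (sgs->sum_nr_running)
		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

	/* ... so calculate_imbalance() can simply read sgs->load_per_task */
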
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4504,7 +4504,6 @@ static inline void update_sg_lb_stats(st
 {
 	unsigned long nr_running, max_nr_running, min_nr_running;
 	unsigned long load, max_cpu_load, min_cpu_load;
-	unsigned long avg_load_per_task = 0;
 	int i;
 
 	/* Tally up the load of all CPUs in the group */
@@ -4559,9 +4558,9 @@ static inline void update_sg_lb_stats(st
 	 *      the hierarchy?
 	 */
 	if (sgs->sum_nr_running)
-		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
+	if ((max_cpu_load - min_cpu_load) >= sgs->load_per_task &&
 	    (max_nr_running - min_nr_running) > 1)
 		sgs->group_imb = 1;
 
@@ -4807,15 +4806,7 @@ static inline void calculate_imbalance(s
 	struct sg_lb_stats *this, *busiest;
 
 	this = &sds->this_stat;
-	if (this->sum_nr_running) {
-		this->load_per_task =
-			this->sum_weighted_load / this->sum_nr_running;
-	}
-
 	busiest = &sds->busiest_stat;
-	/* busiest must have some tasks */
-	busiest->load_per_task =
-		busiest->sum_weighted_load / busiest->sum_nr_running;
 
 	if (busiest->group_imb) {
 		busiest->load_per_task =




* [PATCH 5/6] sched, fair: Make group power more consistent
  2013-08-16 10:12 [PATCH 0/6] Various load-balance cleanups/optimizations Peter Zijlstra
                   ` (3 preceding siblings ...)
  2013-08-16 10:12 ` [PATCH 4/6] sched, fair: Remove duplicate load_per_task computations Peter Zijlstra
@ 2013-08-16 10:12 ` Peter Zijlstra
  2013-08-19  4:17   ` Preeti U Murthy
  2013-08-16 10:12 ` [PATCH 6/6] sched, fair: Rework and comment the group_imb code Peter Zijlstra
  5 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:12 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: peterz-more-bfg-cleanups-2.patch --]
[-- Type: text/plain, Size: 4631 bytes --]

For easier access, fewer dereferences and a more consistent value, store
the group power in update_sg_lb_stats() and use it thereafter. The
actual value in sched_group::sched_group_power::power can change
throughout the load-balance pass if we're unlucky.

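A minimal sketch of the pattern (illustration only; the real changes are
in the diff below):

	/* read the group power once, when the group statistics are built ... */
	sgs->group_power = group->sgp->power;
	sgs->avg_load = (sgs->group_load * SCHED_POWER_SCALE) / sgs->group_power;

	/* ... and have later users consume the cached value, for example: */
	load_above_capacity /= busiest->group_power;
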
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4288,6 +4288,7 @@ struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long idle_cpus;
 	unsigned long group_weight;
+	unsigned long group_power;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
 };
@@ -4546,7 +4547,8 @@ static inline void update_sg_lb_stats(st
 		update_group_power(env->sd, env->dst_cpu);
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
+	sgs->group_power = group->sgp->power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -4565,7 +4567,7 @@ static inline void update_sg_lb_stats(st
 		sgs->group_imb = 1;
 
 	sgs->group_capacity =
-		DIV_ROUND_CLOSEST(group->sgp->power, SCHED_POWER_SCALE);
+		DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
 
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(env->sd, group);
@@ -4668,7 +4670,7 @@ static inline void update_sd_lb_stats(st
 
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
-		sds->total_pwr += sg->sgp->power;
+		sds->total_pwr += sgs->group_power;
 
 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -4716,8 +4718,9 @@ static int check_asym_packing(struct lb_
 	if (env->dst_cpu > busiest_cpu)
 		return 0;
 
-	env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
-				sds->busiest->sgp->power, SCHED_POWER_SCALE);
+	env->imbalance = DIV_ROUND_CLOSEST(
+		sds->busiest_stat.avg_load * sds->busiest_stat.group_power,
+		SCHED_POWER_SCALE);
 
 	return 1;
 }
@@ -4747,7 +4750,7 @@ void fix_small_imbalance(struct lb_env *
 
 	scaled_busy_load_per_task =
 		(busiest->load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->sgp->power;
+		busiest->group_power;
 
 	if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
 	    (scaled_busy_load_per_task * imbn)) {
@@ -4761,32 +4764,32 @@ void fix_small_imbalance(struct lb_env *
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->sgp->power *
+	pwr_now += busiest->group_power *
 			min(busiest->load_per_task, busiest->avg_load);
-	pwr_now += sds->this->sgp->power *
+	pwr_now += this->group_power *
 			min(this->load_per_task, this->avg_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
 	tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->sgp->power;
+		busiest->group_power;
 	if (busiest->avg_load > tmp) {
-		pwr_move += sds->busiest->sgp->power *
+		pwr_move += busiest->group_power *
 			    min(busiest->load_per_task,
 				busiest->avg_load - tmp);
 	}
 
 	/* Amount of load we'd add */
-	if (busiest->avg_load * sds->busiest->sgp->power <
+	if (busiest->avg_load * busiest->group_power <
 	    busiest->load_per_task * SCHED_POWER_SCALE) {
-		tmp = (busiest->avg_load * sds->busiest->sgp->power) /
-			sds->this->sgp->power;
+		tmp = (busiest->avg_load * busiest->group_power) /
+		      this->group_power;
 	} else {
 		tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
-			sds->this->sgp->power;
+		      this->group_power;
 	}
-	pwr_move += sds->this->sgp->power *
-			min(this->load_per_task, this->avg_load + tmp);
+	pwr_move += this->group_power *
+		    min(this->load_per_task, this->avg_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
@@ -4831,7 +4834,7 @@ static inline void calculate_imbalance(s
 			(busiest->sum_nr_running - busiest->group_capacity);
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-		load_above_capacity /= sds->busiest->sgp->power;
+		load_above_capacity /= busiest->group_power;
 	}
 
 	/*
@@ -4849,8 +4852,8 @@ static inline void calculate_imbalance(s
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
-		max_pull * sds->busiest->sgp->power,
-		(sds->avg_load - this->avg_load) * sds->this->sgp->power
+		max_pull * busiest->group_power,
+		(sds->avg_load - this->avg_load) * this->group_power
 	) / SCHED_POWER_SCALE;
 
 	/*




* [PATCH 6/6] sched, fair: Rework and comment the group_imb code
  2013-08-16 10:12 [PATCH 0/6] Various load-balance cleanups/optimizations Peter Zijlstra
                   ` (4 preceding siblings ...)
  2013-08-16 10:12 ` [PATCH 5/6] sched, fair: Make group power more consistent Peter Zijlstra
@ 2013-08-16 10:12 ` Peter Zijlstra
  2013-08-16 10:34   ` Peter Zijlstra
  5 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:12 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Rik van Riel, Peter Zijlstra

[-- Attachment #1: peterz-group_imb-frubbery.patch --]
[-- Type: text/plain, Size: 6882 bytes --]

Rik reported some weirdness due to the group_imb code. As a first step
toward looking at it, clean it up a little and add a few explanatory
comments.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |  123 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 89 insertions(+), 34 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4491,6 +4491,81 @@ fix_small_capacity(struct sched_domain *
 	return 0;
 }
 
+/*
+ * Group imbalance indicates (and tries to solve) the problem where balancing
+ * groups is inadequate due to tsk_cpus_allowed() constraints.
+ *
+ * Imagine a situation of two groups of 4 cpus each and 4 tasks each with a
+ * cpumask covering 1 cpu of the first group and 3 cpus of the second group.
+ * Something like:
+ *
+ * 	{ 0 1 2 3 } { 4 5 6 7 }
+ * 	        *     * * *
+ *
+ * If we were to balance group-wise we'd place two tasks in the first group and
+ * two tasks in the second group. Clearly this is undesired as it will overload
+ * cpu 3 and leave one of the cpus in the second group unused.
+ *
+ * The current solution to this issue is detecting the skew in the first group
+ * by noticing it has a cpu that is overloaded while the remaining cpus are
+ * idle -- or rather, there's a distinct imbalance in the cpus; see
+ * sg_imbalanced().
+ *
+ * When this is so detected; this group becomes a candidate for busiest; see
+ * update_sd_pick_busiest(). And calculate_imbalance() and
+ * find_busiest_group() avoid some of the usual balance conditions to allow it
+ * to create an effective group imbalance.
+ *
+ * This is a somewhat tricky proposition since the next run might not find the
+ * group imbalance and decide the groups need to be balanced again. A most
+ * subtle and fragile situation.
+ */
+
+struct sg_imb_stats {
+	unsigned long max_nr_running, min_nr_running;
+	unsigned long max_cpu_load, min_cpu_load;
+};
+
+static inline void init_sg_imb_stats(struct sg_imb_stats *sgi)
+{
+	sgi->max_cpu_load = sgi->max_nr_running = 0UL;
+	sgi->min_cpu_load = sgi->min_nr_running = ~0UL;
+}
+
+static inline void
+update_sg_imb_stats(struct sg_imb_stats *sgi,
+		    unsigned long load, unsigned long nr_running)
+{
+	if (load > sgi->max_cpu_load)
+		sgi->max_cpu_load = load;
+	if (sgi->min_cpu_load > load)
+		sgi->min_cpu_load = load;
+
+	if (nr_running > sgi->max_nr_running)
+		sgi->max_nr_running = nr_running;
+	if (sgi->min_nr_running > nr_running)
+		sgi->min_nr_running = nr_running;
+}
+
+static inline int
+sg_imbalanced(struct sg_lb_stats *sgs, struct sg_imb_stats *sgi)
+{
+	/*
+	 * Consider the group unbalanced when the imbalance is larger
+	 * than the average weight of a task.
+	 *
+	 * APZ: with cgroup the avg task weight can vary wildly and
+	 *      might not be a suitable number - should we keep a
+	 *      normalized nr_running number somewhere that negates
+	 *      the hierarchy?
+	 */
+	if ((sgi->max_cpu_load - sgi->min_cpu_load) >= sgs->load_per_task &&
+	    (sgi->max_nr_running - sgi->min_nr_running) > 1)
+		return 1;
+
+	return 0;
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -4503,15 +4578,12 @@ static inline void update_sg_lb_stats(st
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs)
 {
-	unsigned long nr_running, max_nr_running, min_nr_running;
-	unsigned long load, max_cpu_load, min_cpu_load;
+	struct sg_imb_stats sgi;
+	unsigned long nr_running;
+	unsigned long load;
 	int i;
 
-	/* Tally up the load of all CPUs in the group */
-	max_cpu_load = 0;
-	min_cpu_load = ~0UL;
-	max_nr_running = 0;
-	min_nr_running = ~0UL;
+	init_sg_imb_stats(&sgi);
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -4523,16 +4595,7 @@ static inline void update_sg_lb_stats(st
 			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
-
-			if (load > max_cpu_load)
-				max_cpu_load = load;
-			if (min_cpu_load > load)
-				min_cpu_load = load;
-
-			if (nr_running > max_nr_running)
-				max_nr_running = nr_running;
-			if (min_nr_running > nr_running)
-				min_nr_running = nr_running;
+			update_sg_imb_stats(&sgi, load, nr_running);
 		}
 
 		sgs->group_load += load;
@@ -4550,21 +4613,10 @@ static inline void update_sg_lb_stats(st
 	sgs->group_power = group->sgp->power;
 	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;
 
-	/*
-	 * Consider the group unbalanced when the imbalance is larger
-	 * than the average weight of a task.
-	 *
-	 * APZ: with cgroup the avg task weight can vary wildly and
-	 *      might not be a suitable number - should we keep a
-	 *      normalized nr_running number somewhere that negates
-	 *      the hierarchy?
-	 */
 	if (sgs->sum_nr_running)
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) >= sgs->load_per_task &&
-	    (max_nr_running - min_nr_running) > 1)
-		sgs->group_imb = 1;
+	sgs->group_imb = sg_imbalanced(sgs, &sgi);
 
 	sgs->group_capacity =
 		DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
@@ -4812,6 +4864,10 @@ static inline void calculate_imbalance(s
 	busiest = &sds->busiest_stat;
 
 	if (busiest->group_imb) {
+		/*
+		 * In the group_imb case we cannot rely on group-wide averages
+		 * to ensure cpu-load equilibrium, look at wider averages. XXX
+		 */
 		busiest->load_per_task =
 			min(busiest->load_per_task, sds->avg_load);
 	}
@@ -4829,6 +4885,8 @@ static inline void calculate_imbalance(s
 	if (!busiest->group_imb) {
 		/*
 		 * Don't want to pull so many tasks that a group would go idle.
+		 * Except of course for the group_imb case, since then we might
+		 * have to drop below capacity to reach cpu-load equilibrium.
 		 */
 		load_above_capacity =
 			(busiest->sum_nr_running - busiest->group_capacity);
@@ -4844,11 +4902,8 @@ static inline void calculate_imbalance(s
 	 * we also don't want to reduce the group load below the group capacity
 	 * (so that we can implement power-savings policies etc). Thus we look
 	 * for the minimum possible imbalance.
-	 * Be careful of negative numbers as they'll appear as very large values
-	 * with unsigned longs.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load,
-		       load_above_capacity);
+	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
@@ -4912,7 +4967,7 @@ static struct sched_group *find_busiest_
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
-	 * work because they assumes all things are equal, which typically
+	 * work because they assume all things are equal, which typically
 	 * isn't true due to cpus_allowed constraints and the like.
 	 */
 	if (busiest->group_imb)




* Re: [PATCH 6/6] sched, fair: Rework and comment the group_imb code
  2013-08-16 10:12 ` [PATCH 6/6] sched, fair: Rework and comment the group_imb code Peter Zijlstra
@ 2013-08-16 10:34   ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-16 10:34 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Joonsoo Kim, Rik van Riel

On Fri, Aug 16, 2013 at 12:12:27PM +0200, Peter Zijlstra wrote:
> +/*
> + * Group imbalance indicates (and tries to solve) the problem where balancing
> + * groups is inadequate due to tsk_cpus_allowed() constraints.
> + *
> + * Imagine a situation of two groups of 4 cpus each and 4 tasks each with a
> + * cpumask covering 1 cpu of the first group and 3 cpus of the second group.
> + * Something like:
> + *
> + * 	{ 0 1 2 3 } { 4 5 6 7 }
> + * 	        *     * * *

And yes, people think this is a reasonable thing to do :/

> + *
> + * If we were to balance group-wise we'd place two tasks in the first group and
> + * two tasks in the second group. Clearly this is undesired as it will overload
> + * cpu 3 and leave one of the cpus in the second group unused.
> + *
> + * The current solution to this issue is detecting the skew in the first group
> + * by noticing it has a cpu that is overloaded while the remaining cpus are
> + * idle -- or rather, there's a distinct imbalance in the cpus; see
> + * sg_imbalanced().
> + *
> + * When this is so detected; this group becomes a candidate for busiest; see
> + * update_sd_pick_busiest(). And calculate_imbalance() and
> + * find_busiest_group() avoid some of the usual balance conditions to allow it
> + * to create an effective group imbalance.
> + *
> + * This is a somewhat tricky proposition since the next run might not find the
> + * group imbalance and decide the groups need to be balanced again. A most
> + * subtle and fragile situation.
> + */

One of the things that I know about which can (and does) go wrong with
this is that we typically pull 'all' tasks to the balance cpu only to
then let future (lower) load-balance passes spread that out again.

But until it's spread out, we have this spike in the task distribution
that will trigger the group_imb condition.


* Re: [PATCH 5/6] sched, fair: Make group power more consistent
  2013-08-16 10:12 ` [PATCH 5/6] sched, fair: Make group power more consistent Peter Zijlstra
@ 2013-08-19  4:17   ` Preeti U Murthy
  2013-08-19 10:30     ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Preeti U Murthy @ 2013-08-19  4:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Joonsoo Kim

Hi Peter,

On 08/16/2013 03:42 PM, Peter Zijlstra wrote:

I have a few comments and clarifications to seek.

1. How are you ensuring from this patch that sgs->group_power does not
change over the course of load balancing?

    The only path to update_group_power() where sg->sgp->power gets
updated, is from update_sg_lb_stats(). You are updating sgs->group_power
in update_sg_lb_stats(). Any change to group->sgp->power will get
reflected in sgs->group_power as well right?

2. This point is aside from your patch. In the current implementation,
each time the cpu power gets updated in update_cpu_power(), should not
the power of the sched_groups comprising of that cpu also get updated?
Why wait till the load balancing is done at the sched_domain level of
that group, to update its group power?

Regards
Preeti U Murthy



* Re: [PATCH 5/6] sched, fair: Make group power more consistent
  2013-08-19  4:17   ` Preeti U Murthy
@ 2013-08-19 10:30     ` Peter Zijlstra
  2013-08-20  5:24       ` Preeti U Murthy
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-19 10:30 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Joonsoo Kim

On Mon, Aug 19, 2013 at 09:47:47AM +0530, Preeti U Murthy wrote:
> Hi Peter,
> 
> On 08/16/2013 03:42 PM, Peter Zijlstra wrote:
> 
> I have a few comments and clarifications to seek.
> 
> 1. How are you ensuring from this patch that sgs->group_power does not
> change over the course of load balancing?

Well, we only set it the one time when creating the sgs data.

>     The only path to update_group_power() where sg->sgp->power gets
> updated, is from update_sg_lb_stats(). You are updating sgs->group_power
> in update_sg_lb_stats(). Any change to group->sgp->power will get
> reflected in sgs->group_power as well right?

Nope, we set it to whatever value group->sgp->power is at the time of
sgs 'creation' and live with that value from then on. We do this after
the possible update_group_power() call.

That said, it is very rare to have group->sgp->power change during the
load-balance pass, we would have to trigger the time_after case for
NEWLY_IDLE and then get a concurrent !NEWLY_IDLE load-balance pass.

This patch takes away that possibility and uses a consistent group power
reading for the entire load-balance invocation as well as does away with
that double dereference all the time.

> 2. This point is aside from your patch. In the current implementation,
> each time the cpu power gets updated in update_cpu_power(), should not
> the power of the sched_groups containing that cpu also get updated?
> Why wait till the load balancing is done at the sched_domain level of
> that group, to update its group power?

What would be the advantage of doing so? We also take snapshots of
cpu/group/domain load, we don't update those either. Having all that
dynamically update during the load-balance pass would make it an
impossible situation. 

You'd fail to meet progress guarantees that way since you'd never be
able to pin-point a 'busiest' group/cpu because by the time you've made
any decision you have to go back to make it again because things might
have changed again.

So instead what we do is we take a snapshot and live with that state. If
the values change so fast that our load-balance pass is invalid by the
time we're finished, too bad, better luck next time.




* Re: [PATCH 5/6] sched, fair: Make group power more consistent
  2013-08-19 10:30     ` Peter Zijlstra
@ 2013-08-20  5:24       ` Preeti U Murthy
  0 siblings, 0 replies; 11+ messages in thread
From: Preeti U Murthy @ 2013-08-20  5:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Joonsoo Kim

Hi Peter,
Thank you for the clarification.

On 08/19/2013 04:00 PM, Peter Zijlstra wrote:
> On Mon, Aug 19, 2013 at 09:47:47AM +0530, Preeti U Murthy wrote:
>> Hi Peter,
>>
>> On 08/16/2013 03:42 PM, Peter Zijlstra wrote:
>>
>> I have a few comments and clarifications to seek.
>>
>> 1. How are you ensuring from this patch that sgs->group_power does not
>> change over the course of load balancing?
> 
> Well, we only set it the one time when creating the sgs data.
> 
>>     The only path to update_group_power() where sg->sgp->power gets
>> updated, is from update_sg_lb_stats(). You are updating sgs->group_power
>> in update_sg_lb_stats(). Any change to group->sgp->power will get
>> reflected in sgs->group_power as well right?
> 
> Nope, we set it to whatever value group->sgp->power is at the time of
> sgs 'creation' and live with that value from then on. We do this after
> the possible update_group_power() call.
> 
> That said, it is very rare to have group->sgp->power change during the
> load-balance pass, we would have to trigger the time_after case for
> NEWLY_IDLE and then get a concurrent !NEWLY_IDLE load-balance pass.
> 
> This patch takes away that possibility and uses a consistent group power
> reading for the entire load-balance invocation as well as does away with
> that double dereference all the time.

Fair enough. I overlooked the fact that "group" can be manipulated by
multiple load balancing passes whereas "sgs" is created during every
load balance pass.

> 
>> 2. This point is aside from your patch. In the current implementation,
>> each time the cpu power gets updated in update_cpu_power(), should not
>> the power of the sched_groups containing that cpu also get updated?
>> Why wait till the load balancing is done at the sched_domain level of
>> that group, to update its group power?
> 
> What would be the advantage of doing so? We also take snapshots of
> cpu/group/domain load, we don't update those either. Having all that
> dynamically update during the load-balance pass would make it an
> impossible situation. 
> 
> You'd fail to meet progress guarantees that way since you'd never be
> able to pin-point a 'busiest' group/cpu because by the time you've made
> any decision you have to go back to make it again because things might
> have changed again.
> 
> So instead what we do is we take a snapshot and live with that state. If
> the values change so fast that our load-balance pass is invalid by the
> time we're finished, too bad, better luck next time.

Agree. Thank you for the clarifications.

Regards
Preeti U Murthy
> 
> 


