* [PATCH 00/10] Various load-balance cleanups/optimizations -v2
@ 2013-08-19 16:00 Peter Zijlstra
  2013-08-19 16:00 ` [PATCH 01/10] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:00 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel, Peter Zijlstra


After poking at them a little more, I feel somewhat more confident.

I found one more bug, though this one was my own fault: we should also clear
sds->busiest_stat.avg_load, because update_sd_pick_busiest() reads it before
we set it.

The memset optimization and the fix for that are now in a separate patch.

Other than that, there's a sched_domain degeneration fix and a new way to
detect group_imb (which relies on the sched_domain fix).



* [PATCH 01/10] sched: Remove one division operation in find_busiest_queue()
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
@ 2013-08-19 16:00 ` Peter Zijlstra
  2013-08-22  8:58   ` Paul Turner
  2013-08-19 16:01 ` [PATCH 02/10] sched: Factor out code to should_we_balance() Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:00 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel, Peter Zijlstra

[-- Attachment #1: joonsoo_kim-sched-remove_one_division_operation_in_find_buiest_queue.patch --]
[-- Type: text/plain, Size: 1253 bytes --]

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Remove one division operation in find_busiest_queue() by using
cross-multiplication:

	wl_i / power_i > wl_j / power_j  <=>  wl_i * power_j > wl_j * power_i
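
A minimal standalone sketch of the comparison (illustrative only; the
function name is made up, the real change is in the diff below):

	/* "is queue i busier than queue j?" without a division */
	static inline int busier(unsigned long wl_i, unsigned long power_i,
				 unsigned long wl_j, unsigned long power_j)
	{
		/* wl_i / power_i > wl_j / power_j  <=>  wl_i * power_j > wl_j * power_i */
		return wl_i * power_j > wl_j * power_i;
	}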

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[peterz: expanded changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375778203-31343-2-git-send-email-iamjoonsoo.kim@lge.com
---
 kernel/sched/fair.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5018,7 +5018,7 @@ static struct rq *find_busiest_queue(str
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long max_load = 0;
+	unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
 	int i;
 
 	for_each_cpu(i, sched_group_cpus(group)) {
@@ -5049,10 +5049,9 @@ static struct rq *find_busiest_queue(str
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_POWER_SCALE) / power;
-
-		if (wl > max_load) {
-			max_load = wl;
+		if (wl * busiest_power > busiest_load * power) {
+			busiest_load = wl;
+			busiest_power = power;
 			busiest = rq;
 		}
 	}




* [PATCH 02/10] sched: Factor out code to should_we_balance()
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
  2013-08-19 16:00 ` [PATCH 01/10] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-22  9:58   ` Paul Turner
  2013-08-19 16:01 ` [PATCH 03/10] sched: Clean-up struct sd_lb_stat Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: joonsoo_kim-sched-factor_out_code_to_should_we_balance.patch --]
[-- Type: text/plain, Size: 9054 bytes --]

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Checking whether this cpu is an appropriate target for balancing is
currently embedded in update_sg_lb_stats(), even though the check has no
direct relationship to that function. There is little reason to keep it
there, other than saving one iteration over sched_group_cpus.

In this patch, I factor this check out into a new should_we_balance()
function and consult it before doing any actual load-balancing work. If
this cpu is not a candidate for balancing, we bail out immediately.

With this change we save two memset calls and can expect better compiler
optimization.

Below is the result of this patch.

* Vanilla *
   text	   data	    bss	    dec	    hex	filename
  34499	   1136	    116	  35751	   8ba7	kernel/sched/fair.o

* Patched *
   text	   data	    bss	    dec	    hex	filename
  34243	   1136	    116	  35495	   8aa7	kernel/sched/fair.o

In addition, rename @balance to @should_balance in order to represent
its purpose more clearly.
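
A condensed sketch of the resulting flow in load_balance(), trimmed from
the diff below for illustration:

	if (!(*should_balance = should_we_balance(&env)))
		goto out_balanced;	/* another cpu in the group owns balancing here */
redo:
	group = find_busiest_group(&env);
	/* ... proceed with the actual balancing ... */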

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[peterz: style changes and a fix in should_we_balance() ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375778203-31343-3-git-send-email-iamjoonsoo.kim@lge.com
---
 kernel/sched/fair.c |  111 +++++++++++++++++++++++++---------------------------
 1 file changed, 54 insertions(+), 57 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4510,22 +4510,17 @@ fix_small_capacity(struct sched_domain *
  * @group: sched_group whose statistics are to be updated.
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
- * @balance: Should we balance.
  * @sgs: variable to hold the statistics for this group.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
-			int local_group, int *balance, struct sg_lb_stats *sgs)
+			int local_group, struct sg_lb_stats *sgs)
 {
 	unsigned long nr_running, max_nr_running, min_nr_running;
 	unsigned long load, max_cpu_load, min_cpu_load;
-	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
 	int i;
 
-	if (local_group)
-		balance_cpu = group_balance_cpu(group);
-
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
@@ -4538,15 +4533,9 @@ static inline void update_sg_lb_stats(st
 		nr_running = rq->nr_running;
 
 		/* Bias balancing toward cpus of our domain */
-		if (local_group) {
-			if (idle_cpu(i) && !first_idle_cpu &&
-					cpumask_test_cpu(i, sched_group_mask(group))) {
-				first_idle_cpu = 1;
-				balance_cpu = i;
-			}
-
+		if (local_group)
 			load = target_load(i, load_idx);
-		} else {
+		else {
 			load = source_load(i, load_idx);
 			if (load > max_cpu_load)
 				max_cpu_load = load;
@@ -4566,22 +4555,9 @@ static inline void update_sg_lb_stats(st
 			sgs->idle_cpus++;
 	}
 
-	/*
-	 * First idle cpu or the first cpu(busiest) in this sched group
-	 * is eligible for doing load balancing at this and above
-	 * domains. In the newly idle case, we will allow all the cpu's
-	 * to do the newly idle load balance.
-	 */
-	if (local_group) {
-		if (env->idle != CPU_NEWLY_IDLE) {
-			if (balance_cpu != env->dst_cpu) {
-				*balance = 0;
-				return;
-			}
-			update_group_power(env->sd, env->dst_cpu);
-		} else if (time_after_eq(jiffies, group->sgp->next_update))
-			update_group_power(env->sd, env->dst_cpu);
-	}
+	if (local_group && (env->idle != CPU_NEWLY_IDLE ||
+			time_after_eq(jiffies, group->sgp->next_update)))
+		update_group_power(env->sd, env->dst_cpu);
 
 	/* Adjust by relative CPU power of the group */
 	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
@@ -4663,7 +4639,7 @@ static bool update_sd_pick_busiest(struc
  * @sds: variable to hold the statistics for this sched_domain.
  */
 static inline void update_sd_lb_stats(struct lb_env *env,
-					int *balance, struct sd_lb_stats *sds)
+					struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
@@ -4680,10 +4656,7 @@ static inline void update_sd_lb_stats(st
 
 		local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
 		memset(&sgs, 0, sizeof(sgs));
-		update_sg_lb_stats(env, sg, load_idx, local_group, balance, &sgs);
-
-		if (local_group && !(*balance))
-			return;
+		update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);
 
 		sds->total_load += sgs.group_load;
 		sds->total_pwr += sg->sgp->power;
@@ -4916,8 +4889,6 @@ static inline void calculate_imbalance(s
  * to restore balance.
  *
  * @env: The load balancing environment.
- * @balance: Pointer to a variable indicating if this_cpu
- *	is the appropriate cpu to perform load balancing at this_level.
  *
  * Return:	- The busiest group if imbalance exists.
  *		- If no imbalance and user has opted for power-savings balance,
@@ -4925,7 +4896,7 @@ static inline void calculate_imbalance(s
  *		   put to idle by rebalancing its tasks onto our group.
  */
 static struct sched_group *
-find_busiest_group(struct lb_env *env, int *balance)
+find_busiest_group(struct lb_env *env)
 {
 	struct sd_lb_stats sds;
 
@@ -4935,14 +4906,7 @@ find_busiest_group(struct lb_env *env, i
 	 * Compute the various statistics relavent for load balancing at
 	 * this level.
 	 */
-	update_sd_lb_stats(env, balance, &sds);
-
-	/*
-	 * this_cpu is not the appropriate cpu to perform load balancing at
-	 * this level.
-	 */
-	if (!(*balance))
-		goto ret;
+	update_sd_lb_stats(env, &sds);
 
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
@@ -5006,7 +4970,6 @@ find_busiest_group(struct lb_env *env, i
 	return sds.busiest;
 
 out_balanced:
-ret:
 	env->imbalance = 0;
 	return NULL;
 }
@@ -5088,13 +5051,47 @@ static int need_active_balance(struct lb
 
 static int active_load_balance_cpu_stop(void *data);
 
+static int should_we_balance(struct lb_env *env)
+{
+	struct sched_group *sg = env->sd->groups;
+	struct cpumask *sg_cpus, *sg_mask;
+	int cpu, balance_cpu = -1;
+
+	/*
+	 * In the newly idle case, we will allow all the cpu's
+	 * to do the newly idle load balance.
+	 */
+	if (env->idle == CPU_NEWLY_IDLE)
+		return 1;
+
+	sg_cpus = sched_group_cpus(sg);
+	sg_mask = sched_group_mask(sg);
+	/* Try to find first idle cpu */
+	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
+		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
+			continue;
+
+		balance_cpu = cpu;
+		break;
+	}
+
+	if (balance_cpu == -1)
+		balance_cpu = group_balance_cpu(sg);
+
+	/*
+	 * First idle cpu or the first cpu(busiest) in this sched group
+	 * is eligible for doing load balancing at this and above domains.
+	 */
+	return balance_cpu == env->dst_cpu;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
  */
 static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
-			int *balance)
+			int *should_balance)
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
 	struct sched_group *group;
@@ -5123,12 +5120,11 @@ static int load_balance(int this_cpu, st
 
 	schedstat_inc(sd, lb_count[idle]);
 
-redo:
-	group = find_busiest_group(&env, balance);
-
-	if (*balance == 0)
+	if (!(*should_balance = should_we_balance(&env)))
 		goto out_balanced;
 
+redo:
+	group = find_busiest_group(&env);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
 		goto out_balanced;
@@ -5340,7 +5336,7 @@ void idle_balance(int this_cpu, struct r
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
-		int balance = 1;
+		int should_balance;
 
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5348,7 +5344,8 @@ void idle_balance(int this_cpu, struct r
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 			/* If we've pulled tasks over stop searching: */
 			pulled_task = load_balance(this_cpu, this_rq,
-						   sd, CPU_NEWLY_IDLE, &balance);
+						   sd, CPU_NEWLY_IDLE,
+						   &should_balance);
 		}
 
 		interval = msecs_to_jiffies(sd->balance_interval);
@@ -5586,7 +5583,7 @@ void update_max_interval(void)
  */
 static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 {
-	int balance = 1;
+	int should_balance = 1;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long interval;
 	struct sched_domain *sd;
@@ -5618,7 +5615,7 @@ static void rebalance_domains(int cpu, e
 		}
 
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (load_balance(cpu, rq, sd, idle, &balance)) {
+			if (load_balance(cpu, rq, sd, idle, &should_balance)) {
 				/*
 				 * The LBF_SOME_PINNED logic could have changed
 				 * env->dst_cpu, so we can't know our idle
@@ -5641,7 +5638,7 @@ static void rebalance_domains(int cpu, e
 		 * CPU in our sched group which is doing load balancing more
 		 * actively.
 		 */
-		if (!balance)
+		if (!should_balance)
 			break;
 	}
 	rcu_read_unlock();




* [PATCH 03/10] sched: Clean-up struct sd_lb_stat
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
  2013-08-19 16:00 ` [PATCH 01/10] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
  2013-08-19 16:01 ` [PATCH 02/10] sched: Factor out code to should_we_balance() Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-24 10:09   ` Paul Turner
  2013-08-26  2:56   ` Lei Wen
  2013-08-19 16:01 ` [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: joonsoo_kim-sched-clean-up_struct_sd_lb_stat.patch --]
[-- Type: text/plain, Size: 15654 bytes --]

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

There is no reason to maintain separate variables for this_group
and busiest_group in sd_lb_stats, except to save some space.
But this structure is always allocated on the stack, so the saving
isn't really beneficial [peterz: reducing stack space is good; in this
case readability increases enough that I think it's still beneficial].

This patch unifies these variables, which IMO improves readability.
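
A condensed before/after of the update path, trimmed from the diff below
for illustration:

	/* before: sd_lb_stats carries a flattened copy of the busiest group's stats */
	sds->max_load           = sgs.avg_load;
	sds->busiest            = sg;
	sds->busiest_nr_running = sgs.sum_nr_running;
	/* ... and several more field-by-field assignments ... */

	/* after: one nested sg_lb_stats per group, copied wholesale */
	sds->busiest      = sg;
	sds->busiest_stat = *sgs;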

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[peterz: lots of style edits, a few fixes and a rename]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com
---
 kernel/sched/fair.c |  225 +++++++++++++++++++++++++---------------------------
 1 file changed, 112 insertions(+), 113 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4277,36 +4277,6 @@ static unsigned long task_h_load(struct
 
 /********** Helpers for find_busiest_group ************************/
 /*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
 struct sg_lb_stats {
@@ -4314,6 +4284,7 @@ struct sg_lb_stats {
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long load_per_task;
 	unsigned long group_capacity;
 	unsigned long idle_cpus;
 	unsigned long group_weight;
@@ -4321,6 +4292,21 @@ struct sg_lb_stats {
 	int group_has_capacity; /* Is there extra capacity in the group? */
 };
 
+/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		 during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest;	/* Busiest group in this sd */
+	struct sched_group *this;	/* Local group in this sd */
+	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_pwr;	/* Total power of all groups in sd */
+	unsigned long avg_load;	/* Average load across all groups in sd */
+
+	struct sg_lb_stats this_stat;	/* Statistics of this group */
+	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+};
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4533,10 +4519,11 @@ static inline void update_sg_lb_stats(st
 		nr_running = rq->nr_running;
 
 		/* Bias balancing toward cpus of our domain */
-		if (local_group)
+		if (local_group) {
 			load = target_load(i, load_idx);
-		else {
+		} else {
 			load = source_load(i, load_idx);
+
 			if (load > max_cpu_load)
 				max_cpu_load = load;
 			if (min_cpu_load > load)
@@ -4578,10 +4565,12 @@ static inline void update_sg_lb_stats(st
 	    (max_nr_running - min_nr_running) > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
-						SCHED_POWER_SCALE);
+	sgs->group_capacity =
+		DIV_ROUND_CLOSEST(group->sgp->power, SCHED_POWER_SCALE);
+
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(env->sd, group);
+
 	sgs->group_weight = group->group_weight;
 
 	if (sgs->group_capacity > sgs->sum_nr_running)
@@ -4606,7 +4595,7 @@ static bool update_sd_pick_busiest(struc
 				   struct sched_group *sg,
 				   struct sg_lb_stats *sgs)
 {
-	if (sgs->avg_load <= sds->max_load)
+	if (sgs->avg_load <= sds->busiest_stat.avg_load)
 		return false;
 
 	if (sgs->sum_nr_running > sgs->group_capacity)
@@ -4643,7 +4632,7 @@ static inline void update_sd_lb_stats(st
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
-	struct sg_lb_stats sgs;
+	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
@@ -4652,14 +4641,17 @@ static inline void update_sd_lb_stats(st
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
+		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
 
 		local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
-		memset(&sgs, 0, sizeof(sgs));
-		update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);
+		if (local_group) {
+			sds->this = sg;
+			sgs = &sds->this_stat;
+		}
 
-		sds->total_load += sgs.group_load;
-		sds->total_pwr += sg->sgp->power;
+		memset(sgs, 0, sizeof(*sgs));
+		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -4671,26 +4663,17 @@ static inline void update_sd_lb_stats(st
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
-			sgs.group_capacity = min(sgs.group_capacity, 1UL);
+		if (prefer_sibling && !local_group &&
+				sds->this && sds->this_stat.group_has_capacity)
+			sgs->group_capacity = min(sgs->group_capacity, 1UL);
 
-		if (local_group) {
-			sds->this_load = sgs.avg_load;
-			sds->this = sg;
-			sds->this_nr_running = sgs.sum_nr_running;
-			sds->this_load_per_task = sgs.sum_weighted_load;
-			sds->this_has_capacity = sgs.group_has_capacity;
-			sds->this_idle_cpus = sgs.idle_cpus;
-		} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
-			sds->max_load = sgs.avg_load;
+		/* Now, start updating sd_lb_stats */
+		sds->total_load += sgs->group_load;
+		sds->total_pwr += sg->sgp->power;
+
+		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
-			sds->busiest_nr_running = sgs.sum_nr_running;
-			sds->busiest_idle_cpus = sgs.idle_cpus;
-			sds->busiest_group_capacity = sgs.group_capacity;
-			sds->busiest_load_per_task = sgs.sum_weighted_load;
-			sds->busiest_has_capacity = sgs.group_has_capacity;
-			sds->busiest_group_weight = sgs.group_weight;
-			sds->group_imb = sgs.group_imb;
+			sds->busiest_stat = *sgs;
 		}
 
 		sg = sg->next;
@@ -4734,8 +4717,8 @@ static int check_asym_packing(struct lb_
 	if (env->dst_cpu > busiest_cpu)
 		return 0;
 
-	env->imbalance = DIV_ROUND_CLOSEST(
-		sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
+	env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
+				sds->busiest->sgp->power, SCHED_POWER_SCALE);
 
 	return 1;
 }
@@ -4753,24 +4736,23 @@ void fix_small_imbalance(struct lb_env *
 	unsigned long tmp, pwr_now = 0, pwr_move = 0;
 	unsigned int imbn = 2;
 	unsigned long scaled_busy_load_per_task;
+	struct sg_lb_stats *this, *busiest;
 
-	if (sds->this_nr_running) {
-		sds->this_load_per_task /= sds->this_nr_running;
-		if (sds->busiest_load_per_task >
-				sds->this_load_per_task)
-			imbn = 1;
-	} else {
-		sds->this_load_per_task =
-			cpu_avg_load_per_task(env->dst_cpu);
-	}
+	this = &sds->this_stat;
+	busiest = &sds->busiest_stat;
+
+	if (!this->sum_nr_running)
+		this->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
+	else if (busiest->load_per_task > this->load_per_task)
+		imbn = 1;
 
-	scaled_busy_load_per_task = sds->busiest_load_per_task
-					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->sgp->power;
+	scaled_busy_load_per_task =
+		(busiest->load_per_task * SCHED_POWER_SCALE) /
+		sds->busiest->sgp->power;
 
-	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
-			(scaled_busy_load_per_task * imbn)) {
-		env->imbalance = sds->busiest_load_per_task;
+	if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
+	    (scaled_busy_load_per_task * imbn)) {
+		env->imbalance = busiest->load_per_task;
 		return;
 	}
 
@@ -4781,33 +4763,36 @@ void fix_small_imbalance(struct lb_env *
 	 */
 
 	pwr_now += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load);
+			min(busiest->load_per_task, busiest->avg_load);
 	pwr_now += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load);
+			min(this->load_per_task, this->avg_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+	tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
 		sds->busiest->sgp->power;
-	if (sds->max_load > tmp)
+	if (busiest->avg_load > tmp) {
 		pwr_move += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load - tmp);
+			    min(busiest->load_per_task,
+				busiest->avg_load - tmp);
+	}
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->sgp->power <
-		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->sgp->power) /
+	if (busiest->avg_load * sds->busiest->sgp->power <
+	    busiest->load_per_task * SCHED_POWER_SCALE) {
+		tmp = (busiest->avg_load * sds->busiest->sgp->power) /
 			sds->this->sgp->power;
-	else
-		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+	} else {
+		tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
 			sds->this->sgp->power;
+	}
 	pwr_move += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load + tmp);
+			min(this->load_per_task, this->avg_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
 	if (pwr_move > pwr_now)
-		env->imbalance = sds->busiest_load_per_task;
+		env->imbalance = busiest->load_per_task;
 }
 
 /**
@@ -4819,11 +4804,22 @@ void fix_small_imbalance(struct lb_env *
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	unsigned long max_pull, load_above_capacity = ~0UL;
+	struct sg_lb_stats *this, *busiest;
 
-	sds->busiest_load_per_task /= sds->busiest_nr_running;
-	if (sds->group_imb) {
-		sds->busiest_load_per_task =
-			min(sds->busiest_load_per_task, sds->avg_load);
+	this = &sds->this_stat;
+	if (this->sum_nr_running) {
+		this->load_per_task =
+			this->sum_weighted_load / this->sum_nr_running;
+	}
+
+	busiest = &sds->busiest_stat;
+	/* busiest must have some tasks */
+	busiest->load_per_task =
+		busiest->sum_weighted_load / busiest->sum_nr_running;
+
+	if (busiest->group_imb) {
+		busiest->load_per_task =
+			min(busiest->load_per_task, sds->avg_load);
 	}
 
 	/*
@@ -4831,20 +4827,19 @@ static inline void calculate_imbalance(s
 	 * max load less than avg load(as we skip the groups at or below
 	 * its cpu_power, while calculating max_load..)
 	 */
-	if (sds->max_load < sds->avg_load) {
+	if (busiest->avg_load < sds->avg_load) {
 		env->imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
 
-	if (!sds->group_imb) {
+	if (!busiest->group_imb) {
 		/*
 		 * Don't want to pull so many tasks that a group would go idle.
 		 */
-		load_above_capacity = (sds->busiest_nr_running -
-						sds->busiest_group_capacity);
+		load_above_capacity =
+			(busiest->sum_nr_running - busiest->group_capacity);
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-
 		load_above_capacity /= sds->busiest->sgp->power;
 	}
 
@@ -4858,12 +4853,14 @@ static inline void calculate_imbalance(s
 	 * Be careful of negative numbers as they'll appear as very large values
 	 * with unsigned longs.
 	 */
-	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+	max_pull = min(busiest->avg_load - sds->avg_load,
+		       load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
-	env->imbalance = min(max_pull * sds->busiest->sgp->power,
-		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
-			/ SCHED_POWER_SCALE;
+	env->imbalance = min(
+		max_pull * sds->busiest->sgp->power,
+		(sds->avg_load - this->avg_load) * sds->this->sgp->power
+	) / SCHED_POWER_SCALE;
 
 	/*
 	 * if *imbalance is less than the average load per runnable task
@@ -4871,9 +4868,8 @@ static inline void calculate_imbalance(s
 	 * a think about bumping its value to force at least one task to be
 	 * moved
 	 */
-	if (env->imbalance < sds->busiest_load_per_task)
+	if (env->imbalance < busiest->load_per_task)
 		return fix_small_imbalance(env, sds);
-
 }
 
 /******* find_busiest_group() helpers end here *********************/
@@ -4895,9 +4891,9 @@ static inline void calculate_imbalance(s
  *		   return the least loaded group whose CPUs can be
  *		   put to idle by rebalancing its tasks onto our group.
  */
-static struct sched_group *
-find_busiest_group(struct lb_env *env)
+static struct sched_group *find_busiest_group(struct lb_env *env)
 {
+	struct sg_lb_stats *this, *busiest;
 	struct sd_lb_stats sds;
 
 	memset(&sds, 0, sizeof(sds));
@@ -4907,13 +4903,15 @@ find_busiest_group(struct lb_env *env)
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
+	this = &sds.this_stat;
+	busiest = &sds.busiest_stat;
 
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
 
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || sds.busiest_nr_running == 0)
+	if (!sds.busiest || busiest->sum_nr_running == 0)
 		goto out_balanced;
 
 	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
@@ -4923,26 +4921,26 @@ find_busiest_group(struct lb_env *env)
 	 * work because they assumes all things are equal, which typically
 	 * isn't true due to cpus_allowed constraints and the like.
 	 */
-	if (sds.group_imb)
+	if (busiest->group_imb)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
-			!sds.busiest_has_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && this->group_has_capacity &&
+	    !busiest->group_has_capacity)
 		goto force_balance;
 
 	/*
 	 * If the local group is more busy than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (sds.this_load >= sds.max_load)
+	if (this->avg_load >= busiest->avg_load)
 		goto out_balanced;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
-	if (sds.this_load >= sds.avg_load)
+	if (this->avg_load >= sds.avg_load)
 		goto out_balanced;
 
 	if (env->idle == CPU_IDLE) {
@@ -4952,15 +4950,16 @@ find_busiest_group(struct lb_env *env)
 		 * there is no imbalance between this and busiest group
 		 * wrt to idle cpu's, it is balanced.
 		 */
-		if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
-		    sds.busiest_nr_running <= sds.busiest_group_weight)
+		if ((this->idle_cpus <= busiest->idle_cpus + 1) &&
+		    busiest->sum_nr_running <= busiest->group_weight)
 			goto out_balanced;
 	} else {
 		/*
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
 		 * imbalance_pct to be conservative.
 		 */
-		if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load)
+		if (100 * busiest->avg_load <=
+				env->sd->imbalance_pct * this->avg_load)
 			goto out_balanced;
 	}
 




* [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2013-08-19 16:01 ` [PATCH 03/10] sched: Clean-up struct sd_lb_stat Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-21  2:08   ` Joonsoo Kim
  2013-08-24 10:15   ` Paul Turner
  2013-08-19 16:01 ` [PATCH 05/10] sched, fair: Remove duplicate load_per_task computations Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: peterz-more-bfg-cleanups-4.patch --]
[-- Type: text/plain, Size: 3879 bytes --]

We can shrink sg_lb_stats because rq::nr_running is an 'unsigned int'
and cpu numbers are 'int', so the corresponding counters don't need to
be 'unsigned long'.

Before:
  sgs:        /* size: 72, cachelines: 2, members: 10 */
  sds:        /* size: 184, cachelines: 3, members: 7 */

After:
  sgs:        /* size: 56, cachelines: 1, members: 10 */
  sds:        /* size: 152, cachelines: 3, members: 7 */

Further, we can avoid clearing all of sds, since we do a full
clear/assignment of sg_stats in update_sg_lb_stats(), with the exception of
busiest_stat.avg_load, which is referenced in update_sd_pick_busiest().
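
A standalone sketch of the offsetof() trick (illustrative only; the types
here are simplified stand-ins for the real sd_lb_stats/sg_lb_stats, see
the diff below for the actual code):

	#include <stddef.h>
	#include <string.h>

	struct sg_stats { unsigned long avg_load, group_load; /* ... */ };

	struct sd_stats {
		void *busiest, *local;
		unsigned long total_load, total_pwr, avg_load;
		struct sg_stats busiest_stat;	/* avg_load must start out 0 */
		struct sg_stats this_stat;	/* fully overwritten later, no clear needed */
	};

	static void init_sd_stats(struct sd_stats *sds)
	{
		/*
		 * Clear everything before busiest_stat.group_load, which
		 * includes busiest_stat.avg_load, and skip the rest.
		 */
		memset(sds, 0, offsetof(struct sd_stats, busiest_stat.group_load));
	}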

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4282,12 +4282,12 @@ static unsigned long task_h_load(struct
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
+	unsigned int sum_nr_running; /* Nr tasks running in the group */
+	unsigned int group_capacity;
+	unsigned int idle_cpus;
+	unsigned int group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
 };
@@ -4303,10 +4303,38 @@ struct sd_lb_stats {
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
-	struct sg_lb_stats this_stat;	/* Statistics of this group */
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+	struct sg_lb_stats this_stat;	/* Statistics of this group */
 };
 
+static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
+{
+	/*
+	 * struct sd_lb_stats {
+	 *	   struct sched_group *       busiest;             //     0  8
+	 *	   struct sched_group *       this;                //     8  8
+	 *	   long unsigned int          total_load;          //    16  8
+	 *	   long unsigned int          total_pwr;           //    24  8
+	 *	   long unsigned int          avg_load;            //    32  8
+	 *	   struct sg_lb_stats {
+	 *		   long unsigned int  avg_load;            //    40  8
+	 *		   long unsigned int  group_load;          //    48  8
+	 *	           ...
+	 *	   } busiest_stat;                                 //    40 56
+	 *	   struct sg_lb_stats	      this_stat;	   //    96 56
+	 *
+	 *	   // size: 152, cachelines: 3, members: 7
+	 *	   // last cacheline: 24 bytes
+	 * };
+	 *
+	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing
+	 * this_stat because update_sg_lb_stats() does a full clear/assignment.
+	 * We must however clear busiest_stat::avg_load because
+	 * update_sd_pick_busiest() reads this before assignment.
+	 */
+	memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4665,7 +4693,7 @@ static inline void update_sd_lb_stats(st
 		 */
 		if (prefer_sibling && !local_group &&
 				sds->this && sds->this_stat.group_has_capacity)
-			sgs->group_capacity = min(sgs->group_capacity, 1UL);
+			sgs->group_capacity = min(sgs->group_capacity, 1U);
 
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
@@ -4896,7 +4924,7 @@ static struct sched_group *find_busiest_
 	struct sg_lb_stats *this, *busiest;
 	struct sd_lb_stats sds;
 
-	memset(&sds, 0, sizeof(sds));
+	init_sd_lb_stats(&sds);
 
 	/*
 	 * Compute the various statistics relavent for load balancing at




* [PATCH 05/10] sched, fair: Remove duplicate load_per_task computations
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2013-08-19 16:01 ` [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-19 16:01 ` [PATCH 06/10] sched, fair: Make group power more consistent Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: peterz-more-fbg-cleanups.patch --]
[-- Type: text/plain, Size: 1542 bytes --]

Since we already compute (but don't store) the sgs load_per_task value
in update_sg_lb_stats() we might as well store it and not re-compute
it later on.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4504,7 +4504,6 @@ static inline void update_sg_lb_stats(st
 {
 	unsigned long nr_running, max_nr_running, min_nr_running;
 	unsigned long load, max_cpu_load, min_cpu_load;
-	unsigned long avg_load_per_task = 0;
 	int i;
 
 	/* Tally up the load of all CPUs in the group */
@@ -4559,9 +4558,9 @@ static inline void update_sg_lb_stats(st
 	 *      the hierarchy?
 	 */
 	if (sgs->sum_nr_running)
-		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
+	if ((max_cpu_load - min_cpu_load) >= sgs->load_per_task &&
 	    (max_nr_running - min_nr_running) > 1)
 		sgs->group_imb = 1;
 
@@ -4807,15 +4806,7 @@ static inline void calculate_imbalance(s
 	struct sg_lb_stats *this, *busiest;
 
 	this = &sds->this_stat;
-	if (this->sum_nr_running) {
-		this->load_per_task =
-			this->sum_weighted_load / this->sum_nr_running;
-	}
-
 	busiest = &sds->busiest_stat;
-	/* busiest must have some tasks */
-	busiest->load_per_task =
-		busiest->sum_weighted_load / busiest->sum_nr_running;
 
 	if (busiest->group_imb) {
 		busiest->load_per_task =




* [PATCH 06/10] sched, fair: Make group power more consistent
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2013-08-19 16:01 ` [PATCH 05/10] sched, fair: Remove duplicate load_per_task computations Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-23  3:40   ` Preeti U Murthy
  2013-08-19 16:01 ` [PATCH 07/10] sched, fair: Optimize find_busiest_queue() Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: peterz-more-bfg-cleanups-2.patch --]
[-- Type: text/plain, Size: 5436 bytes --]

For easier access, fewer dereferences and a more consistent value, store
the group power in update_sg_lb_stats() and use it thereafter. The
actual value in sched_group::sched_group_power::power can change
throughout the load-balance pass if we're unlucky.
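
The heart of the change, lifted from the diff below: snapshot the power
once so that every later computation in the pass uses the same value.

	/* Adjust by relative CPU power of the group */
	sgs->group_power = group->sgp->power;
	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;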

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4284,6 +4284,7 @@ struct sg_lb_stats {
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
+	unsigned long group_power;
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -4320,11 +4321,11 @@ static inline void init_sd_lb_stats(stru
 	 *		   long unsigned int  avg_load;            //    40  8
 	 *		   long unsigned int  group_load;          //    48  8
 	 *	           ...
-	 *	   } busiest_stat;                                 //    40 56
-	 *	   struct sg_lb_stats	      this_stat;	   //    96 56
+	 *	   } busiest_stat;                                 //    40 64
+	 *	   struct sg_lb_stats	      this_stat;	   //   104 64
 	 *
-	 *	   // size: 152, cachelines: 3, members: 7
-	 *	   // last cacheline: 24 bytes
+	 *	   // size: 168, cachelines: 3, members: 7
+	 *	   // last cacheline: 40 bytes
 	 * };
 	 *
 	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing
@@ -4574,7 +4575,8 @@ static inline void update_sg_lb_stats(st
 		update_group_power(env->sd, env->dst_cpu);
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
+	sgs->group_power = group->sgp->power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -4593,7 +4595,7 @@ static inline void update_sg_lb_stats(st
 		sgs->group_imb = 1;
 
 	sgs->group_capacity =
-		DIV_ROUND_CLOSEST(group->sgp->power, SCHED_POWER_SCALE);
+		DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
 
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(env->sd, group);
@@ -4696,7 +4698,7 @@ static inline void update_sd_lb_stats(st
 
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
-		sds->total_pwr += sg->sgp->power;
+		sds->total_pwr += sgs->group_power;
 
 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -4744,8 +4746,9 @@ static int check_asym_packing(struct lb_
 	if (env->dst_cpu > busiest_cpu)
 		return 0;
 
-	env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
-				sds->busiest->sgp->power, SCHED_POWER_SCALE);
+	env->imbalance = DIV_ROUND_CLOSEST(
+		sds->busiest_stat.avg_load * sds->busiest_stat.group_power,
+		SCHED_POWER_SCALE);
 
 	return 1;
 }
@@ -4775,7 +4778,7 @@ void fix_small_imbalance(struct lb_env *
 
 	scaled_busy_load_per_task =
 		(busiest->load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->sgp->power;
+		busiest->group_power;
 
 	if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
 	    (scaled_busy_load_per_task * imbn)) {
@@ -4789,32 +4792,32 @@ void fix_small_imbalance(struct lb_env *
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->sgp->power *
+	pwr_now += busiest->group_power *
 			min(busiest->load_per_task, busiest->avg_load);
-	pwr_now += sds->this->sgp->power *
+	pwr_now += this->group_power *
 			min(this->load_per_task, this->avg_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
 	tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->sgp->power;
+		busiest->group_power;
 	if (busiest->avg_load > tmp) {
-		pwr_move += sds->busiest->sgp->power *
+		pwr_move += busiest->group_power *
 			    min(busiest->load_per_task,
 				busiest->avg_load - tmp);
 	}
 
 	/* Amount of load we'd add */
-	if (busiest->avg_load * sds->busiest->sgp->power <
+	if (busiest->avg_load * busiest->group_power <
 	    busiest->load_per_task * SCHED_POWER_SCALE) {
-		tmp = (busiest->avg_load * sds->busiest->sgp->power) /
-			sds->this->sgp->power;
+		tmp = (busiest->avg_load * busiest->group_power) /
+		      this->group_power;
 	} else {
 		tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
-			sds->this->sgp->power;
+		      this->group_power;
 	}
-	pwr_move += sds->this->sgp->power *
-			min(this->load_per_task, this->avg_load + tmp);
+	pwr_move += this->group_power *
+		    min(this->load_per_task, this->avg_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
@@ -4859,7 +4862,7 @@ static inline void calculate_imbalance(s
 			(busiest->sum_nr_running - busiest->group_capacity);
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-		load_above_capacity /= sds->busiest->sgp->power;
+		load_above_capacity /= busiest->group_power;
 	}
 
 	/*
@@ -4877,8 +4880,8 @@ static inline void calculate_imbalance(s
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
-		max_pull * sds->busiest->sgp->power,
-		(sds->avg_load - this->avg_load) * sds->this->sgp->power
+		max_pull * busiest->group_power,
+		(sds->avg_load - this->avg_load) * this->group_power
 	) / SCHED_POWER_SCALE;
 
 	/*




* [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2013-08-19 16:01 ` [PATCH 06/10] sched, fair: Make group power more consistent Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-23  8:11   ` Preeti U Murthy
  2013-08-24 10:33   ` Paul Turner
  2013-08-19 16:01 ` [PATCH 08/10] sched, fair: Rework and comment the group_imb code Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: peterz-more-bfg-cleanups-3.patch --]
[-- Type: text/plain, Size: 911 bytes --]

Use for_each_cpu_and() and thereby avoid computing the capacity for
CPUs we know we're not interested in.
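
Condensed before/after, trimmed from the diff below for illustration:

	/* before: visit every cpu in the group, compute, then filter */
	for_each_cpu(i, sched_group_cpus(group)) {
		unsigned long power = power_of(i);	/* computed ... */
		...
		if (!cpumask_test_cpu(i, env->cpus))
			continue;			/* ... then discarded */
		...
	}

	/* after: only iterate the cpus we actually care about */
	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
		unsigned long power = power_of(i);
		...
	}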

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4977,7 +4977,7 @@ static struct rq *find_busiest_queue(str
 	unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
 	int i;
 
-	for_each_cpu(i, sched_group_cpus(group)) {
+	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		unsigned long power = power_of(i);
 		unsigned long capacity = DIV_ROUND_CLOSEST(power,
 							   SCHED_POWER_SCALE);
@@ -4986,9 +4986,6 @@ static struct rq *find_busiest_queue(str
 		if (!capacity)
 			capacity = fix_small_capacity(env->sd, group);
 
-		if (!cpumask_test_cpu(i, env->cpus))
-			continue;
-
 		rq = cpu_rq(i);
 		wl = weighted_cpuload(i);
 




* [PATCH 08/10] sched, fair: Rework and comment the group_imb code
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2013-08-19 16:01 ` [PATCH 07/10] sched, fair: Optimize find_busiest_queue() Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-19 16:01 ` [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel, Peter Zijlstra

[-- Attachment #1: peterz-group_imb-frubbery.patch --]
[-- Type: text/plain, Size: 6882 bytes --]

Rik reported some weirdness due to the group_imb code. As a start to
looking at it, clean it up a little and add a few explanatory
comments.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |  123 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 89 insertions(+), 34 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4491,6 +4491,81 @@ fix_small_capacity(struct sched_domain *
 	return 0;
 }
 
+/*
+ * Group imbalance indicates (and tries to solve) the problem where balancing
+ * groups is inadequate due to tsk_cpus_allowed() constraints.
+ *
+ * Imagine a situation of two groups of 4 cpus each and 4 tasks each with a
+ * cpumask covering 1 cpu of the first group and 3 cpus of the second group.
+ * Something like:
+ *
+ * 	{ 0 1 2 3 } { 4 5 6 7 }
+ * 	        *     * * *
+ *
+ * If we were to balance group-wise we'd place two tasks in the first group and
+ * two tasks in the second group. Clearly this is undesired as it will overload
+ * cpu 3 and leave one of the cpus in the second group unused.
+ *
+ * The current solution to this issue is detecting the skew in the first group
+ * by noticing it has a cpu that is overloaded while the remaining cpus are
+ * idle -- or rather, there's a distinct imbalance in the cpus; see
+ * sg_imbalanced().
+ *
+ * When this is so detected; this group becomes a candidate for busiest; see
+ * update_sd_pick_busiest(). And calculate_imbalance() and
+ * find_busiest_group() avoid some of the usual balance conditional to allow it
+ * to create an effective group imbalance.
+ *
+ * This is a somewhat tricky proposition since the next run might not find the
+ * group imbalance and decide the groups need to be balanced again. A most
+ * subtle and fragile situation.
+ */
+
+struct sg_imb_stats {
+	unsigned long max_nr_running, min_nr_running;
+	unsigned long max_cpu_load, min_cpu_load;
+};
+
+static inline void init_sg_imb_stats(struct sg_imb_stats *sgi)
+{
+	sgi->max_cpu_load = sgi->max_nr_running = 0UL;
+	sgi->min_cpu_load = sgi->min_nr_running = ~0UL;
+}
+
+static inline void
+update_sg_imb_stats(struct sg_imb_stats *sgi,
+		    unsigned long load, unsigned long nr_running)
+{
+	if (load > sgi->max_cpu_load)
+		sgi->max_cpu_load = load;
+	if (sgi->min_cpu_load > load)
+		sgi->min_cpu_load = load;
+
+	if (nr_running > sgi->max_nr_running)
+		sgi->max_nr_running = nr_running;
+	if (sgi->min_nr_running > nr_running)
+		sgi->min_nr_running = nr_running;
+}
+
+static inline int
+sg_imbalanced(struct sg_lb_stats *sgs, struct sg_imb_stats *sgi)
+{
+	/*
+	 * Consider the group unbalanced when the imbalance is larger
+	 * than the average weight of a task.
+	 *
+	 * APZ: with cgroup the avg task weight can vary wildly and
+	 *      might not be a suitable number - should we keep a
+	 *      normalized nr_running number somewhere that negates
+	 *      the hierarchy?
+	 */
+	if ((sgi->max_cpu_load - sgi->min_cpu_load) >= sgs->load_per_task &&
+	    (sgi->max_nr_running - sgi->min_nr_running) > 1)
+		return 1;
+
+	return 0;
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -4503,15 +4578,12 @@ static inline void update_sg_lb_stats(st
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs)
 {
-	unsigned long nr_running, max_nr_running, min_nr_running;
-	unsigned long load, max_cpu_load, min_cpu_load;
+	struct sg_imb_stats sgi;
+	unsigned long nr_running;
+	unsigned long load;
 	int i;
 
-	/* Tally up the load of all CPUs in the group */
-	max_cpu_load = 0;
-	min_cpu_load = ~0UL;
-	max_nr_running = 0;
-	min_nr_running = ~0UL;
+	init_sg_imb_stats(&sgi);
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -4523,16 +4595,7 @@ static inline void update_sg_lb_stats(st
 			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
-
-			if (load > max_cpu_load)
-				max_cpu_load = load;
-			if (min_cpu_load > load)
-				min_cpu_load = load;
-
-			if (nr_running > max_nr_running)
-				max_nr_running = nr_running;
-			if (min_nr_running > nr_running)
-				min_nr_running = nr_running;
+			update_sg_imb_stats(&sgi, load, nr_running);
 		}
 
 		sgs->group_load += load;
@@ -4550,21 +4613,10 @@ static inline void update_sg_lb_stats(st
 	sgs->group_power = group->sgp->power;
 	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;
 
-	/*
-	 * Consider the group unbalanced when the imbalance is larger
-	 * than the average weight of a task.
-	 *
-	 * APZ: with cgroup the avg task weight can vary wildly and
-	 *      might not be a suitable number - should we keep a
-	 *      normalized nr_running number somewhere that negates
-	 *      the hierarchy?
-	 */
 	if (sgs->sum_nr_running)
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) >= sgs->load_per_task &&
-	    (max_nr_running - min_nr_running) > 1)
-		sgs->group_imb = 1;
+	sgs->group_imb = sg_imbalanced(sgs, &sgi);
 
 	sgs->group_capacity =
 		DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
@@ -4812,6 +4864,10 @@ static inline void calculate_imbalance(s
 	busiest = &sds->busiest_stat;
 
 	if (busiest->group_imb) {
+		/*
+		 * In the group_imb case we cannot rely on group-wide averages
+		 * to ensure cpu-load equilibrium, look at wider averages. XXX
+		 */
 		busiest->load_per_task =
 			min(busiest->load_per_task, sds->avg_load);
 	}
@@ -4829,6 +4885,8 @@ static inline void calculate_imbalance(s
 	if (!busiest->group_imb) {
 		/*
 		 * Don't want to pull so many tasks that a group would go idle.
+		 * Except of course for the group_imb case, since then we might
+		 * have to drop below capacity to reach cpu-load equilibrium.
 		 */
 		load_above_capacity =
 			(busiest->sum_nr_running - busiest->group_capacity);
@@ -4844,11 +4902,8 @@ static inline void calculate_imbalance(s
 	 * we also don't want to reduce the group load below the group capacity
 	 * (so that we can implement power-savings policies etc). Thus we look
 	 * for the minimum possible imbalance.
-	 * Be careful of negative numbers as they'll appear as very large values
-	 * with unsigned longs.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load,
-		       load_above_capacity);
+	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
@@ -4912,7 +4967,7 @@ static struct sched_group *find_busiest_
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
-	 * work because they assumes all things are equal, which typically
+	 * work because they assume all things are equal, which typically
 	 * isn't true due to cpus_allowed constraints and the like.
 	 */
 	if (busiest->group_imb)




* [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2013-08-19 16:01 ` [PATCH 08/10] sched, fair: Rework and comment the group_imb code Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-24 10:45   ` Paul Turner
  2013-08-19 16:01 ` [RFC][PATCH 10/10] sched, fair: Rewrite group_imb trigger Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim, Peter Zijlstra

[-- Attachment #1: peterz-fix-sd_degenerate.patch --]
[-- Type: text/plain, Size: 2587 bytes --]

I found that on my wsm box I had a redundant domain:

[    0.949769] CPU0 attaching sched-domain:
[    0.953765]  domain 0: span 0,12 level SIBLING
[    0.958335]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[    0.964548]   domain 1: span 0-5,12-17 level MC
[    0.969206]    groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[    0.984993]    domain 2: span 0-5,12-17 level CPU
[    0.989822]     groups: 0-5,12-17 (cpu_power = 7055)
[    0.995049]     domain 3: span 0-23 level NUMA
[    0.999620]      groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)

Note how domain 2 has only a single group and spans the same CPUs as
domain 1. We should not keep such domains and do in fact have code to
prune these.

It turns out that the 'new' SD_PREFER_SIBLING flag causes this: it
makes sd_parent_degenerate() fail on the CPU domain. We can easily fix
this by 'ignoring' the SD_PREFER_SIBLING bit and transferring it to
whatever domain ends up covering the span.

With this patch the domains now look like this:

[    0.950419] CPU0 attaching sched-domain:
[    0.954454]  domain 0: span 0,12 level SIBLING
[    0.959039]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[    0.965271]   domain 1: span 0-5,12-17 level MC
[    0.969936]    groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[    0.985737]    domain 2: span 0-23 level NUMA
[    0.990231]     groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4948,7 +4948,8 @@ sd_parent_degenerate(struct sched_domain
 				SD_BALANCE_FORK |
 				SD_BALANCE_EXEC |
 				SD_SHARE_CPUPOWER |
-				SD_SHARE_PKG_RESOURCES);
+				SD_SHARE_PKG_RESOURCES |
+				SD_PREFER_SIBLING);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -5157,6 +5158,13 @@ cpu_attach_domain(struct sched_domain *s
 			tmp->parent = parent->parent;
 			if (parent->parent)
 				parent->parent->child = tmp;
+			/*
+			 * Transfer SD_PREFER_SIBLING down in case of a
+			 * degenerate parent; the spans match for this
+			 * so the property transfers.
+			 */
+			if (parent->flags & SD_PREFER_SIBLING)
+				tmp->flags |= SD_PREFER_SIBLING;
 			destroy_sched_domain(parent, cpu);
 		} else
 			tmp = tmp->parent;



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC][PATCH 10/10] sched, fair: Rewrite group_imb trigger
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2013-08-19 16:01 ` [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code Peter Zijlstra
@ 2013-08-19 16:01 ` Peter Zijlstra
  2013-08-21  2:09 ` [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Joonsoo Kim
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-19 16:01 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel, Peter Zijlstra

[-- Attachment #1: peterz-group_imb-fixes.patch --]
[-- Type: text/plain, Size: 8011 bytes --]

Change the group_imb detection from the old 'load-spike' detector to
an actual imbalance detector. We set it from the lower domain balance
pass when it fails to create a balance in the presence of task
affinities.

The advantage is that this should no longer produce the false-positive
group_imb conditions generated by transient load spikes from the normal
balancing/bulk-wakeup etc. behaviour.

While I haven't actually observed those, they could happen.

I'm not entirely happy with this patch; it somehow feels a little
fragile.

Nor does it solve the biggest issue I have with the group_imb code; it
is still a fragile construct in that once we 'fixed' the imbalance
we'll not detect the group_imb again and could end up re-creating it.

That said, this patch does seem to preserve behaviour for the
described degenerate case. In particular on my 2*6*2 wsm-ep:

  taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'

ends up with 9 spinners, each on their own CPU; whereas if you disable
the group_imb code, that typically doesn't happen (you'll get one pair
sharing a CPU most of the time).
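
For reference, a minimal stand-alone sketch of the set/clear protocol this
adds between the lower balance pass and the parent's group statistics; names
are simplified, the real flag is the sched_group_power::imbalance field added
below:

#include <stdio.h>

struct group_state { int imbalance; };	/* stand-in for the shared per-group flag */

/* Lower-domain balance pass: flag the parent's group when affinity kept us
 * from moving enough load, clear it again once balance is reachable. */
static void lower_balance_pass(struct group_state *parent_group,
                               int some_pinned, long imbalance_left)
{
        if (some_pinned && imbalance_left > 0)
                parent_group->imbalance = 1;
        else if (parent_group->imbalance)
                parent_group->imbalance = 0;
}

/* Parent-domain statistics simply read the flag. */
static int sg_imbalanced(struct group_state *g)
{
        return g->imbalance;
}

int main(void)
{
        struct group_state g = { 0 };

        lower_balance_pass(&g, 1, 512);		/* pinned tasks, load left over */
        printf("group_imb: %d\n", sg_imbalanced(&g));
        lower_balance_pass(&g, 0, 0);		/* balanced again */
        printf("group_imb: %d\n", sg_imbalanced(&g));
        return 0;
}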

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c  |   90 +++++++++++++++++----------------------------------
 kernel/sched/sched.h |    1 
 2 files changed, 31 insertions(+), 60 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3906,7 +3906,8 @@ static unsigned long __read_mostly max_l
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_DST_PINNED  0x04
+#define LBF_SOME_PINNED	0x08
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3997,6 +3998,8 @@ int can_migrate_task(struct task_struct
 
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
 
+		env->flags |= LBF_SOME_PINNED;
+
 		/*
 		 * Remember if this task can be migrated to any other cpu in
 		 * our sched_group. We may want to revisit it if we couldn't
@@ -4005,13 +4008,13 @@ int can_migrate_task(struct task_struct
 		 * Also avoid computing new_dst_cpu if we have already computed
 		 * one in current iteration.
 		 */
-		if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+		if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
 			return 0;
 
 		/* Prevent to re-select dst_cpu via env's cpus */
 		for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
 			if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
-				env->flags |= LBF_SOME_PINNED;
+				env->flags |= LBF_DST_PINNED;
 				env->new_dst_cpu = cpu;
 				break;
 			}
@@ -4535,13 +4538,12 @@ fix_small_capacity(struct sched_domain *
  * cpu 3 and leave one of the cpus in the second group unused.
  *
  * The current solution to this issue is detecting the skew in the first group
- * by noticing it has a cpu that is overloaded while the remaining cpus are
- * idle -- or rather, there's a distinct imbalance in the cpus; see
- * sg_imbalanced().
+ * by noticing the lower domain failed to reach balance and had difficulty
+ * moving tasks due to affinity constraints.
  *
  * When this is so detected; this group becomes a candidate for busiest; see
  * update_sd_pick_busiest(). And calculcate_imbalance() and
- * find_busiest_group() avoid some of the usual balance conditional to allow it
+ * find_busiest_group() avoid some of the usual balance conditions to allow it
  * to create an effective group imbalance.
  *
  * This is a somewhat tricky proposition since the next run might not find the
@@ -4549,49 +4551,9 @@ fix_small_capacity(struct sched_domain *
  * subtle and fragile situation.
  */
 
-struct sg_imb_stats {
-	unsigned long max_nr_running, min_nr_running;
-	unsigned long max_cpu_load, min_cpu_load;
-};
-
-static inline void init_sg_imb_stats(struct sg_imb_stats *sgi)
+static inline int sg_imbalanced(struct sched_group *group)
 {
-	sgi->max_cpu_load = sgi->max_nr_running = 0UL;
-	sgi->min_cpu_load = sgi->min_nr_running = ~0UL;
-}
-
-static inline void
-update_sg_imb_stats(struct sg_imb_stats *sgi,
-		    unsigned long load, unsigned long nr_running)
-{
-	if (load > sgi->max_cpu_load)
-		sgi->max_cpu_load = load;
-	if (sgi->min_cpu_load > load)
-		sgi->min_cpu_load = load;
-
-	if (nr_running > sgi->max_nr_running)
-		sgi->max_nr_running = nr_running;
-	if (sgi->min_nr_running > nr_running)
-		sgi->min_nr_running = nr_running;
-}
-
-static inline int
-sg_imbalanced(struct sg_lb_stats *sgs, struct sg_imb_stats *sgi)
-{
-	/*
-	 * Consider the group unbalanced when the imbalance is larger
-	 * than the average weight of a task.
-	 *
-	 * APZ: with cgroup the avg task weight can vary wildly and
-	 *      might not be a suitable number - should we keep a
-	 *      normalized nr_running number somewhere that negates
-	 *      the hierarchy?
-	 */
-	if ((sgi->max_cpu_load - sgi->min_cpu_load) >= sgs->load_per_task &&
-	    (sgi->max_nr_running - sgi->min_nr_running) > 1)
-		return 1;
-
-	return 0;
+	return group->sgp->imbalance;
 }
 
 /**
@@ -4606,25 +4568,20 @@ static inline void update_sg_lb_stats(st
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs)
 {
-	struct sg_imb_stats sgi;
 	unsigned long nr_running;
 	unsigned long load;
 	int i;
 
-	init_sg_imb_stats(&sgi);
-
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
 		nr_running = rq->nr_running;
 
 		/* Bias balancing toward cpus of our domain */
-		if (local_group) {
+		if (local_group)
 			load = target_load(i, load_idx);
-		} else {
+		else
 			load = source_load(i, load_idx);
-			update_sg_imb_stats(&sgi, load, nr_running);
-		}
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
@@ -4644,7 +4601,7 @@ static inline void update_sg_lb_stats(st
 	if (sgs->sum_nr_running)
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	sgs->group_imb = sg_imbalanced(sgs, &sgi);
+	sgs->group_imb = sg_imbalanced(group);
 
 	sgs->group_capacity =
 		DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
@@ -5167,6 +5124,7 @@ static int load_balance(int this_cpu, st
 			int *should_balance)
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
+	struct sched_domain *sd_parent = sd->parent;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -5269,11 +5227,11 @@ static int load_balance(int this_cpu, st
 		 * moreover subsequent load balance cycles should correct the
 		 * excess load moved.
 		 */
-		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+		if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {
 
 			env.dst_rq	 = cpu_rq(env.new_dst_cpu);
 			env.dst_cpu	 = env.new_dst_cpu;
-			env.flags	&= ~LBF_SOME_PINNED;
+			env.flags	&= ~LBF_DST_PINNED;
 			env.loop	 = 0;
 			env.loop_break	 = sched_nr_migrate_break;
 
@@ -5287,6 +5245,18 @@ static int load_balance(int this_cpu, st
 			goto more_balance;
 		}
 
+		/*
+		 * We failed to reach balance because of affinity.
+		 */
+		if (sd_parent) {
+			int *group_imbalance = &sd_parent->groups->sgp->imbalance;
+
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+				*group_imbalance = 1;
+			} else if (*group_imbalance)
+				*group_imbalance = 0;
+		}
+
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {
 			cpumask_clear_cpu(cpu_of(busiest), cpus);
@@ -5690,7 +5660,7 @@ static void rebalance_domains(int cpu, e
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
 			if (load_balance(cpu, rq, sd, idle, &should_balance)) {
 				/*
-				 * The LBF_SOME_PINNED logic could have changed
+				 * The LBF_DST_PINNED logic could have changed
 				 * env->dst_cpu, so we can't know our idle
 				 * state even if we migrated tasks. Update it.
 				 */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -605,6 +605,7 @@ struct sched_group_power {
 	 */
 	unsigned int power, power_orig;
 	unsigned long next_update;
+	int imbalance; /* XXX unrelated to power but shared group state */
 	/*
 	 * Number of busy cpus in this group.
 	 */



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games
  2013-08-19 16:01 ` [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games Peter Zijlstra
@ 2013-08-21  2:08   ` Joonsoo Kim
  2013-08-21  2:20     ` Joonsoo Kim
  2013-08-21  8:35     ` Peter Zijlstra
  2013-08-24 10:15   ` Paul Turner
  1 sibling, 2 replies; 47+ messages in thread
From: Joonsoo Kim @ 2013-08-21  2:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel

On Mon, Aug 19, 2013 at 06:01:02PM +0200, Peter Zijlstra wrote:
> +static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> +{
> +	/*
> +	 * struct sd_lb_stats {
> +	 *	   struct sched_group *       busiest;             //     0  8
> +	 *	   struct sched_group *       this;                //     8  8
> +	 *	   long unsigned int          total_load;          //    16  8
> +	 *	   long unsigned int          total_pwr;           //    24  8
> +	 *	   long unsigned int          avg_load;            //    32  8
> +	 *	   struct sg_lb_stats {
> +	 *		   long unsigned int  avg_load;            //    40  8
> +	 *		   long unsigned int  group_load;          //    48  8
> +	 *	           ...
> +	 *	   } busiest_stat;                                 //    40 56
> +	 *	   struct sg_lb_stats	      this_stat;	   //    96 56
> +	 *
> +	 *	   // size: 152, cachelines: 3, members: 7
> +	 *	   // last cacheline: 24 bytes
> +	 * };
> +	 *
> +	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing
> +	 * this_stat because update_sg_lb_stats() does a full clear/assignment.
> +	 * We must however clear busiest_stat::avg_load because
> +	 * update_sd_pick_busiest() reads this before assignment.
> +	 */
> +	memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));

Hello, Peter.

IMHO, this is so tricky.
With below change, we can simply do 'offsetof(struct sd_lb_stats, busiest_stat)'.

@@ -4546,7 +4546,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
                                   struct sched_group *sg,
                                   struct sg_lb_stats *sgs)
 {
-       if (sgs->avg_load <= sds->busiest_stat.avg_load)
+       if (!sds->busiest && sgs->avg_load <= sds->busiest_stat.avg_load)
                return false;
 
        if (sgs->sum_nr_running > sgs->group_capacity)

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] Various load-balance cleanups/optimizations -v2
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2013-08-19 16:01 ` [RFC][PATCH 10/10] sched, fair: Rewrite group_imb trigger Peter Zijlstra
@ 2013-08-21  2:09 ` Joonsoo Kim
  2013-08-28  8:55 ` [RFC][PATCH 11/10] sched, fair: Reduce local_group logic Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: Joonsoo Kim @ 2013-08-21  2:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel

On Mon, Aug 19, 2013 at 06:00:58PM +0200, Peter Zijlstra wrote:
> 
> After poking at them a little more I feel somewhat more confident.
> 
> I found one more bug, but this one was my own fault, we should also clear
> sds->busiest_stat.avg_load because update_sd_pick_busiest() reads that before
> we set it.
> 
> Moved the memset optimization and the fix for that into a separate patch.
> 
> Other than that, there's a sched_domain degenerate fix; and a new way to detect
> group_imb (which relies on the sched_domain fix).

I'm fine with your changes to my patches.

Thanks!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games
  2013-08-21  2:08   ` Joonsoo Kim
@ 2013-08-21  2:20     ` Joonsoo Kim
  2013-08-21  8:38       ` Peter Zijlstra
  2013-08-21  8:35     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Joonsoo Kim @ 2013-08-21  2:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel

On Wed, Aug 21, 2013 at 11:08:29AM +0900, Joonsoo Kim wrote:
> On Mon, Aug 19, 2013 at 06:01:02PM +0200, Peter Zijlstra wrote:
> > +static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> > +{
> > +	/*
> > +	 * struct sd_lb_stats {
> > +	 *	   struct sched_group *       busiest;             //     0  8
> > +	 *	   struct sched_group *       this;                //     8  8
> > +	 *	   long unsigned int          total_load;          //    16  8
> > +	 *	   long unsigned int          total_pwr;           //    24  8
> > +	 *	   long unsigned int          avg_load;            //    32  8
> > +	 *	   struct sg_lb_stats {
> > +	 *		   long unsigned int  avg_load;            //    40  8
> > +	 *		   long unsigned int  group_load;          //    48  8
> > +	 *	           ...
> > +	 *	   } busiest_stat;                                 //    40 56
> > +	 *	   struct sg_lb_stats	      this_stat;	   //    96 56
> > +	 *
> > +	 *	   // size: 152, cachelines: 3, members: 7
> > +	 *	   // last cacheline: 24 bytes
> > +	 * };
> > +	 *
> > +	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing
> > +	 * this_stat because update_sg_lb_stats() does a full clear/assignment.
> > +	 * We must however clear busiest_stat::avg_load because
> > +	 * update_sd_pick_busiest() reads this before assignment.
> > +	 */
> > +	memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));
> 
> Hello, Peter.
> 
> IMHO, this is so tricky.
> With below change, we can simply do 'offsetof(struct sd_lb_stats, busiest_stat)'.
> 
> @@ -4546,7 +4546,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>                                    struct sched_group *sg,
>                                    struct sg_lb_stats *sgs)
>  {
> -       if (sgs->avg_load <= sds->busiest_stat.avg_load)
> +       if (!sds->busiest && sgs->avg_load <= sds->busiest_stat.avg_load)
>                 return false;
>  
>         if (sgs->sum_nr_running > sgs->group_capacity)
> 

Sorry, instead of !sds->busiest, it should be sds->busiest. :)

Thanks.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games
  2013-08-21  2:08   ` Joonsoo Kim
  2013-08-21  2:20     ` Joonsoo Kim
@ 2013-08-21  8:35     ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-21  8:35 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel

On Wed, Aug 21, 2013 at 11:08:29AM +0900, Joonsoo Kim wrote:
> On Mon, Aug 19, 2013 at 06:01:02PM +0200, Peter Zijlstra wrote:
> > +static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> > +{
> > +	/*
> > +	 * struct sd_lb_stats {
> > +	 *	   struct sched_group *       busiest;             //     0  8
> > +	 *	   struct sched_group *       this;                //     8  8
> > +	 *	   long unsigned int          total_load;          //    16  8
> > +	 *	   long unsigned int          total_pwr;           //    24  8
> > +	 *	   long unsigned int          avg_load;            //    32  8
> > +	 *	   struct sg_lb_stats {
> > +	 *		   long unsigned int  avg_load;            //    40  8
> > +	 *		   long unsigned int  group_load;          //    48  8
> > +	 *	           ...
> > +	 *	   } busiest_stat;                                 //    40 56
> > +	 *	   struct sg_lb_stats	      this_stat;	   //    96 56
> > +	 *
> > +	 *	   // size: 152, cachelines: 3, members: 7
> > +	 *	   // last cacheline: 24 bytes
> > +	 * };
> > +	 *
> > +	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing
> > +	 * this_stat because update_sg_lb_stats() does a full clear/assignment.
> > +	 * We must however clear busiest_stat::avg_load because
> > +	 * update_sd_pick_busiest() reads this before assignment.
> > +	 */
> > +	memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));
> 
> Hello, Peter.
> 
> IMHO, this is so tricky.

Agreed, hence the somewhat elaborate comment :/

> With below change, we can simply do 'offsetof(struct sd_lb_stats, busiest_stat)'.
> 
> @@ -4546,7 +4546,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>                                    struct sched_group *sg,
>                                    struct sg_lb_stats *sgs)
>  {
> -       if (sgs->avg_load <= sds->busiest_stat.avg_load)
> +       if (!sds->busiest && sgs->avg_load <= sds->busiest_stat.avg_load)
>                 return false;
>  
>         if (sgs->sum_nr_running > sgs->group_capacity)

Indeed, thanks!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games
  2013-08-21  2:20     ` Joonsoo Kim
@ 2013-08-21  8:38       ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-21  8:38 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel

On Wed, Aug 21, 2013 at 11:20:02AM +0900, Joonsoo Kim wrote:
> > With below change, we can simply do 'offsetof(struct sd_lb_stats, busiest_stat)'.
> > 
> > @@ -4546,7 +4546,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >                                    struct sched_group *sg,
> >                                    struct sg_lb_stats *sgs)
> >  {
> > -       if (sgs->avg_load <= sds->busiest_stat.avg_load)
> > +       if (!sds->busiest && sgs->avg_load <= sds->busiest_stat.avg_load)
> >                 return false;
> >  
> >         if (sgs->sum_nr_running > sgs->group_capacity)
> > 
> 
> Sorry, instead of !sds->busiest, it should be sds->busiest. :)

Of course ;-) However I'm not sure which I prefer: adding this extra
condition or having the initialization extra tricky. I'll sit on it a
little longer.
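
For reference, a small user-space sketch of the two clearing variants being
weighed here; the struct layout is abbreviated from the pahole output quoted
above, it is not the kernel code itself:

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Abbreviated stand-ins for the structures quoted above. */
struct sg_lb_stats {
        unsigned long avg_load;
        unsigned long group_load;
        unsigned long sum_nr_running;
};

struct sd_lb_stats {
        void *busiest;
        void *this;
        unsigned long total_load;
        unsigned long total_pwr;
        unsigned long avg_load;
        struct sg_lb_stats busiest_stat;
        struct sg_lb_stats this_stat;
};

int main(void)
{
        struct sd_lb_stats sds;

        /* Current patch: also clear busiest_stat.avg_load because
         * update_sd_pick_busiest() reads it before it is assigned. */
        memset(&sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));

        /* Joonsoo's alternative: only clear up to busiest_stat, provided the
         * sgs->avg_load comparison is guarded by a sds->busiest check. */
        memset(&sds, 0, offsetof(struct sd_lb_stats, busiest_stat));

        printf("variant 1 clears %zu bytes, variant 2 clears %zu of %zu\n",
               offsetof(struct sd_lb_stats, busiest_stat.group_load),
               offsetof(struct sd_lb_stats, busiest_stat),
               sizeof(sds));
        return 0;
}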

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/10] sched: Remove one division operation in find_busiest_queue()
  2013-08-19 16:00 ` [PATCH 01/10] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
@ 2013-08-22  8:58   ` Paul Turner
  2013-08-22 10:25     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2013-08-22  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel

On Mon, Aug 19, 2013 at 9:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> Remove one division operation in find_busiest_queue() by using
> crosswise multiplication:
>
>         wl_i / power_i > wl_j / power_j :=
>         wl_i * power_j > wl_j * power_i
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> [peterz: expanded changelog]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1375778203-31343-2-git-send-email-iamjoonsoo.kim@lge.com
> ---
>  kernel/sched/fair.c |    9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5018,7 +5018,7 @@ static struct rq *find_busiest_queue(str
>                                      struct sched_group *group)
>  {
>         struct rq *busiest = NULL, *rq;
> -       unsigned long max_load = 0;
> +       unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;

Initializing this to SCHED_POWER_SCALE assigns a meaning that isn't
really there.  How about just 1?

>         int i;
>
>         for_each_cpu(i, sched_group_cpus(group)) {
> @@ -5049,10 +5049,9 @@ static struct rq *find_busiest_queue(str
>                  * the load can be moved away from the cpu that is potentially
>                  * running at a lower capacity.
>                  */
> -               wl = (wl * SCHED_POWER_SCALE) / power;
> -
> -               if (wl > max_load) {
> -                       max_load = wl;

A comment wouldn't hurt here.

> +               if (wl * busiest_power > busiest_load * power) {
> +                       busiest_load = wl;
> +                       busiest_power = power;
>                         busiest = rq;
>                 }
>         }
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] sched: Factor out code to should_we_balance()
  2013-08-19 16:01 ` [PATCH 02/10] sched: Factor out code to should_we_balance() Peter Zijlstra
@ 2013-08-22  9:58   ` Paul Turner
  2013-08-22 10:42     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2013-08-22  9:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> Now checking whether this cpu is appropriate to balance or not
> is embedded into update_sg_lb_stats() and this checking has no direct
> relationship to this function. There is not enough reason to place
> this checking at update_sg_lb_stats(), except saving one iteration
> for sched_group_cpus.
>
> In this patch, I factor out this checking to should_we_balance() function.
> And before doing actual work for load_balancing, check whether this cpu is
> appropriate to balance via should_we_balance(). If this cpu is not
> a candidate for balancing, it quits the work immediately.
>
> With this change, we can save two memset costs and can expect better
> compiler optimization.
>
> Below is result of this patch.
>
> * Vanilla *
>    text    data     bss     dec     hex filename
>   34499    1136     116   35751    8ba7 kernel/sched/fair.o
>
> * Patched *
>    text    data     bss     dec     hex filename
>   34243    1136     116   35495    8aa7 kernel/sched/fair.o
>
> In addition, rename @balance to @should_balance in order to represent
> its purpose more clearly.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> [peterz: style changes and a fix in should_we_balance() ]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1375778203-31343-3-git-send-email-iamjoonsoo.kim@lge.com
> ---
>  kernel/sched/fair.c |  111 +++++++++++++++++++++++++---------------------------
>  1 file changed, 54 insertions(+), 57 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4510,22 +4510,17 @@ fix_small_capacity(struct sched_domain *
>   * @group: sched_group whose statistics are to be updated.
>   * @load_idx: Load index of sched_domain of this_cpu for load calc.
>   * @local_group: Does group contain this_cpu.
> - * @balance: Should we balance.
>   * @sgs: variable to hold the statistics for this group.
>   */
>  static inline void update_sg_lb_stats(struct lb_env *env,
>                         struct sched_group *group, int load_idx,
> -                       int local_group, int *balance, struct sg_lb_stats *sgs)
> +                       int local_group, struct sg_lb_stats *sgs)
>  {
>         unsigned long nr_running, max_nr_running, min_nr_running;
>         unsigned long load, max_cpu_load, min_cpu_load;
> -       unsigned int balance_cpu = -1, first_idle_cpu = 0;
>         unsigned long avg_load_per_task = 0;
>         int i;
>
> -       if (local_group)
> -               balance_cpu = group_balance_cpu(group);
> -
>         /* Tally up the load of all CPUs in the group */
>         max_cpu_load = 0;
>         min_cpu_load = ~0UL;
> @@ -4538,15 +4533,9 @@ static inline void update_sg_lb_stats(st
>                 nr_running = rq->nr_running;
>
>                 /* Bias balancing toward cpus of our domain */
> -               if (local_group) {
> -                       if (idle_cpu(i) && !first_idle_cpu &&
> -                                       cpumask_test_cpu(i, sched_group_mask(group))) {
> -                               first_idle_cpu = 1;
> -                               balance_cpu = i;
> -                       }
> -
> +               if (local_group)
>                         load = target_load(i, load_idx);

Keep the braces here:

  if (local_group) {
    load = target_load(i, load_idx);
  } else {
   ...

> -               } else {
> +               else {
>                         load = source_load(i, load_idx);
>                         if (load > max_cpu_load)
>                                 max_cpu_load = load;
> @@ -4566,22 +4555,9 @@ static inline void update_sg_lb_stats(st
>                         sgs->idle_cpus++;
>         }
>
> -       /*
> -        * First idle cpu or the first cpu(busiest) in this sched group
> -        * is eligible for doing load balancing at this and above
> -        * domains. In the newly idle case, we will allow all the cpu's
> -        * to do the newly idle load balance.
> -        */
> -       if (local_group) {
> -               if (env->idle != CPU_NEWLY_IDLE) {
> -                       if (balance_cpu != env->dst_cpu) {
> -                               *balance = 0;
> -                               return;
> -                       }
> -                       update_group_power(env->sd, env->dst_cpu);
> -               } else if (time_after_eq(jiffies, group->sgp->next_update))
> -                       update_group_power(env->sd, env->dst_cpu);
> -       }
> +       if (local_group && (env->idle != CPU_NEWLY_IDLE ||
> +                       time_after_eq(jiffies, group->sgp->next_update)))
> +               update_group_power(env->sd, env->dst_cpu);
>
>         /* Adjust by relative CPU power of the group */
>         sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
> @@ -4663,7 +4639,7 @@ static bool update_sd_pick_busiest(struc
>   * @sds: variable to hold the statistics for this sched_domain.
>   */
>  static inline void update_sd_lb_stats(struct lb_env *env,
> -                                       int *balance, struct sd_lb_stats *sds)
> +                                       struct sd_lb_stats *sds)
>  {
>         struct sched_domain *child = env->sd->child;
>         struct sched_group *sg = env->sd->groups;
> @@ -4680,10 +4656,7 @@ static inline void update_sd_lb_stats(st
>
>                 local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
>                 memset(&sgs, 0, sizeof(sgs));
> -               update_sg_lb_stats(env, sg, load_idx, local_group, balance, &sgs);
> -
> -               if (local_group && !(*balance))
> -                       return;
> +               update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);
>
>                 sds->total_load += sgs.group_load;
>                 sds->total_pwr += sg->sgp->power;
> @@ -4916,8 +4889,6 @@ static inline void calculate_imbalance(s
>   * to restore balance.
>   *
>   * @env: The load balancing environment.
> - * @balance: Pointer to a variable indicating if this_cpu
> - *     is the appropriate cpu to perform load balancing at this_level.
>   *
>   * Return:     - The busiest group if imbalance exists.
>   *             - If no imbalance and user has opted for power-savings balance,
> @@ -4925,7 +4896,7 @@ static inline void calculate_imbalance(s
>   *                put to idle by rebalancing its tasks onto our group.
>   */
>  static struct sched_group *
> -find_busiest_group(struct lb_env *env, int *balance)
> +find_busiest_group(struct lb_env *env)
>  {
>         struct sd_lb_stats sds;
>
> @@ -4935,14 +4906,7 @@ find_busiest_group(struct lb_env *env, i
>          * Compute the various statistics relavent for load balancing at
>          * this level.
>          */
> -       update_sd_lb_stats(env, balance, &sds);
> -
> -       /*
> -        * this_cpu is not the appropriate cpu to perform load balancing at
> -        * this level.
> -        */
> -       if (!(*balance))
> -               goto ret;
> +       update_sd_lb_stats(env, &sds);
>
>         if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
>             check_asym_packing(env, &sds))
> @@ -5006,7 +4970,6 @@ find_busiest_group(struct lb_env *env, i
>         return sds.busiest;
>
>  out_balanced:
> -ret:
>         env->imbalance = 0;
>         return NULL;
>  }
> @@ -5088,13 +5051,47 @@ static int need_active_balance(struct lb
>
>  static int active_load_balance_cpu_stop(void *data);
>
> +static int should_we_balance(struct lb_env *env)
> +{
> +       struct sched_group *sg = env->sd->groups;
> +       struct cpumask *sg_cpus, *sg_mask;
> +       int cpu, balance_cpu = -1;
> +
> +       /*
> +        * In the newly idle case, we will allow all the cpu's
> +        * to do the newly idle load balance.
> +        */
> +       if (env->idle == CPU_NEWLY_IDLE)
> +               return 1;
> +
> +       sg_cpus = sched_group_cpus(sg);
> +       sg_mask = sched_group_mask(sg);
> +       /* Try to find first idle cpu */
> +       for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> +               if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> +                       continue;
> +
> +               balance_cpu = cpu;
> +               break;
> +       }
> +
> +       if (balance_cpu == -1)
> +               balance_cpu = group_balance_cpu(sg);
> +
> +       /*
> +        * First idle cpu or the first cpu(busiest) in this sched group
> +        * is eligible for doing load balancing at this and above domains.
> +        */
> +       return balance_cpu != env->dst_cpu;
> +}
> +
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
>   */
>  static int load_balance(int this_cpu, struct rq *this_rq,
>                         struct sched_domain *sd, enum cpu_idle_type idle,
> -                       int *balance)
> +                       int *should_balance)
>  {
>         int ld_moved, cur_ld_moved, active_balance = 0;
>         struct sched_group *group;
> @@ -5123,12 +5120,11 @@ static int load_balance(int this_cpu, st
>
>         schedstat_inc(sd, lb_count[idle]);
>
> -redo:
> -       group = find_busiest_group(&env, balance);
> -
> -       if (*balance == 0)
> +       if (!(*should_balance = should_we_balance(&env)))
>                 goto out_balanced;

Given we always initialize *should_balance where we care, it might be
more readable to write this as:

if (!should_we_balance(&env)) {
  *continue_balancing = 0;
  goto out_balanced;
}

[ With a rename to can_balance ]

>
> +redo:

One behavioral change worth noting here is that in the redo case if a
CPU has become idle we'll continue trying to load-balance in the
!new-idle case.

This could be unpleasant in the case where a package has a pinned busy
core allowing this and a newly idle cpu to start dueling for load.

While more deterministically bad in this case now, it could racily do
this before anyway so perhaps not worth worrying about immediately.

> +       group = find_busiest_group(&env);
>         if (!group) {
>                 schedstat_inc(sd, lb_nobusyg[idle]);
>                 goto out_balanced;
> @@ -5340,7 +5336,7 @@ void idle_balance(int this_cpu, struct r
>         rcu_read_lock();
>         for_each_domain(this_cpu, sd) {
>                 unsigned long interval;
> -               int balance = 1;
> +               int should_balance;
>
>                 if (!(sd->flags & SD_LOAD_BALANCE))
>                         continue;
> @@ -5348,7 +5344,8 @@ void idle_balance(int this_cpu, struct r
>                 if (sd->flags & SD_BALANCE_NEWIDLE) {
>                         /* If we've pulled tasks over stop searching: */
>                         pulled_task = load_balance(this_cpu, this_rq,
> -                                                  sd, CPU_NEWLY_IDLE, &balance);
> +                                                  sd, CPU_NEWLY_IDLE,
> +                                                  &should_balance);
>                 }
>
>                 interval = msecs_to_jiffies(sd->balance_interval);
> @@ -5586,7 +5583,7 @@ void update_max_interval(void)
>   */
>  static void rebalance_domains(int cpu, enum cpu_idle_type idle)
>  {
> -       int balance = 1;
> +       int should_balance = 1;
>         struct rq *rq = cpu_rq(cpu);
>         unsigned long interval;
>         struct sched_domain *sd;
> @@ -5618,7 +5615,7 @@ static void rebalance_domains(int cpu, e
>                 }
>
>                 if (time_after_eq(jiffies, sd->last_balance + interval)) {
> -                       if (load_balance(cpu, rq, sd, idle, &balance)) {
> +                       if (load_balance(cpu, rq, sd, idle, &should_balance)) {
>                                 /*
>                                  * The LBF_SOME_PINNED logic could have changed
>                                  * env->dst_cpu, so we can't know our idle
> @@ -5641,7 +5638,7 @@ static void rebalance_domains(int cpu, e
>                  * CPU in our sched group which is doing load balancing more
>                  * actively.
>                  */
> -               if (!balance)
> +               if (!should_balance)
>                         break;
>         }
>         rcu_read_unlock();
>
>

Reviewed-by: Paul Turner <pjt@google.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/10] sched: Remove one division operation in find_busiest_queue()
  2013-08-22  8:58   ` Paul Turner
@ 2013-08-22 10:25     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-22 10:25 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel

On Thu, Aug 22, 2013 at 01:58:28AM -0700, Paul Turner wrote:

> >         wl_i / power_i > wl_j / power_j :=
> >         wl_i * power_j > wl_j * power_i
> >
> >         struct rq *busiest = NULL, *rq;
> > -       unsigned long max_load = 0;
> > +       unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
> 
> Initializing this to SCHED_POWER_SCALE assigns a meaning that isn't
> really there.  How about just 1?

Right, 1 works, all we really need is for wl to be > 0.

> >         int i;
> >
> >         for_each_cpu(i, sched_group_cpus(group)) {
> > @@ -5049,10 +5049,9 @@ static struct rq *find_busiest_queue(str
> >                  * the load can be moved away from the cpu that is potentially
> >                  * running at a lower capacity.
> >                  */
> > -               wl = (wl * SCHED_POWER_SCALE) / power;
> > -
> > -               if (wl > max_load) {
> > -                       max_load = wl;
> 
> A comment wouldn't hurt here.

Agreed, something like so?

/*
 * Since we're looking for max(wl_i / power_i) crosswise multiplication
 * to rid ourselves of the division works out to:
 * wl_i * power_j > wl_j * power_i;  where j is our previous maximum.
 */

> > +               if (wl * busiest_power > busiest_load * power) {
> > +                       busiest_load = wl;
> > +                       busiest_power = power;
> >                         busiest = rq;
> >                 }
> >         }
> >
> >
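
As a sanity check, a small stand-alone sketch (made-up wl/power numbers, not
kernel code) showing that the crosswise comparison, seeded with
busiest_power = 1 as suggested above, picks the same rq the division form
would:

#include <stdio.h>

int main(void)
{
        /* Hypothetical per-cpu weighted loads and cpu power values. */
        unsigned long wl[]    = { 2048, 1536, 3072 };
        unsigned long power[] = { 1024,  512, 1178 };
        unsigned long busiest_load = 0, busiest_power = 1;
        int i, busiest = -1;

        for (i = 0; i < 3; i++) {
                /* wl[i]/power[i] > busiest_load/busiest_power, without dividing */
                if (wl[i] * busiest_power > busiest_load * power[i]) {
                        busiest_load = wl[i];
                        busiest_power = power[i];
                        busiest = i;
                }
        }

        /* 1536/512 == 3.0 is the largest ratio, so cpu 1 wins. */
        printf("busiest: cpu %d\n", busiest);
        return 0;
}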

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] sched: Factor out code to should_we_balance()
  2013-08-22  9:58   ` Paul Turner
@ 2013-08-22 10:42     ` Peter Zijlstra
  2013-08-23  4:51       ` Joonsoo Kim
  2013-08-23 11:37       ` Paul Turner
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-22 10:42 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Thu, Aug 22, 2013 at 02:58:27AM -0700, Paul Turner wrote:
> On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > +               if (local_group)
> >                         load = target_load(i, load_idx);
> 
> Keep the braces here:
> 
>   if (local_group) {
>     load = target_load(i, load_idx);
>   } else {
>    ...

Right you are, I misplaced that hunk in the next patch.

> > -               } else {
> > +               else {
> >                         load = source_load(i, load_idx);
> >                         if (load > max_cpu_load)
> >                                 max_cpu_load = load;


> > @@ -5123,12 +5120,11 @@ static int load_balance(int this_cpu, st
> >
> >         schedstat_inc(sd, lb_count[idle]);
> >
> > -redo:
> > -       group = find_busiest_group(&env, balance);
> > -
> > -       if (*balance == 0)
> > +       if (!(*should_balance = should_we_balance(&env)))
> >                 goto out_balanced;
> 
> Given we always initialize *should_balance where we care, it might be
> more readable to write this as:

Ah, but we don't actually, idle_balance() doesn't initialize
should_balance -- easy enough to fix though.

> if (!should_we_balance(&env)) {
>   *continue_balancing = 0;
>   goto out_balanced;
> }
> 
> [ With a rename to can_balance ]

You confused me, your code implied
%s/should_balance/continue_balancing/g but now you suggest
%%s/should_balance/can_balance/g ?

I'm fine with either.

> >
> > +redo:
> 
> One behavioral change worth noting here is that in the redo case if a
> CPU has become idle we'll continue trying to load-balance in the
> !new-idle case.
> 
> This could be unpleasant in the case where a package has a pinned busy
> core allowing this and a newly idle cpu to start dueling for load.
> 
> While more deterministically bad in this case now, it could racily do
> this before anyway so perhaps not worth worrying about immediately.

Ah, because the old code would effectively redo the check and find the
idle cpu and thereby our cpu would no longer be the balance_cpu.

Indeed. And I don't think this was an intentional change. I'll go put
the redo back before should_we_balance().


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/10] sched, fair: Make group power more consitent
  2013-08-19 16:01 ` [PATCH 06/10] sched, fair: Make group power more consitent Peter Zijlstra
@ 2013-08-23  3:40   ` Preeti U Murthy
  0 siblings, 0 replies; 47+ messages in thread
From: Preeti U Murthy @ 2013-08-23  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Rik van Riel, Joonsoo Kim

On 08/19/2013 09:31 PM, Peter Zijlstra wrote:


Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] sched: Factor out code to should_we_balance()
  2013-08-22 10:42     ` Peter Zijlstra
@ 2013-08-23  4:51       ` Joonsoo Kim
  2013-08-23 11:37       ` Paul Turner
  1 sibling, 0 replies; 47+ messages in thread
From: Joonsoo Kim @ 2013-08-23  4:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Ingo Molnar, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel

On Thu, Aug 22, 2013 at 12:42:57PM +0200, Peter Zijlstra wrote:
> > >
> > > +redo:
> > 
> > One behavioral change worth noting here is that in the redo case if a
> > CPU has become idle we'll continue trying to load-balance in the
> > !new-idle case.
> > 
> > This could be unpleasant in the case where a package has a pinned busy
> > core allowing this and a newly idle cpu to start dueling for load.
> > 
> > While more deterministically bad in this case now, it could racily do
> > this before anyway so perhaps not worth worrying about immediately.
> 
> Ah, because the old code would effectively redo the check and find the
> idle cpu and thereby our cpu would no longer be the balance_cpu.
> 
> Indeed. And I don't think this was an intentional change. I'll go put
> the redo back before should_we_balance().

Ah, yes.
It wasn't my intention. Please fix it.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
  2013-08-19 16:01 ` [PATCH 07/10] sched, fair: Optimize find_busiest_queue() Peter Zijlstra
@ 2013-08-23  8:11   ` Preeti U Murthy
  2013-08-23 10:03     ` Peter Zijlstra
  2013-08-24 10:33   ` Paul Turner
  1 sibling, 1 reply; 47+ messages in thread
From: Preeti U Murthy @ 2013-08-23  8:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Rik van Riel, Joonsoo Kim

Hi Peter,

On 08/19/2013 09:31 PM, Peter Zijlstra wrote:

In the load balancing code, looks to me that
cpumask_copy(cpus, cpu_active_mask) is not updating the env.cpus at all,
before calling find_busiest_group(). Am I missing something?

Should not cpumask_copy() below be before we update the env.cpus parameter?

static int load_balance(int this_cpu, struct rq *this_rq,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *balance)
{
	int ld_moved, cur_ld_moved, active_balance = 0;
	struct sched_group *group;
	struct rq *busiest;
	unsigned long flags;
	struct cpumask *cpus = __get_cpu_var(load_balance_mask);

	struct lb_env env = {
		.sd		= sd,
		.dst_cpu	= this_cpu,
		.dst_rq		= this_rq,
		.dst_grpmask    = sched_group_cpus(sd->groups),
		.idle		= idle,
		.loop_break	= sched_nr_migrate_break,
		.cpus		= cpus,
	};

	/*
	 * For NEWLY_IDLE load_balancing, we don't need to consider
	 * other cpus in our group
	 */
	if (idle == CPU_NEWLY_IDLE)
		env.dst_grpmask = NULL;

	cpumask_copy(cpus, cpu_active_mask);

	schedstat_inc(sd, lb_count[idle]);
redo:
	group = find_busiest_group(&env, balance);

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
  2013-08-23  8:11   ` Preeti U Murthy
@ 2013-08-23 10:03     ` Peter Zijlstra
  2013-08-23 10:54       ` Preeti U Murthy
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-23 10:03 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Rik van Riel, Joonsoo Kim

On Fri, Aug 23, 2013 at 01:41:55PM +0530, Preeti U Murthy wrote:
> Hi Peter,
> 
> On 08/19/2013 09:31 PM, Peter Zijlstra wrote:
> 
> In the load balancing code, looks to me that
> cpumask_copy(cpus, cpu_active_mask) is not updating the env.cpus at all,
> before calling find_busiest_group(). Am I missing something?
> 
> Should not cpumask_copy() below be before we update the env.cpus parameter?

Nah, its all pointers.
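
To spell that out, a tiny stand-alone sketch (simplified types, not the
kernel's struct lb_env) of why the later cpumask_copy() is visible through
env.cpus: only the pointer is stored when env is initialized, and the copy
writes through that same storage:

#include <stdio.h>
#include <string.h>

struct env { unsigned long *cpus; };		/* stand-in for lb_env::cpus */

int main(void)
{
        unsigned long mask = 0;			/* stand-in for load_balance_mask */
        struct env env = { .cpus = &mask };	/* only the pointer is captured */

        /* Analogous to cpumask_copy(cpus, cpu_active_mask) done afterwards:
         * it writes into the storage env.cpus already points at. */
        memset(&mask, 0xff, sizeof(mask));

        printf("seen via env.cpus: %#lx\n", *env.cpus);
        return 0;
}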

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
  2013-08-23 10:03     ` Peter Zijlstra
@ 2013-08-23 10:54       ` Preeti U Murthy
  0 siblings, 0 replies; 47+ messages in thread
From: Preeti U Murthy @ 2013-08-23 10:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Rik van Riel, Joonsoo Kim

On 08/23/2013 03:33 PM, Peter Zijlstra wrote:
> On Fri, Aug 23, 2013 at 01:41:55PM +0530, Preeti U Murthy wrote:
>> Hi Peter,
>>
>> On 08/19/2013 09:31 PM, Peter Zijlstra wrote:
>>
>> In the load balancing code, looks to me that
>> cpumask_copy(cpus, cpu_active_mask) is not updating the env.cpus at all,
>> before calling find_busiest_group(). Am I missing something?
>>
>> Should not cpumask_copy() below be before we update the env.cpus parameter?
> 
> Nah, its all pointers.
> 

Yikes! Of course :P

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] sched: Factor out code to should_we_balance()
  2013-08-22 10:42     ` Peter Zijlstra
  2013-08-23  4:51       ` Joonsoo Kim
@ 2013-08-23 11:37       ` Paul Turner
  1 sibling, 0 replies; 47+ messages in thread
From: Paul Turner @ 2013-08-23 11:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Thu, Aug 22, 2013 at 3:42 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Aug 22, 2013 at 02:58:27AM -0700, Paul Turner wrote:
>> On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > +               if (local_group)
>> >                         load = target_load(i, load_idx);
>>
>> Keep the braces here:
>>
>>   if (local_group) {
>>     load = target_load(i, load_idx);
>>   } else {
>>    ...
>
> Right you are, I misplaced that hunk in the next patch.
>
>> > -               } else {
>> > +               else {
>> >                         load = source_load(i, load_idx);
>> >                         if (load > max_cpu_load)
>> >                                 max_cpu_load = load;
>
>
>> > @@ -5123,12 +5120,11 @@ static int load_balance(int this_cpu, st
>> >
>> >         schedstat_inc(sd, lb_count[idle]);
>> >
>> > -redo:
>> > -       group = find_busiest_group(&env, balance);
>> > -
>> > -       if (*balance == 0)
>> > +       if (!(*should_balance = should_we_balance(&env)))
>> >                 goto out_balanced;
>>
>> Given we always initialize *should_balance where we care, it might be
>> more readable to write this as:
>
> Ah, but we don't actually, idle_balance() doesn't initialize
> should_balance -- easy enough to fix though.

I should have been more explicit here.

idle_balance() doesn't initialize it, but it completely ignores the
result; it's just passing something in case a write occurs.
(Arguably it should be int unused = 1 instead of int balance = 1)

>
>> if (!should_we_balance(&env)) {
>>   *continue_balancing = 0;
>>   goto out_balanced;
>> }
>>
>> [ With a rename to can_balance ]
>
> You confused me, your code implied
> %s/should_balance/continue_balancing/g but now you suggest
> %%s/should_balance/can_balance/g ?
>
> I'm fine with either.

My bad, I started typing this comment, then went and looked at other
parts of the patch and came back to it :)

I think continue_balancing is more clear.

>
>> >
>> > +redo:
>>
>> One behavioral change worth noting here is that in the redo case if a
>> CPU has become idle we'll continue trying to load-balance in the
>> !new-idle case.
>>
>> This could be unpleasant in the case where a package has a pinned busy
>> core allowing this and a newly idle cpu to start dueling for load.
>>
>> While more deterministically bad in this case now, it could racily do
>> this before anyway so perhaps not worth worrying about immediately.
>
> Ah, because the old code would effectively redo the check and find the
> idle cpu and thereby our cpu would no longer be the balance_cpu.
>
> Indeed. And I don't think this was an intentional change. I'll go put
> the redo back before should_we_balance().
>

I was trying to get through the rest of these today but I'm out of
time.  I should be able to finish reviewing the other patches in the
set tomorrow.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] sched: Clean-up struct sd_lb_stat
  2013-08-19 16:01 ` [PATCH 03/10] sched: Clean-up struct sd_lb_stat Peter Zijlstra
@ 2013-08-24 10:09   ` Paul Turner
  2013-08-26 11:38     ` Peter Zijlstra
  2013-08-26  2:56   ` Lei Wen
  1 sibling, 1 reply; 47+ messages in thread
From: Paul Turner @ 2013-08-24 10:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> There is no reason to maintain separate variables for this_group
> and busiest_group in sd_lb_stat, except saving some space.
> But this structure is always allocated on the stack, so this saving
> isn't really beneficial [peterz: reducing stack space is good; in this
> case readability increases enough that I think it's still beneficial]
>
> This patch unifies these variables, so IMO, readability may be improved.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> [peterz: lots of style edits, a few fixes and a rename]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com
> ---
>  kernel/sched/fair.c |  225 +++++++++++++++++++++++++---------------------------
>  1 file changed, 112 insertions(+), 113 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4277,36 +4277,6 @@ static unsigned long task_h_load(struct
>
>  /********** Helpers for find_busiest_group ************************/
>  /*
> - * sd_lb_stats - Structure to store the statistics of a sched_domain
> - *             during load balancing.
> - */
> -struct sd_lb_stats {
> -       struct sched_group *busiest; /* Busiest group in this sd */
> -       struct sched_group *this;  /* Local group in this sd */
> -       unsigned long total_load;  /* Total load of all groups in sd */
> -       unsigned long total_pwr;   /*   Total power of all groups in sd */
> -       unsigned long avg_load;    /* Average load across all groups in sd */
> -
> -       /** Statistics of this group */
> -       unsigned long this_load;
> -       unsigned long this_load_per_task;
> -       unsigned long this_nr_running;
> -       unsigned long this_has_capacity;
> -       unsigned int  this_idle_cpus;
> -
> -       /* Statistics of the busiest group */
> -       unsigned int  busiest_idle_cpus;
> -       unsigned long max_load;
> -       unsigned long busiest_load_per_task;
> -       unsigned long busiest_nr_running;
> -       unsigned long busiest_group_capacity;
> -       unsigned long busiest_has_capacity;
> -       unsigned int  busiest_group_weight;
> -
> -       int group_imb; /* Is there imbalance in this sd */
> -};
> -
> -/*
>   * sg_lb_stats - stats of a sched_group required for load_balancing
>   */
>  struct sg_lb_stats {
> @@ -4314,6 +4284,7 @@ struct sg_lb_stats {
>         unsigned long group_load; /* Total load over the CPUs of the group */
>         unsigned long sum_nr_running; /* Nr tasks running in the group */
>         unsigned long sum_weighted_load; /* Weighted load of group's tasks */
> +       unsigned long load_per_task;
>         unsigned long group_capacity;
>         unsigned long idle_cpus;
>         unsigned long group_weight;
> @@ -4321,6 +4292,21 @@ struct sg_lb_stats {
>         int group_has_capacity; /* Is there extra capacity in the group? */
>  };
>
> +/*
> + * sd_lb_stats - Structure to store the statistics of a sched_domain
> + *              during load balancing.
> + */
> +struct sd_lb_stats {
> +       struct sched_group *busiest;    /* Busiest group in this sd */
> +       struct sched_group *this;       /* Local group in this sd */
> +       unsigned long total_load;       /* Total load of all groups in sd */
> +       unsigned long total_pwr;        /* Total power of all groups in sd */
> +       unsigned long avg_load; /* Average load across all groups in sd */
> +
> +       struct sg_lb_stats this_stat;   /* Statistics of this group */
> +       struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> +};
> +
>  /**
>   * get_sd_load_idx - Obtain the load index for a given sched domain.
>   * @sd: The sched_domain whose load_idx is to be obtained.
> @@ -4533,10 +4519,11 @@ static inline void update_sg_lb_stats(st
>                 nr_running = rq->nr_running;
>
>                 /* Bias balancing toward cpus of our domain */
> -               if (local_group)
> +               if (local_group) {
>                         load = target_load(i, load_idx);
> -               else {
> +               } else {
>                         load = source_load(i, load_idx);
> +
>                         if (load > max_cpu_load)
>                                 max_cpu_load = load;
>                         if (min_cpu_load > load)
> @@ -4578,10 +4565,12 @@ static inline void update_sg_lb_stats(st
>             (max_nr_running - min_nr_running) > 1)
>                 sgs->group_imb = 1;
>
> -       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
> -                                               SCHED_POWER_SCALE);
> +       sgs->group_capacity =
> +               DIV_ROUND_CLOSEST(group->sgp->power, SCHED_POWER_SCALE);
> +
>         if (!sgs->group_capacity)
>                 sgs->group_capacity = fix_small_capacity(env->sd, group);
> +
>         sgs->group_weight = group->group_weight;
>
>         if (sgs->group_capacity > sgs->sum_nr_running)
> @@ -4606,7 +4595,7 @@ static bool update_sd_pick_busiest(struc
>                                    struct sched_group *sg,
>                                    struct sg_lb_stats *sgs)
>  {
> -       if (sgs->avg_load <= sds->max_load)
> +       if (sgs->avg_load <= sds->busiest_stat.avg_load)
>                 return false;
>
>         if (sgs->sum_nr_running > sgs->group_capacity)
> @@ -4643,7 +4632,7 @@ static inline void update_sd_lb_stats(st
>  {
>         struct sched_domain *child = env->sd->child;
>         struct sched_group *sg = env->sd->groups;
> -       struct sg_lb_stats sgs;
> +       struct sg_lb_stats tmp_sgs;
>         int load_idx, prefer_sibling = 0;
>
>         if (child && child->flags & SD_PREFER_SIBLING)
> @@ -4652,14 +4641,17 @@ static inline void update_sd_lb_stats(st
>         load_idx = get_sd_load_idx(env->sd, env->idle);
>
>         do {
> +               struct sg_lb_stats *sgs = &tmp_sgs;
>                 int local_group;
>
>                 local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
> -               memset(&sgs, 0, sizeof(sgs));
> -               update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);
> +               if (local_group) {
> +                       sds->this = sg;
> +                       sgs = &sds->this_stat;
> +               }
>
> -               sds->total_load += sgs.group_load;
> -               sds->total_pwr += sg->sgp->power;
> +               memset(sgs, 0, sizeof(*sgs));
> +               update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
>
>                 /*
>                  * In case the child domain prefers tasks go to siblings
> @@ -4671,26 +4663,17 @@ static inline void update_sd_lb_stats(st
>                  * heaviest group when it is already under-utilized (possible
>                  * with a large weight task outweighs the tasks on the system).
>                  */
> -               if (prefer_sibling && !local_group && sds->this_has_capacity)
> -                       sgs.group_capacity = min(sgs.group_capacity, 1UL);
> +               if (prefer_sibling && !local_group &&
> +                               sds->this && sds->this_stat.group_has_capacity)
> +                       sgs->group_capacity = min(sgs->group_capacity, 1UL);
>
> -               if (local_group) {
> -                       sds->this_load = sgs.avg_load;
> -                       sds->this = sg;
> -                       sds->this_nr_running = sgs.sum_nr_running;
> -                       sds->this_load_per_task = sgs.sum_weighted_load;
> -                       sds->this_has_capacity = sgs.group_has_capacity;
> -                       sds->this_idle_cpus = sgs.idle_cpus;
> -               } else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
> -                       sds->max_load = sgs.avg_load;
> +               /* Now, start updating sd_lb_stats */
> +               sds->total_load += sgs->group_load;
> +               sds->total_pwr += sg->sgp->power;
> +
> +               if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
>                         sds->busiest = sg;
> -                       sds->busiest_nr_running = sgs.sum_nr_running;
> -                       sds->busiest_idle_cpus = sgs.idle_cpus;
> -                       sds->busiest_group_capacity = sgs.group_capacity;
> -                       sds->busiest_load_per_task = sgs.sum_weighted_load;
> -                       sds->busiest_has_capacity = sgs.group_has_capacity;
> -                       sds->busiest_group_weight = sgs.group_weight;
> -                       sds->group_imb = sgs.group_imb;
> +                       sds->busiest_stat = *sgs;
>                 }
>
>                 sg = sg->next;
> @@ -4734,8 +4717,8 @@ static int check_asym_packing(struct lb_
>         if (env->dst_cpu > busiest_cpu)
>                 return 0;
>
> -       env->imbalance = DIV_ROUND_CLOSEST(
> -               sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
> +       env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
> +                               sds->busiest->sgp->power, SCHED_POWER_SCALE);
>
>         return 1;
>  }
> @@ -4753,24 +4736,23 @@ void fix_small_imbalance(struct lb_env *
>         unsigned long tmp, pwr_now = 0, pwr_move = 0;
>         unsigned int imbn = 2;
>         unsigned long scaled_busy_load_per_task;
> +       struct sg_lb_stats *this, *busiest;
>
> -       if (sds->this_nr_running) {
> -               sds->this_load_per_task /= sds->this_nr_running;
> -               if (sds->busiest_load_per_task >
> -                               sds->this_load_per_task)
> -                       imbn = 1;
> -       } else {
> -               sds->this_load_per_task =
> -                       cpu_avg_load_per_task(env->dst_cpu);
> -       }
> +       this = &sds->this_stat;
> +       busiest = &sds->busiest_stat;
> +
> +       if (!this->sum_nr_running)
> +               this->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
> +       else if (busiest->load_per_task > this->load_per_task)
> +               imbn = 1;
>
> -       scaled_busy_load_per_task = sds->busiest_load_per_task
> -                                        * SCHED_POWER_SCALE;
> -       scaled_busy_load_per_task /= sds->busiest->sgp->power;
> +       scaled_busy_load_per_task =
> +               (busiest->load_per_task * SCHED_POWER_SCALE) /
> +               sds->busiest->sgp->power;
>
> -       if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> -                       (scaled_busy_load_per_task * imbn)) {
> -               env->imbalance = sds->busiest_load_per_task;
> +       if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
> +           (scaled_busy_load_per_task * imbn)) {
> +               env->imbalance = busiest->load_per_task;
>                 return;
>         }
>
> @@ -4781,33 +4763,36 @@ void fix_small_imbalance(struct lb_env *
>          */
>
>         pwr_now += sds->busiest->sgp->power *
> -                       min(sds->busiest_load_per_task, sds->max_load);
> +                       min(busiest->load_per_task, busiest->avg_load);
>         pwr_now += sds->this->sgp->power *
> -                       min(sds->this_load_per_task, sds->this_load);
> +                       min(this->load_per_task, this->avg_load);
>         pwr_now /= SCHED_POWER_SCALE;
>
>         /* Amount of load we'd subtract */
> -       tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
> +       tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
>                 sds->busiest->sgp->power;
> -       if (sds->max_load > tmp)
> +       if (busiest->avg_load > tmp) {
>                 pwr_move += sds->busiest->sgp->power *
> -                       min(sds->busiest_load_per_task, sds->max_load - tmp);
> +                           min(busiest->load_per_task,
> +                               busiest->avg_load - tmp);
> +       }
>
>         /* Amount of load we'd add */
> -       if (sds->max_load * sds->busiest->sgp->power <
> -               sds->busiest_load_per_task * SCHED_POWER_SCALE)
> -               tmp = (sds->max_load * sds->busiest->sgp->power) /
> +       if (busiest->avg_load * sds->busiest->sgp->power <
> +           busiest->load_per_task * SCHED_POWER_SCALE) {
> +               tmp = (busiest->avg_load * sds->busiest->sgp->power) /
>                         sds->this->sgp->power;
> -       else
> -               tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
> +       } else {
> +               tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
>                         sds->this->sgp->power;
> +       }
>         pwr_move += sds->this->sgp->power *
> -                       min(sds->this_load_per_task, sds->this_load + tmp);
> +                       min(this->load_per_task, this->avg_load + tmp);
>         pwr_move /= SCHED_POWER_SCALE;
>
>         /* Move if we gain throughput */
>         if (pwr_move > pwr_now)
> -               env->imbalance = sds->busiest_load_per_task;
> +               env->imbalance = busiest->load_per_task;
>  }
>
>  /**
> @@ -4819,11 +4804,22 @@ void fix_small_imbalance(struct lb_env *
>  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>  {
>         unsigned long max_pull, load_above_capacity = ~0UL;
> +       struct sg_lb_stats *this, *busiest;
>
> -       sds->busiest_load_per_task /= sds->busiest_nr_running;
> -       if (sds->group_imb) {
> -               sds->busiest_load_per_task =
> -                       min(sds->busiest_load_per_task, sds->avg_load);
> +       this = &sds->this_stat;
> +       if (this->sum_nr_running) {
> +               this->load_per_task =
> +                       this->sum_weighted_load / this->sum_nr_running;
> +       }
> +
> +       busiest = &sds->busiest_stat;
> +       /* busiest must have some tasks */
> +       busiest->load_per_task =
> +               busiest->sum_weighted_load / busiest->sum_nr_running;
> +
> +       if (busiest->group_imb) {
> +               busiest->load_per_task =
> +                       min(busiest->load_per_task, sds->avg_load);
>         }
>
>         /*
> @@ -4831,20 +4827,19 @@ static inline void calculate_imbalance(s
>          * max load less than avg load(as we skip the groups at or below
>          * its cpu_power, while calculating max_load..)
>          */
> -       if (sds->max_load < sds->avg_load) {
> +       if (busiest->avg_load < sds->avg_load) {
>                 env->imbalance = 0;
>                 return fix_small_imbalance(env, sds);
>         }
>
> -       if (!sds->group_imb) {
> +       if (!busiest->group_imb) {
>                 /*
>                  * Don't want to pull so many tasks that a group would go idle.
>                  */
> -               load_above_capacity = (sds->busiest_nr_running -
> -                                               sds->busiest_group_capacity);
> +               load_above_capacity =
> +                       (busiest->sum_nr_running - busiest->group_capacity);
>
>                 load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
> -
>                 load_above_capacity /= sds->busiest->sgp->power;
>         }
>
> @@ -4858,12 +4853,14 @@ static inline void calculate_imbalance(s
>          * Be careful of negative numbers as they'll appear as very large values
>          * with unsigned longs.
>          */
> -       max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
> +       max_pull = min(busiest->avg_load - sds->avg_load,
> +                      load_above_capacity);
>
>         /* How much load to actually move to equalise the imbalance */
> -       env->imbalance = min(max_pull * sds->busiest->sgp->power,
> -               (sds->avg_load - sds->this_load) * sds->this->sgp->power)
> -                       / SCHED_POWER_SCALE;
> +       env->imbalance = min(
> +               max_pull * sds->busiest->sgp->power,
> +               (sds->avg_load - this->avg_load) * sds->this->sgp->power
> +       ) / SCHED_POWER_SCALE;
>
>         /*
>          * if *imbalance is less than the average load per runnable task
> @@ -4871,9 +4868,8 @@ static inline void calculate_imbalance(s
>          * a think about bumping its value to force at least one task to be
>          * moved
>          */
> -       if (env->imbalance < sds->busiest_load_per_task)
> +       if (env->imbalance < busiest->load_per_task)
>                 return fix_small_imbalance(env, sds);
> -
>  }
>
>  /******* find_busiest_group() helpers end here *********************/
> @@ -4895,9 +4891,9 @@ static inline void calculate_imbalance(s
>   *                return the least loaded group whose CPUs can be
>   *                put to idle by rebalancing its tasks onto our group.
>   */
> -static struct sched_group *
> -find_busiest_group(struct lb_env *env)
> +static struct sched_group *find_busiest_group(struct lb_env *env)
>  {
> +       struct sg_lb_stats *this, *busiest;

"this" is a little confusing to read; mainly because elsewhere we've
tied this to "this cpu" whereas the local sched group is larger.  (Not
to mention the obvious OOP-land overloading of "this->".)

Perhaps %s/this/local/ for sg_lb_stat references?  Including this_stat
-> local_stat on sd_lb_stats?
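
Concretely (a sketch only, assuming that rename; field order as in the
patch above), sd_lb_stats would end up reading something like:

	struct sd_lb_stats {
		struct sched_group *busiest;	/* Busiest group in this sd */
		struct sched_group *local;	/* Local group (contains this_cpu) */
		unsigned long total_load;	/* Total load of all groups in sd */
		unsigned long total_pwr;	/* Total power of all groups in sd */
		unsigned long avg_load;		/* Average load across all groups in sd */

		struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
		struct sg_lb_stats local_stat;	/* Statistics of the local group */
	};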

>         struct sd_lb_stats sds;
>
>         memset(&sds, 0, sizeof(sds));
> @@ -4907,13 +4903,15 @@ find_busiest_group(struct lb_env *env)
>          * this level.
>          */
>         update_sd_lb_stats(env, &sds);
> +       this = &sds.this_stat;
> +       busiest = &sds.busiest_stat;
>
>         if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
>             check_asym_packing(env, &sds))
>                 return sds.busiest;
>
>         /* There is no busy sibling group to pull tasks from */
> -       if (!sds.busiest || sds.busiest_nr_running == 0)
> +       if (!sds.busiest || busiest->sum_nr_running == 0)
>                 goto out_balanced;
>
>         sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
> @@ -4923,26 +4921,26 @@ find_busiest_group(struct lb_env *env)
>          * work because they assumes all things are equal, which typically
>          * isn't true due to cpus_allowed constraints and the like.
>          */
> -       if (sds.group_imb)
> +       if (busiest->group_imb)
>                 goto force_balance;
>
>         /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> -       if (env->idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
> -                       !sds.busiest_has_capacity)
> +       if (env->idle == CPU_NEWLY_IDLE && this->group_has_capacity &&
> +           !busiest->group_has_capacity)
>                 goto force_balance;
>
>         /*
>          * If the local group is more busy than the selected busiest group
>          * don't try and pull any tasks.
>          */
> -       if (sds.this_load >= sds.max_load)
> +       if (this->avg_load >= busiest->avg_load)
>                 goto out_balanced;
>
>         /*
>          * Don't pull any tasks if this group is already above the domain
>          * average load.
>          */
> -       if (sds.this_load >= sds.avg_load)
> +       if (this->avg_load >= sds.avg_load)
>                 goto out_balanced;
>
>         if (env->idle == CPU_IDLE) {
> @@ -4952,15 +4950,16 @@ find_busiest_group(struct lb_env *env)
>                  * there is no imbalance between this and busiest group
>                  * wrt to idle cpu's, it is balanced.
>                  */
> -               if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
> -                   sds.busiest_nr_running <= sds.busiest_group_weight)
> +               if ((this->idle_cpus <= busiest->idle_cpus + 1) &&
> +                   busiest->sum_nr_running <= busiest->group_weight)

While we're improving readability:  idle_cpus < busiest->idle_cpus ?

This check has always been a little oddly asymmetric in that:
  group_weight - sum_nr_running <= idle_cpus

This allows the case where our group has pulled lots of work to one of
its cpus but has not yet spread it out, yet we still keep trying to
balance because idle_cpus is high.

This is more food for thought since this patch is not changing functionality.
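
As a concrete (made-up) example of that asymmetry: two groups of 4 cpus
each, the local group with 4 tasks stacked on a single cpu (idle_cpus = 3,
sum_nr_running = 4) and the busiest group with one task per cpu
(idle_cpus = 0, sum_nr_running = 4 = group_weight). The first condition
(3 <= 0 + 1) is false, so out_balanced is skipped and we keep balancing
even though both groups run the same number of tasks.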

>                         goto out_balanced;
>         } else {
>                 /*
>                  * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
>                  * imbalance_pct to be conservative.
>                  */
> -               if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load)
> +               if (100 * busiest->avg_load <=
> +                               env->sd->imbalance_pct * this->avg_load)
>                         goto out_balanced;
>         }
>
>
Reviewed-by: Paul Turner <pjt@google.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games
  2013-08-19 16:01 ` [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games Peter Zijlstra
  2013-08-21  2:08   ` Joonsoo Kim
@ 2013-08-24 10:15   ` Paul Turner
  2013-08-26 11:46     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Paul Turner @ 2013-08-24 10:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> We can shrink sg_lb_stats because rq::nr_running is an 'unsigned int'
> and cpu numbers are 'int'
>
> Before:
>   sgs:        /* size: 72, cachelines: 2, members: 10 */
>   sds:        /* size: 184, cachelines: 3, members: 7 */
>
> After:
>   sgs:        /* size: 56, cachelines: 1, members: 10 */
>   sds:        /* size: 152, cachelines: 3, members: 7 */
>
> Further we can avoid clearing all of sds since we do a total
> clear/assignment of sg_stats in update_sg_lb_stats() with exception of
> busiest_stat.avg_load which is referenced in update_sd_pick_busiest().
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   42 +++++++++++++++++++++++++++++++++++-------
>  1 file changed, 35 insertions(+), 7 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4282,12 +4282,12 @@ static unsigned long task_h_load(struct
>  struct sg_lb_stats {
>         unsigned long avg_load; /*Avg load across the CPUs of the group */
>         unsigned long group_load; /* Total load over the CPUs of the group */
> -       unsigned long sum_nr_running; /* Nr tasks running in the group */
>         unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>         unsigned long load_per_task;
> -       unsigned long group_capacity;
> -       unsigned long idle_cpus;
> -       unsigned long group_weight;
> +       unsigned int sum_nr_running; /* Nr tasks running in the group */
> +       unsigned int group_capacity;
> +       unsigned int idle_cpus;
> +       unsigned int group_weight;
>         int group_imb; /* Is there an imbalance in the group ? */
>         int group_has_capacity; /* Is there extra capacity in the group? */
>  };
> @@ -4303,10 +4303,38 @@ struct sd_lb_stats {
>         unsigned long total_pwr;        /* Total power of all groups in sd */
>         unsigned long avg_load; /* Average load across all groups in sd */
>
> -       struct sg_lb_stats this_stat;   /* Statistics of this group */
>         struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> +       struct sg_lb_stats this_stat;   /* Statistics of this group */
>  };
>
> +static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> +{
> +       /*
> +        * struct sd_lb_stats {
> +        *         struct sched_group *       busiest;             //     0  8
> +        *         struct sched_group *       this;                //     8  8
> +        *         long unsigned int          total_load;          //    16  8
> +        *         long unsigned int          total_pwr;           //    24  8
> +        *         long unsigned int          avg_load;            //    32  8
> +        *         struct sg_lb_stats {
> +        *                 long unsigned int  avg_load;            //    40  8
> +        *                 long unsigned int  group_load;          //    48  8
> +        *                 ...
> +        *         } busiest_stat;                                 //    40 56
> +        *         struct sg_lb_stats         this_stat;           //    96 56
> +        *
> +        *         // size: 152, cachelines: 3, members: 7
> +        *         // last cacheline: 24 bytes
> +        * };
> +        *
> +        * Skimp on the clearing to avoid duplicate work. We can avoid clearing
> +        * this_stat because update_sg_lb_stats() does a full clear/assignment.
> +        * We must however clear busiest_stat::avg_load because
> +        * update_sd_pick_busiest() reads this before assignment.
> +        */

Does gcc seriously not get this right if we just write:
  sds->busiest = NULL;
  sds->local = NULL;
  ....

etc.?
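
For completeness, the stores the offsetof()-based memset below covers
would be exactly these (a sketch, keeping the local naming used above):

	sds->busiest = NULL;
	sds->local = NULL;
	sds->total_load = 0;
	sds->total_pwr = 0;
	sds->avg_load = 0;
	sds->busiest_stat.avg_load = 0;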

> +       memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));
> +}
> +
>  /**
>   * get_sd_load_idx - Obtain the load index for a given sched domain.
>   * @sd: The sched_domain whose load_idx is to be obtained.
> @@ -4665,7 +4693,7 @@ static inline void update_sd_lb_stats(st
>                  */
>                 if (prefer_sibling && !local_group &&
>                                 sds->this && sds->this_stat.group_has_capacity)
> -                       sgs->group_capacity = min(sgs->group_capacity, 1UL);
> +                       sgs->group_capacity = min(sgs->group_capacity, 1U);
>
>                 /* Now, start updating sd_lb_stats */
>                 sds->total_load += sgs->group_load;
> @@ -4896,7 +4924,7 @@ static struct sched_group *find_busiest_
>         struct sg_lb_stats *this, *busiest;
>         struct sd_lb_stats sds;
>
> -       memset(&sds, 0, sizeof(sds));
> +       init_sd_lb_stats(&sds);
>
>         /*
>          * Compute the various statistics relavent for load balancing at
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
  2013-08-19 16:01 ` [PATCH 07/10] sched, fair: Optimize find_busiest_queue() Peter Zijlstra
  2013-08-23  8:11   ` Preeti U Murthy
@ 2013-08-24 10:33   ` Paul Turner
  2013-08-26 12:07     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Paul Turner @ 2013-08-24 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> Use for_each_cpu_and() and thereby avoid computing the capacity for
> CPUs we know we're not interested in.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |    5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4977,7 +4977,7 @@ static struct rq *find_busiest_queue(str
>         unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
>         int i;
>
> -       for_each_cpu(i, sched_group_cpus(group)) {
> +       for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
>                 unsigned long power = power_of(i);
>                 unsigned long capacity = DIV_ROUND_CLOSEST(power,
>                                                            SCHED_POWER_SCALE);
> @@ -4986,9 +4986,6 @@ static struct rq *find_busiest_queue(str
>                 if (!capacity)
>                         capacity = fix_small_capacity(env->sd, group);
>
> -               if (!cpumask_test_cpu(i, env->cpus))
> -                       continue;
> -
>                 rq = cpu_rq(i);
>                 wl = weighted_cpuload(i);

There's also no need to actually do the divisions immediately below this.

e.g.
  unsigned long max_load_power = SCHED_POWER_SCALE;
  ...
  if (wl * max_load_power > max_load * power) {
   max_load = wl;
   max_load_power = power;
   ...

This would actually end up being a little more accurate even.

[ Alternatively without caching max_load_power we could compare wl *
power vs max_load * SCHED_POWER_SCALE. ]
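
A small made-up example of the accuracy point:

	wl_i = 100, power_i = 1024  ->  (100 * 1024) / 1024 = 100
	wl_j = 100, power_j = 1023  ->  (100 * 1024) / 1023 = 100 (truncated)

The divided comparison ties, so the winner depends on iteration order,
while the cross-multiplied comparison (wl_j * power_i > wl_i * power_j,
i.e. 102400 > 102300) correctly prefers the cpu with less power behind
the same load.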

Reviewed-by: Paul Turner <pjt@google.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code
  2013-08-19 16:01 ` [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code Peter Zijlstra
@ 2013-08-24 10:45   ` Paul Turner
  2013-08-26 12:09     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2013-08-24 10:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I found that on my wsm box I had a redundant domain:
>
> [    0.949769] CPU0 attaching sched-domain:
> [    0.953765]  domain 0: span 0,12 level SIBLING
> [    0.958335]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
> [    0.964548]   domain 1: span 0-5,12-17 level MC
> [    0.969206]    groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
> [    0.984993]    domain 2: span 0-5,12-17 level CPU
> [    0.989822]     groups: 0-5,12-17 (cpu_power = 7055)
> [    0.995049]     domain 3: span 0-23 level NUMA
> [    0.999620]      groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
>
> Note how domain 2 has only a single group and spans the same CPUs as
> domain 1. We should not keep such domains and do in fact have code to
> prune these.
>
> It turns out that the 'new' SD_PREFER_SIBLING flag causes this; it
> makes sd_parent_degenerate() fail on the CPU domain. We can easily fix
> this by 'ignoring' the SD_PREFER_SIBLING bit and transferring it to
> whatever domain ends up covering the span.
>
> With this patch the domains now look like this:
>
> [    0.950419] CPU0 attaching sched-domain:
> [    0.954454]  domain 0: span 0,12 level SIBLING
> [    0.959039]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
> [    0.965271]   domain 1: span 0-5,12-17 level MC
> [    0.969936]    groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
> [    0.985737]    domain 2: span 0-23 level NUMA
> [    0.990231]     groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4948,7 +4948,8 @@ sd_parent_degenerate(struct sched_domain
>                                 SD_BALANCE_FORK |
>                                 SD_BALANCE_EXEC |
>                                 SD_SHARE_CPUPOWER |
> -                               SD_SHARE_PKG_RESOURCES);
> +                               SD_SHARE_PKG_RESOURCES |
> +                               SD_PREFER_SIBLING);
>                 if (nr_node_ids == 1)
>                         pflags &= ~SD_SERIALIZE;
>         }
> @@ -5157,6 +5158,13 @@ cpu_attach_domain(struct sched_domain *s
>                         tmp->parent = parent->parent;
>                         if (parent->parent)
>                                 parent->parent->child = tmp;
> +                       /*
> +                        * Transfer SD_PREFER_SIBLING down in case of a
> +                        * degenerate parent; the spans match for this
> +                        * so the property transfers.
> +                        */
> +                       if (parent->flags & SD_PREFER_SIBLING)
> +                               tmp->flags |= SD_PREFER_SIBLING;
>                         destroy_sched_domain(parent, cpu);
>                 } else
>                         tmp = tmp->parent;
>

Reviewed-by: Paul Turner <pjt@google.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] sched: Clean-up struct sd_lb_stat
  2013-08-19 16:01 ` [PATCH 03/10] sched: Clean-up struct sd_lb_stat Peter Zijlstra
  2013-08-24 10:09   ` Paul Turner
@ 2013-08-26  2:56   ` Lei Wen
  2013-08-26  4:36     ` Paul Turner
  1 sibling, 1 reply; 47+ messages in thread
From: Lei Wen @ 2013-08-26  2:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Preeti U Murthy, Vincent Guittot,
	Morten Rasmussen, Namhyung Kim, Lei Wen, Rik van Riel,
	Joonsoo Kim

On Tue, Aug 20, 2013 at 12:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> There is no reason to maintain separate variables for this_group
> and busiest_group in sd_lb_stat, except saving some space.
> But this structure is always allocated on the stack, so this saving
> isn't really beneficial [peterz: reducing stack space is good; in this
> case readability increases enough that I think it's still beneficial]
>
> This patch unifies these variables, so IMO readability may be improved.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> [peterz: lots of style edits, a few fixes and a rename]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com
> ---
>  kernel/sched/fair.c |  225 +++++++++++++++++++++++++---------------------------
>  1 file changed, 112 insertions(+), 113 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4277,36 +4277,6 @@ static unsigned long task_h_load(struct
>
[snip]...
> -       env->imbalance = DIV_ROUND_CLOSEST(
> -               sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
> +       env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
> +                               sds->busiest->sgp->power, SCHED_POWER_SCALE);
>

I am wondering whether changing this line as below would be more
appropriate, since it would avoid the division here:
       env->imbalance = (sds->busiest_stat.avg_load * sds->busiest->sgp->power)
                                      >> SCHED_POWER_SHIFT;

I am not sure whether the compiler would be smart enough to convert this
into a >> operation if it sees that SCHED_POWER_SCALE is 1024 here.

Thanks,
Lei

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] sched: Clean-up struct sd_lb_stat
  2013-08-26  2:56   ` Lei Wen
@ 2013-08-26  4:36     ` Paul Turner
  2013-08-26  8:42       ` Lei Wen
  0 siblings, 1 reply; 47+ messages in thread
From: Paul Turner @ 2013-08-26  4:36 UTC (permalink / raw)
  To: Lei Wen
  Cc: Peter Zijlstra, Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith,
	Alex Shi, Preeti U Murthy, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Rik van Riel, Joonsoo Kim

On Sun, Aug 25, 2013 at 7:56 PM, Lei Wen <adrian.wenl@gmail.com> wrote:
> On Tue, Aug 20, 2013 at 12:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>> There is no reason to maintain separate variables for this_group
>> and busiest_group in sd_lb_stat, except saving some space.
>> But this structure is always allocated on the stack, so this saving
>> isn't really beneficial [peterz: reducing stack space is good; in this
>> case readability increases enough that I think it's still beneficial]
>>
>> This patch unifies these variables, so IMO readability may be improved.
>>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> [peterz: lots of style edits, a few fixes and a rename]
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com
>> ---
>>  kernel/sched/fair.c |  225 +++++++++++++++++++++++++---------------------------
>>  1 file changed, 112 insertions(+), 113 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4277,36 +4277,6 @@ static unsigned long task_h_load(struct
>>
> [snip]...
>> -       env->imbalance = DIV_ROUND_CLOSEST(
>> -               sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
>> +       env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
>> +                               sds->busiest->sgp->power, SCHED_POWER_SCALE);
>>
>
> I am wondering whether changing this line as below would be more
> appropriate, since it would avoid the division here:
>        env->imbalance = (sds->busiest_stat.avg_load * sds->busiest->sgp->power)
>                                       >> SCHED_POWER_SHIFT;
>
> I am not sure whether the compiler would be smart enough to convert this
> into a >> operation if it sees that SCHED_POWER_SCALE is 1024 here.

This would change the rounding.  Fortunately, gcc is smart enough to
handle this.
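
(For a concrete example of the rounding difference: with SCHED_POWER_SCALE
being 1024, DIV_ROUND_CLOSEST(1536, 1024) works out to (1536 + 512) / 1024 = 2,
whereas 1536 >> 10 = 1. And since the operands are unsigned and the divisor
is a compile-time constant power of two, gcc turns the division into an add
plus a shift anyway, with no div instruction.)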

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] sched: Clean-up struct sd_lb_stat
  2013-08-26  4:36     ` Paul Turner
@ 2013-08-26  8:42       ` Lei Wen
  0 siblings, 0 replies; 47+ messages in thread
From: Lei Wen @ 2013-08-26  8:42 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith,
	Alex Shi, Preeti U Murthy, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Rik van Riel, Joonsoo Kim

On Mon, Aug 26, 2013 at 12:36 PM, Paul Turner <pjt@google.com> wrote:
> On Sun, Aug 25, 2013 at 7:56 PM, Lei Wen <adrian.wenl@gmail.com> wrote:
>> On Tue, Aug 20, 2013 at 12:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>
>>> There is no reason to maintain separate variables for this_group
>>> and busiest_group in sd_lb_stat, except saving some space.
>>> But this structure is always allocated on the stack, so this saving
>>> isn't really beneficial [peterz: reducing stack space is good; in this
>>> case readability increases enough that I think it's still beneficial]
>>>
>>> This patch unifies these variables, so IMO readability may be improved.
>>>
>>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> [peterz: lots of style edits, a few fixes and a rename]
>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>> Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com
>>> ---
>>>  kernel/sched/fair.c |  225 +++++++++++++++++++++++++---------------------------
>>>  1 file changed, 112 insertions(+), 113 deletions(-)
>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4277,36 +4277,6 @@ static unsigned long task_h_load(struct
>>>
>> [snip]...
>>> -       env->imbalance = DIV_ROUND_CLOSEST(
>>> -               sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
>>> +       env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
>>> +                               sds->busiest->sgp->power, SCHED_POWER_SCALE);
>>>
>>
>> I am wondering whether changing this line as below would be more
>> appropriate, since it would avoid the division here:
>>        env->imbalance = (sds->busiest_stat.avg_load * sds->busiest->sgp->power)
>>                                       >> SCHED_POWER_SHIFT;
>>
>> I am not sure whether the compiler would be smart enough to convert this
>> into a >> operation if it sees that SCHED_POWER_SCALE is 1024 here.
>
> This would change the rounding.  Fortunately, gcc is smart enough to
> handle this.

Indeed, I checked the assembly code, and gcc really is smart enough to do the conversion.

Thanks,
Lei

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] sched: Clean-up struct sd_lb_stat
  2013-08-24 10:09   ` Paul Turner
@ 2013-08-26 11:38     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-26 11:38 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Sat, Aug 24, 2013 at 03:09:38AM -0700, Paul Turner wrote:
> On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > +       struct sg_lb_stats *this, *busiest;
> 
> "this" is a little confusing to read; mainly because elsewhere we've
> tied this to "this cpu" whereas the local sched group is larger.  (Not
> to mention the obvious OOP-land overloading of "this->".)
> 
> Perhaps %s/this/local/ for sg_lb_stat references?  Including this_stat
> -> local_stat on sd_lb_stats?

fair enough, I'll edit the thing to be local.

> > @@ -4952,15 +4950,16 @@ find_busiest_group(struct lb_env *env)
> >                  * there is no imbalance between this and busiest group
> >                  * wrt to idle cpu's, it is balanced.
> >                  */
> > -               if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
> > -                   sds.busiest_nr_running <= sds.busiest_group_weight)
> > +               if ((this->idle_cpus <= busiest->idle_cpus + 1) &&
> > +                   busiest->sum_nr_running <= busiest->group_weight)
> 
> While we're improving readability:  idle_cpus < busiest->idle_cpus ?

Right, took that.

> This check has always been a little oddly asymmetric in that:
>   group_weight - sum_nr_running <= idle_cpus
> 
> This allows the case where our group has pulled lots of work to one of
> its cpus but has not yet spread it out, yet we still keep trying to
> balance because idle_cpus is high.
> 
> This is more food for thought since this patch is not changing functionality.

Right, I saw the same and made a note to look at it later. I suppose
later never happens though :/ 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games
  2013-08-24 10:15   ` Paul Turner
@ 2013-08-26 11:46     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-26 11:46 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Sat, Aug 24, 2013 at 03:15:57AM -0700, Paul Turner wrote:
> > +static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> > +{
> > +       /*
> > +        * struct sd_lb_stats {
> > +        *         struct sched_group *       busiest;             //     0  8
> > +        *         struct sched_group *       this;                //     8  8
> > +        *         long unsigned int          total_load;          //    16  8
> > +        *         long unsigned int          total_pwr;           //    24  8
> > +        *         long unsigned int          avg_load;            //    32  8
> > +        *         struct sg_lb_stats {
> > +        *                 long unsigned int  avg_load;            //    40  8
> > +        *                 long unsigned int  group_load;          //    48  8
> > +        *                 ...
> > +        *         } busiest_stat;                                 //    40 56
> > +        *         struct sg_lb_stats         this_stat;           //    96 56
> > +        *
> > +        *         // size: 152, cachelines: 3, members: 7
> > +        *         // last cacheline: 24 bytes
> > +        * };
> > +        *
> > +        * Skimp on the clearing to avoid duplicate work. We can avoid clearing
> > +        * this_stat because update_sg_lb_stats() does a full clear/assignment.
> > +        * We must however clear busiest_stat::avg_load because
> > +        * update_sd_pick_busiest() reads this before assignment.
> > +        */
> 
> Does gcc seriously not get this right if we just write:
>   sds->busiest = NULL;
>   sds->local = NULL;
>   ....
> 
> etc.?

There's a thought ;-) How about something like this?

 sds = (struct sd_lb_stats){
 	.busiest = NULL,
	.local = NULL,

	.busiest_stat = {
		.avg_load = 0,
	},
 };

C99 named initializers should ignore all unmentioned fields, and I think
we don't actually touch any of the other fields unless something is set.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
  2013-08-24 10:33   ` Paul Turner
@ 2013-08-26 12:07     ` Peter Zijlstra
  2013-08-27  9:13       ` Paul Turner
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-26 12:07 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Sat, Aug 24, 2013 at 03:33:59AM -0700, Paul Turner wrote:
> On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > +++ b/kernel/sched/fair.c
> > @@ -4977,7 +4977,7 @@ static struct rq *find_busiest_queue(str
> >         unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
> >         int i;
> >
> > -       for_each_cpu(i, sched_group_cpus(group)) {
> > +       for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
> >                 unsigned long power = power_of(i);
> >                 unsigned long capacity = DIV_ROUND_CLOSEST(power,
> >                                                            SCHED_POWER_SCALE);
> > @@ -4986,9 +4986,6 @@ static struct rq *find_busiest_queue(str
> >                 if (!capacity)
> >                         capacity = fix_small_capacity(env->sd, group);
> >
> > -               if (!cpumask_test_cpu(i, env->cpus))
> > -                       continue;
> > -
> >                 rq = cpu_rq(i);
> >                 wl = weighted_cpuload(i);
> 
> There's also no need to actually do the divisions immediately below this.
> 
> e.g.
>   unsigned long max_load_power = SCHED_POWER_SCALE;
>   ...
>   if (wl * max_load_power > max_load * power) {
>    max_load = wl;
>    max_load_power = power;
>    ...
> 
> This would actually end up being a little more accurate even.
> 
> [ Alternatively without caching max_load_power we could compare wl *
> power vs max_load * SCHED_POWER_SCALE. ]

You've got me confused again. You're talking about something like the
below?

I suppose the problem with that is that we could keep selecting the
busiest rq with an unmovable task due to:

move_tasks():
		if ((load / 2) > env->imbalance)
			goto next;

That said, the condition in fbq() should at least be modified to match
this. Now the entire capacity crap comes from:

  bdb94aa sched: Try to deal with low capacity

But thinking a little more about it, if power drops that low, imbalance
is likely to be _huge_ and we'd not meet that condition. Now if only I
had written a more comprehensive Changelog and explained why that
wouldn't be the case. /me kicks himself.
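
(On the capacity term itself: with SCHED_POWER_SCALE = 1024,
DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) only rounds down to 0 for
power < 512, i.e. when a cpu reports less than half of one nominal unit
of capacity, which is presumably the low-capacity situation bdb94aa is
about.)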

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4990,28 +4990,12 @@ static struct sched_group *find_busiest_
 static struct rq *find_busiest_queue(struct lb_env *env,
 				     struct sched_group *group)
 {
-	struct rq *busiest = NULL, *rq;
 	unsigned long busiest_load = 0, busiest_power = 1;
+	struct rq *busiest = NULL;
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power,
-							   SCHED_POWER_SCALE);
-		unsigned long wl;
-
-		if (!capacity)
-			capacity = fix_small_capacity(env->sd, group);
-
-		rq = cpu_rq(i);
-		wl = weighted_cpuload(i);
-
-		/*
-		 * When comparing with imbalance, use weighted_cpuload()
-		 * which is not scaled with the cpu power.
-		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
-			continue;
+		unsigned long wl = weighted_cpuload(i);
 
 		/*
 		 * For the load comparisons with the other cpu's, consider
@@ -5027,7 +5011,7 @@ static struct rq *find_busiest_queue(str
 		if (wl * busiest_power > busiest_load * power) {
 			busiest_load = wl;
 			busiest_power = power;
-			busiest = rq;
+			busiest = cpu_rq(i);
 		}
 	}
 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code
  2013-08-24 10:45   ` Paul Turner
@ 2013-08-26 12:09     ` Peter Zijlstra
  2013-08-26 21:49       ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-26 12:09 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Sat, Aug 24, 2013 at 03:45:57AM -0700, Paul Turner wrote:
> > @@ -5157,6 +5158,13 @@ cpu_attach_domain(struct sched_domain *s
> >                         tmp->parent = parent->parent;
> >                         if (parent->parent)
> >                                 parent->parent->child = tmp;
> > +                       /*
> > +                        * Transfer SD_PREFER_SIBLING down in case of a
> > +                        * degenerate parent; the spans match for this
> > +                        * so the property transfers.
> > +                        */
> > +                       if (parent->flags & SD_PREFER_SIBLING)
> > +                               tmp->flags |= SD_PREFER_SIBLING;
> >                         destroy_sched_domain(parent, cpu);
> >                 } else
> >                         tmp = tmp->parent;
> >
> 
> Reviewed-by: Paul Turner <pjt@google.com>

BTW, did that comment make sense to you or would you suggest something
different? I had/am having a hard time with that comment. Somehow it
leaves me wanting. I know I understand the issue now, but I doubt the
comment will suffice in a year's time :/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code
  2013-08-26 12:09     ` Peter Zijlstra
@ 2013-08-26 21:49       ` Rik van Riel
  2013-08-27  9:05         ` Paul Turner
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2013-08-26 21:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith,
	Alex Shi, Preeti U Murthy, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Joonsoo Kim

On 08/26/2013 08:09 AM, Peter Zijlstra wrote:
> On Sat, Aug 24, 2013 at 03:45:57AM -0700, Paul Turner wrote:
>>> @@ -5157,6 +5158,13 @@ cpu_attach_domain(struct sched_domain *s
>>>                         tmp->parent = parent->parent;
>>>                         if (parent->parent)
>>>                                 parent->parent->child = tmp;
>>> +                       /*
>>> +                        * Transfer SD_PREFER_SIBLING down in case of a
>>> +                        * degenerate parent; the spans match for this
>>> +                        * so the property transfers.
>>> +                        */
>>> +                       if (parent->flags & SD_PREFER_SIBLING)
>>> +                               tmp->flags |= SD_PREFER_SIBLING;
>>>                         destroy_sched_domain(parent, cpu);
>>>                 } else
>>>                         tmp = tmp->parent;
>>>
>>
>> Reviewed-by: Paul Turner <pjt@google.com>
> 
> BTW, did that comment make sense to you or would you suggest something
> different? I had/am having a hard time with that comment. Somehow it
> leaves me wanting. I know I understand the issue now, but I doubt the
> comment will suffice in a year's time :/

The comment made sense to me :)

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code
  2013-08-26 21:49       ` Rik van Riel
@ 2013-08-27  9:05         ` Paul Turner
  0 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2013-08-27  9:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith,
	Alex Shi, Preeti U Murthy, Vincent Guittot, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Joonsoo Kim

On Mon, Aug 26, 2013 at 2:49 PM, Rik van Riel <riel@surriel.com> wrote:
> On 08/26/2013 08:09 AM, Peter Zijlstra wrote:
>> On Sat, Aug 24, 2013 at 03:45:57AM -0700, Paul Turner wrote:
>>>> @@ -5157,6 +5158,13 @@ cpu_attach_domain(struct sched_domain *s
>>>>                         tmp->parent = parent->parent;
>>>>                         if (parent->parent)
>>>>                                 parent->parent->child = tmp;
>>>> +                       /*
>>>> +                        * Transfer SD_PREFER_SIBLING down in case of a
>>>> +                        * degenerate parent; the spans match for this
>>>> +                        * so the property transfers.
>>>> +                        */
>>>> +                       if (parent->flags & SD_PREFER_SIBLING)
>>>> +                               tmp->flags |= SD_PREFER_SIBLING;
>>>>                         destroy_sched_domain(parent, cpu);
>>>>                 } else
>>>>                         tmp = tmp->parent;
>>>>
>>>
>>> Reviewed-by: Paul Turner <pjt@google.com>
>>
>> BTW, did that comment make sense to you or would you suggest something
>> different? I had/am having a hard time with that comment. Somehow it
>> leaves me wanting. I know I understand the issue now, but I doubt the
>> comment will suffice in a year's time :/
>
> The comment made sense to me :)

It makes sense once you read the code.  That we push the flag down is
somewhat counterintuitive on a first reading.

>
> --
> All rights reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
  2013-08-26 12:07     ` Peter Zijlstra
@ 2013-08-27  9:13       ` Paul Turner
  0 siblings, 0 replies; 47+ messages in thread
From: Paul Turner @ 2013-08-27  9:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, LKML, Mike Galbraith, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Rik van Riel, Joonsoo Kim

On Mon, Aug 26, 2013 at 5:07 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Aug 24, 2013 at 03:33:59AM -0700, Paul Turner wrote:
>> On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > +++ b/kernel/sched/fair.c
>> > @@ -4977,7 +4977,7 @@ static struct rq *find_busiest_queue(str
>> >         unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
>> >         int i;
>> >
>> > -       for_each_cpu(i, sched_group_cpus(group)) {
>> > +       for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
>> >                 unsigned long power = power_of(i);
>> >                 unsigned long capacity = DIV_ROUND_CLOSEST(power,
>> >                                                            SCHED_POWER_SCALE);
>> > @@ -4986,9 +4986,6 @@ static struct rq *find_busiest_queue(str
>> >                 if (!capacity)
>> >                         capacity = fix_small_capacity(env->sd, group);
>> >
>> > -               if (!cpumask_test_cpu(i, env->cpus))
>> > -                       continue;
>> > -
>> >                 rq = cpu_rq(i);
>> >                 wl = weighted_cpuload(i);
>>
>> There's also no need to actually do the divisions immediately below this.
>>
>> e.g.
>>   unsigned long max_load_power = SCHED_POWER_SCALE;
>>   ...
>>   if (wl * max_load_power > max_load * power) {
>>    max_load = wl;
>>    max_load_power = power;
>>    ...
>>
>> This would actually end up being a little more accurate even.
>>
>> [ Alternatively without caching max_load_power we could compare wl *
>> power vs max_load * SCHED_POWER_SCALE. ]
>
> You've got me confused again. You're talking about something like the
> below?

Nevermind, I was looking at a tip tree as I reviewed this one.  What I
was suggesting was exactly:
  "[PATCH 01/10] sched: Remove one division operation in find_busiest_queue()"

>
> I suppose the problem with that is that we could keep selecting the
> busiest rq with an unmovable task due to:
>
> move_tasks():
>                 if ((load / 2) > env->imbalance)
>                         goto next;
>
> That said, the condition in fbq() should at least be modified to match
> this. Now the entire capacity crap comes from:
>
>   bdb94aa sched: Try to deal with low capacity
>
> But thinking a little more about it, if power drops that low, imbalance
> is likely to be _huge_ and we'd not meet that condition. Now if only I
> had written a more comprehensive Changelog and explained why that
> wouldn't be the case. /me kicks himself.
>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4990,28 +4990,12 @@ static struct sched_group *find_busiest_
>  static struct rq *find_busiest_queue(struct lb_env *env,
>                                      struct sched_group *group)
>  {
> -       struct rq *busiest = NULL, *rq;
>         unsigned long busiest_load = 0, busiest_power = 1;
> +       struct rq *busiest = NULL;
>         int i;
>
>         for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
> -               unsigned long power = power_of(i);
> -               unsigned long capacity = DIV_ROUND_CLOSEST(power,
> -                                                          SCHED_POWER_SCALE);
> -               unsigned long wl;
> -
> -               if (!capacity)
> -                       capacity = fix_small_capacity(env->sd, group);
> -
> -               rq = cpu_rq(i);
> -               wl = weighted_cpuload(i);
> -
> -               /*
> -                * When comparing with imbalance, use weighted_cpuload()
> -                * which is not scaled with the cpu power.
> -                */
> -               if (capacity && rq->nr_running == 1 && wl > env->imbalance)
> -                       continue;
> +               unsigned long wl = weighted_cpuload(i);
>
>                 /*
>                  * For the load comparisons with the other cpu's, consider
> @@ -5027,7 +5011,7 @@ static struct rq *find_busiest_queue(str
>                 if (wl * busiest_power > busiest_load * power) {
>                         busiest_load = wl;
>                         busiest_power = power;
> -                       busiest = rq;
> +                       busiest = cpu_rq(i);
>                 }
>         }
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC][PATCH 11/10] sched, fair: Reduce local_group logic
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2013-08-21  2:09 ` [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Joonsoo Kim
@ 2013-08-28  8:55 ` Peter Zijlstra
  2013-08-28  8:57   ` Peter Zijlstra
  2013-08-28  9:16   ` Peter Zijlstra
  2013-08-28 11:14 ` [PATCH 12/10] sched, fair: Fix group power_orig computation Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-28  8:55 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel


Subject: sched, fair: Reduce local_group logic
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Aug 28 10:32:32 CEST 2013

Try and reduce the local_group logic by pulling most of it into
update_sd_lb_stats.

We could completely get rid of the local_group variable, but that
results in two calls to update_sg_lb_stats() and that blows up the
object size by about 2k :/

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4561,6 +4561,8 @@ static inline void update_sg_lb_stats(st
 	unsigned long load;
 	int i;
 
+	memset(sgs, 0, sizeof(*sgs));
+
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
@@ -4579,10 +4581,6 @@ static inline void update_sg_lb_stats(st
 			sgs->idle_cpus++;
 	}
 
-	if (local_group && (env->idle != CPU_NEWLY_IDLE ||
-			time_after_eq(jiffies, group->sgp->next_update)))
-		update_group_power(env->sd, env->dst_cpu);
-
 	/* Adjust by relative CPU power of the group */
 	sgs->group_power = group->sgp->power;
 	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;
@@ -4675,11 +4673,17 @@ static inline void update_sd_lb_stats(st
 		if (local_group) {
 			sds->local = sg;
 			sgs = &sds->local_stat;
+
+			if (env->idle != CPU_NEWLY_IDLE ||
+			    time_after_eq(jiffies, sg->sgp->next_update))
+				update_group_power(env->sd, env->dst_cpu);
 		}
 
-		memset(sgs, 0, sizeof(*sgs));
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
 
+		if (local_group)
+			goto next_group;
+
 		/*
 		 * In case the child domain prefers tasks go to siblings
 		 * first, lower the sg capacity to one so that we'll try
@@ -4690,19 +4694,20 @@ static inline void update_sd_lb_stats(st
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
-		if (prefer_sibling && !local_group &&
-				sds->local && sds->local_stat.group_has_capacity)
+		if (prefer_sibling && sds->local &&
+		    sds->local_stat.group_has_capacity)
 			sgs->group_capacity = min(sgs->group_capacity, 1U);
 
-		/* Now, start updating sd_lb_stats */
-		sds->total_load += sgs->group_load;
-		sds->total_pwr += sgs->group_power;
-
-		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
+		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
 			sds->busiest_stat = *sgs;
 		}
 
+next_group:
+		/* Now, start updating sd_lb_stats */
+		sds->total_load += sgs->group_load;
+		sds->total_pwr += sgs->group_power;
+
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 11/10] sched, fair: Reduce local_group logic
  2013-08-28  8:55 ` [RFC][PATCH 11/10] sched, fair: Reduce local_group logic Peter Zijlstra
@ 2013-08-28  8:57   ` Peter Zijlstra
  2013-08-28  9:16   ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-28  8:57 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel


# ls -la defconfig-build/kernel/sched/fair.o*
-rw-r--r-- 1 root root 760688 Aug 28 10:53 defconfig-build/kernel/sched/fair.o
-rw-r--r-- 1 root root 758160 Aug 28 10:36 defconfig-build/kernel/sched/fair.o.orig

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4553,9 +4553,9 @@ static inline int sg_imbalanced(struct s
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
  */
-static inline void update_sg_lb_stats(struct lb_env *env,
+static void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
-			int local_group, struct sg_lb_stats *sgs)
+			int group_bias, struct sg_lb_stats *sgs)
 {
 	unsigned long nr_running;
 	unsigned long load;
@@ -4568,8 +4568,7 @@ static inline void update_sg_lb_stats(st
 
 		nr_running = rq->nr_running;
 
-		/* Bias balancing toward cpus of our domain */
-		if (local_group)
+		if (group_bias)
 			load = target_load(i, load_idx);
 		else
 			load = source_load(i, load_idx);
@@ -4667,22 +4666,21 @@ static inline void update_sd_lb_stats(st
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
-		int local_group;
 
-		local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
-		if (local_group) {
+		if (cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg))) {
 			sds->local = sg;
 			sgs = &sds->local_stat;
 
 			if (env->idle != CPU_NEWLY_IDLE ||
 			    time_after_eq(jiffies, sg->sgp->next_update))
 				update_group_power(env->sd, env->dst_cpu);
-		}
-
-		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
 
-		if (local_group)
+			/* Bias balancing toward cpus of our domain */
+			update_sg_lb_stats(env, sg, load_idx, 1, sgs);
 			goto next_group;
+		}
+
+		update_sg_lb_stats(env, sg, load_idx, 0, sgs);
 
 		/*
 		 * In case the child domain prefers tasks go to siblings

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 11/10] sched, fair: Reduce local_group logic
  2013-08-28  8:55 ` [RFC][PATCH 11/10] sched, fair: Reduce local_group logic Peter Zijlstra
  2013-08-28  8:57   ` Peter Zijlstra
@ 2013-08-28  9:16   ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-28  9:16 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel

On Wed, Aug 28, 2013 at 10:55:42AM +0200, Peter Zijlstra wrote:
> @@ -4690,19 +4694,20 @@ static inline void update_sd_lb_stats(st
>  		 * heaviest group when it is already under-utilized (possible
>  		 * with a large weight task outweighs the tasks on the system).
>  		 */
> +		if (prefer_sibling && sds->local &&
> +		    sds->local_stat.group_has_capacity)
>  			sgs->group_capacity = min(sgs->group_capacity, 1U);

While we're here, I think it's always true that sds->local is set,
because env->dst_cpu should always be part of the local group, and the
local group is always sd->groups; since that now directly skips to
next_group, we'll not get here without this being true.

Hmm?
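
For the record, a minimal sketch of that invariant (illustration only,
not part of the patch; it assumes env->sd is dst_cpu's own domain, so
env->sd->groups is the group that contains dst_cpu):

	/* illustration: the first group walked is the local one */
	static inline void assert_local_group_first(struct lb_env *env)
	{
		struct sched_group *first = env->sd->groups;

		WARN_ON_ONCE(!cpumask_test_cpu(env->dst_cpu,
					       sched_group_cpus(first)));
	}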

>  
> +		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>  			sds->busiest = sg;
>  			sds->busiest_stat = *sgs;
>  		}
>  
> +next_group:
> +		/* Now, start updating sd_lb_stats */
> +		sds->total_load += sgs->group_load;
> +		sds->total_pwr += sgs->group_power;
> +
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
>  }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 12/10] sched, fair: Fix group power_orig computation
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2013-08-28  8:55 ` [RFC][PATCH 11/10] sched, fair: Reduce local_group logic Peter Zijlstra
@ 2013-08-28 11:14 ` Peter Zijlstra
  2013-08-28 11:15 ` [PATCH 13/10] sched, fair: Rework and comment the group_capacity code Peter Zijlstra
  2013-08-28 11:16 ` [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation Peter Zijlstra
  14 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-28 11:14 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel, Michael Neuling


Subject: sched, fair: Fix group power_orig computation
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Aug 28 11:44:39 CEST 2013

When looking at the code I noticed we don't actually compute
sgp->power_orig correctly for groups, fix that.

Currently the only consumer of that value is fix_small_capacity(),
which is only used on POWER7+, and that code excludes this case by
being limited to SD_SHARE_CPUPOWER, which is only ever set on the SMT
domain; that must be the lowest domain and it has singleton groups.

So nothing should be affected by this change.
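
To make the difference concrete (assuming the default smt_gain of 1178,
so each SMT-2 sibling reports power_orig = 589): a core-level group
should end up with

	power_orig = 2 * 589 = 1178

regardless of rt/irq pressure, whereas ->power is the scaled sum and
can be lower. Before this patch both fields got the scaled sum; now
only ->power does.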

Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4448,7 +4448,7 @@ void update_group_power(struct sched_dom
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long power;
+	unsigned long power, power_orig;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -4460,7 +4460,7 @@ void update_group_power(struct sched_dom
 		return;
 	}
 
-	power = 0;
+	power_orig = power = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -4468,8 +4468,12 @@ void update_group_power(struct sched_dom
 		 * span the current group.
 		 */
 
-		for_each_cpu(cpu, sched_group_cpus(sdg))
-			power += power_of(cpu);
+		for_each_cpu(cpu, sched_group_cpus(sdg)) {
+			struct sched_group *sg = cpu_rq(cpu)->sd->groups;
+
+			power_orig += sg->sgp->power_orig;
+			power += sg->sgp->power;
+		}
 	} else  {
 		/*
 		 * !SD_OVERLAP domains can assume that child groups
@@ -4478,12 +4482,14 @@ void update_group_power(struct sched_dom
 
 		group = child->groups;
 		do {
+			power_orig += group->sgp->power_orig;
 			power += group->sgp->power;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgp->power_orig = sdg->sgp->power = power;
+	sdg->sgp->power_orig = power_orig;
+	sdg->sgp->power = power;
 }
 
 /*

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 13/10] sched, fair: Rework and comment the group_capacity code
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2013-08-28 11:14 ` [PATCH 12/10] sched, fair: Fix group power_orig computation Peter Zijlstra
@ 2013-08-28 11:15 ` Peter Zijlstra
  2013-08-28 11:16 ` [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation Peter Zijlstra
  14 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-28 11:15 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel


Subject: sched, fair: Rework and comment the group_capacity code
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Aug 28 11:50:34 CEST 2013

Pull out the group_capacity computation so that we can more clearly
comment its issues.
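
As an example of the issue the new comment below describes (assuming
the default smt_gain of 1178, so an SMT-2 core free of rt/irq pressure
contributes a power of 1178): a group of 8 such cores sums to
8 * 1178 = 9424 and

	DIV_ROUND_CLOSEST(9424, SCHED_POWER_SCALE) = 9

so we report 9 'cores' where only 8 exist.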

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4551,6 +4551,27 @@ static inline int sg_imbalanced(struct s
 	return group->sgp->imbalance;
 }
 
+/*
+ * Compute the group capacity.
+ *
+ * For now the capacity is simply the number of power units in the group_power.
+ * A power unit represents a full core.
+ *
+ * This has an issue where N*frac(smt_power) >= 1, in that case we'll see extra
+ * 'cores' that aren't actually there.
+ */
+static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
+{
+
+	unsigned int power = group->sgp->power;
+	unsigned int capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
+
+	if (!capacity)
+		capacity = fix_small_capacity(env->sd, group);
+
+	return capacity;
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -4594,16 +4615,11 @@ static inline void update_sg_lb_stats(st
 	if (sgs->sum_nr_running)
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	sgs->group_imb = sg_imbalanced(group);
-
-	sgs->group_capacity =
-		DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
-
-	if (!sgs->group_capacity)
-		sgs->group_capacity = fix_small_capacity(env->sd, group);
-
 	sgs->group_weight = group->group_weight;
 
+	sgs->group_imb = sg_imbalanced(group);
+	sgs->group_capacity = sg_capacity(env, group);
+
 	if (sgs->group_capacity > sgs->sum_nr_running)
 		sgs->group_has_capacity = 1;
 }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation
  2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2013-08-28 11:15 ` [PATCH 13/10] sched, fair: Rework and comment the group_capacity code Peter Zijlstra
@ 2013-08-28 11:16 ` Peter Zijlstra
  2013-09-04  7:44   ` Vincent Guittot
  14 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-08-28 11:16 UTC (permalink / raw)
  To: Ingo Molnar, Joonsoo Kim
  Cc: linux-kernel, Mike Galbraith, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Lei Wen, Joonsoo Kim, Rik van Riel


Subject: sched, fair: Fix the group_capacity computation
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Aug 28 12:40:38 CEST 2013

Do away with 'phantom' cores due to N*frac(smt_power) >= 1 by limiting
the capacity to the actual number of cores.

The assumption of 1 < smt_power < 2 is an actual requirement because
of what SMT is, so this should work regardless of the SMT
implementation.
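
Worked example (same assumption of a default smt_gain of 1178): a group
of 8 SMT-2 cores, i.e. 16 cpus with power_orig = 9424, now gets

	smt = DIV_ROUND_UP(SCHED_POWER_SCALE * 16, 9424) = 2
	capacity = 16 / 2 = 8

instead of the phantom 9th core that
DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) alone used to produce.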

It can still be defeated by creative use of cpu hotplug, but if you're
one of those freaks, you get to live with it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4554,18 +4554,24 @@ static inline int sg_imbalanced(struct s
 /*
  * Compute the group capacity.
  *
- * For now the capacity is simply the number of power units in the group_power.
- * A power unit represents a full core.
- *
- * This has an issue where N*frac(smt_power) >= 1, in that case we'll see extra
- * 'cores' that aren't actually there.
+ * Avoid the issue where N*frac(smt_power) >= 1 creates 'phantom' cores by
+ * first dividing out the smt factor to compute the actual number of cores,
+ * and then limiting the power unit capacity with that.
  */
 static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
 {
+	unsigned int capacity, smt, cpus;
+	unsigned int power, power_orig;
+
+	power = group->sgp->power;
+	power_orig = group->sgp->power_orig;
+	cpus = group->group_weight;
 
-	unsigned int power = group->sgp->power;
-	unsigned int capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
+	/* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
+	smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
+	capacity = cpus / smt; /* cores */
 
+	capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
 	if (!capacity)
 		capacity = fix_small_capacity(env->sd, group);
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation
  2013-08-28 11:16 ` [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation Peter Zijlstra
@ 2013-09-04  7:44   ` Vincent Guittot
  0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2013-09-04  7:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Joonsoo Kim, linux-kernel, Mike Galbraith,
	Paul Turner, Alex Shi, Preeti U Murthy, Morten Rasmussen,
	Namhyung Kim, Lei Wen, Joonsoo Kim, Rik van Riel

On 28 August 2013 13:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Subject: sched, fair: Fix the group_capacity computation
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Aug 28 12:40:38 CEST 2013
>
> Do away with 'phantom' cores due to N*frac(smt_power) >= 1 by limiting
> the capacity to the actual number of cores.
>

Peter,

your patch also solves the 'phantom' big cores that can appear on HMP
systems, because big cores have a cpu_power >= SCHED_POWER_SCALE in
order to express a higher capacity than LITTLE cores.
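
For instance (made-up but representative numbers): a cluster of 4 big
cores each advertising cpu_power 1300 sums to 5200, and the old
DIV_ROUND_CLOSEST(5200, SCHED_POWER_SCALE) = 5 reported one 'core' more
than actually exists; with your patch the capacity is clamped to the 4
real cores.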

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

Vincent

> The assumption of 1 < smt_power < 2 is an actual requirement because
> of what SMT is so this should work regardless of the SMT
> implementation.
>
> It can still be defeated by creative use of cpu hotplug, but if you're
> one of those freaks, you get to live with it.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4554,18 +4554,24 @@ static inline int sg_imbalanced(struct s
>  /*
>   * Compute the group capacity.
>   *
> - * For now the capacity is simply the number of power units in the group_power.
> - * A power unit represents a full core.
> - *
> - * This has an issue where N*frac(smt_power) >= 1, in that case we'll see extra
> - * 'cores' that aren't actually there.
> + * Avoid the issue where N*frac(smt_power) >= 1 creates 'phantom' cores by
> + * first dividing out the smt factor to compute the actual number of cores,
> + * and then limiting the power unit capacity with that.
>   */
>  static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
>  {
> +       unsigned int capacity, smt, cpus;
> +       unsigned int power, power_orig;
> +
> +       power = group->sgp->power;
> +       power_orig = group->sgp->power_orig;
> +       cpus = group->group_weight;
>
> -       unsigned int power = group->sgp->power;
> -       unsigned int capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
> +       /* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
> +       smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
> +       capacity = cpus / smt; /* cores */
>
> +       capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
>         if (!capacity)
>                 capacity = fix_small_capacity(env->sd, group);
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

Thread overview: 47+ messages
2013-08-19 16:00 [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Peter Zijlstra
2013-08-19 16:00 ` [PATCH 01/10] sched: Remove one division operation in find_busiest_queue() Peter Zijlstra
2013-08-22  8:58   ` Paul Turner
2013-08-22 10:25     ` Peter Zijlstra
2013-08-19 16:01 ` [PATCH 02/10] sched: Factor out code to should_we_balance() Peter Zijlstra
2013-08-22  9:58   ` Paul Turner
2013-08-22 10:42     ` Peter Zijlstra
2013-08-23  4:51       ` Joonsoo Kim
2013-08-23 11:37       ` Paul Turner
2013-08-19 16:01 ` [PATCH 03/10] sched: Clean-up struct sd_lb_stat Peter Zijlstra
2013-08-24 10:09   ` Paul Turner
2013-08-26 11:38     ` Peter Zijlstra
2013-08-26  2:56   ` Lei Wen
2013-08-26  4:36     ` Paul Turner
2013-08-26  8:42       ` Lei Wen
2013-08-19 16:01 ` [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games Peter Zijlstra
2013-08-21  2:08   ` Joonsoo Kim
2013-08-21  2:20     ` Joonsoo Kim
2013-08-21  8:38       ` Peter Zijlstra
2013-08-21  8:35     ` Peter Zijlstra
2013-08-24 10:15   ` Paul Turner
2013-08-26 11:46     ` Peter Zijlstra
2013-08-19 16:01 ` [PATCH 05/10] sched, fair: Remove duplicate load_per_task computations Peter Zijlstra
2013-08-19 16:01 ` [PATCH 06/10] sched, fair: Make group power more consistent Peter Zijlstra
2013-08-23  3:40   ` Preeti U Murthy
2013-08-19 16:01 ` [PATCH 07/10] sched, fair: Optimize find_busiest_queue() Peter Zijlstra
2013-08-23  8:11   ` Preeti U Murthy
2013-08-23 10:03     ` Peter Zijlstra
2013-08-23 10:54       ` Preeti U Murthy
2013-08-24 10:33   ` Paul Turner
2013-08-26 12:07     ` Peter Zijlstra
2013-08-27  9:13       ` Paul Turner
2013-08-19 16:01 ` [PATCH 08/10] sched, fair: Rework and comment the group_imb code Peter Zijlstra
2013-08-19 16:01 ` [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code Peter Zijlstra
2013-08-24 10:45   ` Paul Turner
2013-08-26 12:09     ` Peter Zijlstra
2013-08-26 21:49       ` Rik van Riel
2013-08-27  9:05         ` Paul Turner
2013-08-19 16:01 ` [RFC][PATCH 10/10] sched, fair: Rewrite group_imb trigger Peter Zijlstra
2013-08-21  2:09 ` [PATCH 00/10] Various load-balance cleanups/optimizations -v2 Joonsoo Kim
2013-08-28  8:55 ` [RFC][PATCH 11/10] sched, fair: Reduce local_group logic Peter Zijlstra
2013-08-28  8:57   ` Peter Zijlstra
2013-08-28  9:16   ` Peter Zijlstra
2013-08-28 11:14 ` [PATCH 12/10] sched, fair: Fix group power_orig computation Peter Zijlstra
2013-08-28 11:15 ` [PATCH 13/10] sched, fair: Rework and comment the group_capacity code Peter Zijlstra
2013-08-28 11:16 ` [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation Peter Zijlstra
2013-09-04  7:44   ` Vincent Guittot
