* [PATCH 0/7] sched/fair: fixes and cleanups
@ 2016-04-29 19:32 Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 1/7] sched/fair: Remove remaining power aware scheduling comments Dietmar Eggemann
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

This is a collection of rather loosely coupled small changes to the cfs 
scheduler related to: 

Comments (patch 1/7, 2/7):

  Remove/fix some outdated comments related to 'power aware scheduling'
  and smpnice.

Load-balance (patch 3/7 - 5/7):

  Change the unit of load_above_capacity in calculate_imbalance() from
  [load] to [load/capacity].

  Refactor the logic in fix_small_imbalance() and remove
  cpu_avg_load_per_task().

Misc (patch 6/7, 7/7):

  Reorder update_sd_lb_stats() and use group_cfs_rq(se) throughout
  fair.c.

Test-results:

IVB-EP (2*10*2):

  perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 5000

   Before:                     After:
   4.784556741 ( +- 0.99% )    4.760953662 ( +- 0.47% )

Juno (ARM64) (2B/4l*1):

  perf stat --null --repeat 10 -- perf bench sched messaging -g 20 -l 500

  Before:                     After:
  7.227439638 ( +- 0.50% )    7.151685712 ( +- 0.64% )

Dietmar Eggemann (5):
  sched/fair: Remove remaining power aware scheduling comments
  sched/fair: Fix comment in calculate_imbalance()
  sched/fair: Clean up the logic in fix_small_imbalance()
  sched/fair: Reorder code in update_sd_lb_stats()
  sched/fair: Use group_cfs_rq(se) instead of se->my_q

Morten Rasmussen (2):
  sched/fair: Correct unit of load_above_capacity
  sched/fair: Remove cpu_avg_load_per_task()

 kernel/sched/fair.c | 71 ++++++++++++++++++++---------------------------------
 1 file changed, 26 insertions(+), 45 deletions(-)

-- 
1.9.1


* [PATCH 1/7] sched/fair: Remove remaining power aware scheduling comments
  2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
@ 2016-04-29 19:32 ` Dietmar Eggemann
  2016-05-05  9:42   ` [tip:sched/core] sched/fair: Remove stale " tip-bot for Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 2/7] sched/fair: Fix comment in calculate_imbalance() Dietmar Eggemann
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

Commit 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling remnants
and dysfunctional knobs") deleted the power aware scheduling support.

This patch gets rid of the remaining power aware scheduling comments.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8a33abce650..c77e01b55ec2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7025,9 +7025,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * We're trying to get all the cpus to the average_load, so we don't
 	 * want to push ourselves above the average load, nor do we wish to
 	 * reduce the max loaded cpu below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group capacity
-	 * (so that we can implement power-savings policies etc). Thus we look
-	 * for the minimum possible imbalance.
+	 * we also don't want to reduce the group load below the group
+	 * capacity. Thus we look for the minimum possible imbalance.
 	 */
 	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
 
@@ -7051,10 +7050,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
- * if there is an imbalance. If there isn't an imbalance, and
- * the user has opted for power-savings, it returns a group whose
- * CPUs can be put to idle by rebalancing those tasks elsewhere, if
- * such a group exists.
+ * if there is an imbalance.
  *
  * Also calculates the amount of weighted load which should be moved
  * to restore balance.
@@ -7062,9 +7058,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
  * @env: The load balancing environment.
  *
  * Return:	- The busiest group if imbalance exists.
- *		- If no imbalance and user has opted for power-savings balance,
- *		   return the least loaded group whose CPUs can be
- *		   put to idle by rebalancing its tasks onto our group.
  */
 static struct sched_group *find_busiest_group(struct lb_env *env)
 {
-- 
1.9.1


* [PATCH 2/7] sched/fair: Fix comment in calculate_imbalance()
  2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 1/7] sched/fair: Remove remaining power aware scheduling comments Dietmar Eggemann
@ 2016-04-29 19:32 ` Dietmar Eggemann
  2016-05-05  9:42   ` [tip:sched/core] " tip-bot for Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 3/7] sched/fair: Correct unit of load_above_capacity Dietmar Eggemann
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

The comment survived from commit 2dd73a4f09be ("[PATCH] sched:
implement smpnice") in which the selection of busiest and
busiest->avg_load (max_load) was based on the condition
if (avg_load > max_load && sum_nr_running > group->cpu_power /
SCHED_LOAD_SCALE).

Commit b18855500fc4 ("sched/balancing: Fix 'local->avg_load >
sds->avg_load' case in calculate_imbalance()") added the second
operand of the or operator.

Update this comment accordingly and also use the current variable
names.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c77e01b55ec2..bc19fee94f39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6998,9 +6998,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	}
 
 	/*
-	 * In the presence of smp nice balancing, certain scenarios can have
-	 * max load less than avg load(as we skip the groups at or below
-	 * its cpu_capacity, while calculating max_load..)
+	 * Avg load of busiest sg can be less and avg load of local sg can
+	 * be greater than avg load across all sgs of sd because avg load
+	 * factors in sg capacity and sgs with smaller group_type are
+	 * skipped when updating the busiest sg.
 	 */
 	if (busiest->avg_load <= sds->avg_load ||
 	    local->avg_load >= sds->avg_load) {
-- 
1.9.1


* [PATCH 3/7] sched/fair: Correct unit of load_above_capacity
  2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 1/7] sched/fair: Remove remaining power aware scheduling comments Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 2/7] sched/fair: Fix comment in calculate_imbalance() Dietmar Eggemann
@ 2016-04-29 19:32 ` Dietmar Eggemann
  2016-05-03 10:52   ` Peter Zijlstra
  2016-05-12 10:31   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  2016-04-29 19:32 ` [PATCH 4/7] sched/fair: Clean up the logic in fix_small_imbalance() Dietmar Eggemann
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

From: Morten Rasmussen <morten.rasmussen@arm.com>

In calculate_imbalance() load_above_capacity currently has the unit
[load] while it is used as being [load/capacity]. Not only is it wrong,
it also makes it unlikely that load_above_capacity is ever used, as the
subsequent code picks the smaller of load_above_capacity and the
avg_load difference between the local and the busiest sched_groups.

This patch ensures that load_above_capacity has the right unit
[load/capacity].

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
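
Not part of the patch: a minimal userspace sketch of the new computation,
for illustration only. The constants and the worked numbers are assumptions
(SCHED_LOAD_SCALE == SCHED_CAPACITY_SCALE == 1024, a two-CPU group running
three always-running nice-0 tasks), not taken from this series.

  #include <stdio.h>

  #define SCHED_LOAD_SCALE 1024UL   /* treated as one always-running nice-0 task */

  int main(void)
  {
          unsigned long sum_nr_running = 3;               /* tasks in the busiest group */
          unsigned long group_capacity = 2 * 1024UL;      /* two CPUs at full capacity */
          unsigned long load_above_capacity;

          load_above_capacity = sum_nr_running * SCHED_LOAD_SCALE;   /* [load] */
          if (load_above_capacity > group_capacity) {
                  load_above_capacity -= group_capacity;
                  /* convert [load] into [load/capacity], as the patch does */
                  load_above_capacity *= SCHED_LOAD_SCALE;
                  load_above_capacity /= group_capacity;
          } else {
                  load_above_capacity = ~0UL;
          }

          /* prints 512 */
          printf("load_above_capacity = %lu\n", load_above_capacity);
          return 0;
  }

If load_above_capacity ends up being picked as max_pull, it is multiplied
back by group_capacity / SCHED_CAPACITY_SCALE (here 2) when computing
env->imbalance, giving 1024, i.e. about one nice-0 task worth of load to
move, which is what one would expect for 3 such tasks on 2 CPUs.
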
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc19fee94f39..c066574cff04 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
 					SCHED_LOAD_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
+		if (load_above_capacity > busiest->group_capacity) {
 			load_above_capacity -= busiest->group_capacity;
-		else
+			load_above_capacity *= SCHED_LOAD_SCALE;
+			load_above_capacity /= busiest->group_capacity;
+		} else
 			load_above_capacity = ~0UL;
 	}
 
-- 
1.9.1


* [PATCH 4/7] sched/fair: Clean up the logic in fix_small_imbalance()
  2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
                   ` (2 preceding siblings ...)
  2016-04-29 19:32 ` [PATCH 3/7] sched/fair: Correct unit of load_above_capacity Dietmar Eggemann
@ 2016-04-29 19:32 ` Dietmar Eggemann
  2016-05-03 10:12   ` Peter Zijlstra
  2016-04-29 19:32 ` [PATCH 5/7] sched/fair: Remove cpu_avg_load_per_task() Dietmar Eggemann
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

Avoid the need to add scaled_busy_load_per_task on both sides of the if
condition to determine whether imbalance has to be set to
busiest->load_per_task or not.

The imbn variable was introduced with commit 2dd73a4f09be ("[PATCH]
sched: implement smpnice") and the original if condition was

  if (max_load - this_load >= busiest_load_per_task * imbn)

which over time changed into the current version, where
scaled_busy_load_per_task appears on both sides of the if condition.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

The original smpnice implementation sets imbalance to the avg task
load of the busiest sched group (sg) if the difference between the avg
load of the busiest sg and the local sg is greater than or equal to
imbn times the avg task load of the busiest sg. imbn is 2 if there is
no task running in the local sg or if the avg task load of the busiest
sg is smaller than or equal to the avg task load of the local sg.
Otherwise the imbn factor is lowered to 1.

imbn set to 2 makes sense when all the tasks have the same priority:
if there are n tasks on the local sg, there have to be n+2 tasks on the
busiest sg to give load balance the chance to pull over one task.

imbn set to 1 makes sense when the avg task load of the busiest sg is
greater than the one of the local sg, since there could be at least one
task with a smaller load on the busiest sg which can be pulled over to
the local sg.

The current version effectively lowered imbn by one, so we use 1 instead
of 2 in case the avg task load of sg_busiest isn't greater than the one
of sg_local, and 0 instead of 1 in the other case. This behaviour is not
in sync with the explanation above of why we set imbn to certain values.
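
For illustration only, a small standalone check (made-up numbers, not
kernel code) that the rewritten condition is the old one with
scaled_busy_load_per_task subtracted from both sides, i.e. imbn
effectively lowered by one:

  #include <stdbool.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned long busiest_avg_load = 2000, local_avg_load = 1200;
          unsigned long scaled_busy_load_per_task = 600;
          unsigned int old_imbn = 2, new_imbn = old_imbn - 1;

          /* condition before the patch */
          bool old_cond = busiest_avg_load + scaled_busy_load_per_task >=
                          local_avg_load + scaled_busy_load_per_task * old_imbn;

          /* condition after the patch */
          bool new_cond = busiest_avg_load >=
                          local_avg_load + scaled_busy_load_per_task * new_imbn;

          printf("old: %d new: %d\n", old_cond, new_cond); /* both print 1 */
          return 0;
  }
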

 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c066574cff04..dc4828bbe50d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6915,7 +6915,7 @@ static inline
 void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	unsigned long tmp, capa_now = 0, capa_move = 0;
-	unsigned int imbn = 2;
+	unsigned int imbn = 1;
 	unsigned long scaled_busy_load_per_task;
 	struct sg_lb_stats *local, *busiest;
 
@@ -6925,13 +6925,13 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	if (!local->sum_nr_running)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)
-		imbn = 1;
+		imbn = 0;
 
 	scaled_busy_load_per_task =
 		(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
 		busiest->group_capacity;
 
-	if (busiest->avg_load + scaled_busy_load_per_task >=
+	if (busiest->avg_load >=
 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
 		env->imbalance = busiest->load_per_task;
 		return;
-- 
1.9.1


* [PATCH 5/7] sched/fair: Remove cpu_avg_load_per_task()
  2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
                   ` (3 preceding siblings ...)
  2016-04-29 19:32 ` [PATCH 4/7] sched/fair: Clean up the logic in fix_small_imbalance() Dietmar Eggemann
@ 2016-04-29 19:32 ` Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 6/7] sched/fair: Reorder code in update_sd_lb_stats() Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 7/7] sched/fair: Use group_cfs_rq(se) instead of se->my_q Dietmar Eggemann
  6 siblings, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

From: Morten Rasmussen <morten.rasmussen@arm.com>

cpu_avg_load_per_task() is called in situations where the local
sched_group currently has no runnable tasks according to the group
statistics (sum of rq->cfs.h_nr_running) to calculate the local
load_per_task based on the destination cpu average load_per_task. Since
the group has no tasks, neither does env->dst_cpu, as it is part of the
local sched_group. In this case the function will always return zero,
which is also the value of the local load_per_task.

The only case where cpu_avg_load_per_task() might make a difference is
if another cpu places a task on env->dst_cpu after the group statistics
were gathered but before this bit of code is reached during an
idle_balance().

This race doesn't seem to be taken into account elsewhere, so the
function call doesn't seem to have any effect and should be safe to
remove.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dc4828bbe50d..e68b9eb5cb97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4797,18 +4797,6 @@ static unsigned long capacity_orig_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity_orig;
 }
 
-static unsigned long cpu_avg_load_per_task(int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
-	unsigned long load_avg = weighted_cpuload(cpu);
-
-	if (nr_running)
-		return load_avg / nr_running;
-
-	return 0;
-}
-
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -6922,9 +6910,8 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (!local->sum_nr_running)
-		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
-	else if (busiest->load_per_task > local->load_per_task)
+	if (local->sum_nr_running &&
+	    busiest->load_per_task > local->load_per_task)
 		imbn = 0;
 
 	scaled_busy_load_per_task =
-- 
1.9.1


* [PATCH 6/7] sched/fair: Reorder code in update_sd_lb_stats()
  2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
                   ` (4 preceding siblings ...)
  2016-04-29 19:32 ` [PATCH 5/7] sched/fair: Remove cpu_avg_load_per_task() Dietmar Eggemann
@ 2016-04-29 19:32 ` Dietmar Eggemann
  2016-04-29 19:32 ` [PATCH 7/7] sched/fair: Use group_cfs_rq(se) instead of se->my_q Dietmar Eggemann
  6 siblings, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

Update the total load and total capacity of the sched_domain statistics
before checking whether the group is the local group. This, together
with moving sg = sg->next into the condition of the do...while loop,
makes the code easier to read.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
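
Not part of the patch: a self-contained sketch (hypothetical list, not
kernel code) of the resulting loop shape. The point is that 'continue'
in a do...while loop jumps to the condition, so the 'sg = sg->next, ...'
comma expression still advances the iterator and no next_group label is
needed:

  #include <stdio.h>

  struct node { int id; struct node *next; };

  int main(void)
  {
          struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
          struct node *head = &a, *n = head;

          c.next = head;                  /* circular, like env->sd->groups */

          do {
                  printf("group %d\n", n->id);
                  if (n->id == 2)
                          continue;       /* still advances via the while */
                  /* ... per-group work skipped for this group ... */
          } while (n = n->next, n != head);

          return 0;
  }
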
 kernel/sched/fair.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e68b9eb5cb97..58da724b6ca4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6801,8 +6801,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
 						&overload);
 
+		/* Update sd_lb_stats */
+		sds->total_load += sgs->group_load;
+		sds->total_capacity += sgs->group_capacity;
+
 		if (local_group)
-			goto next_group;
+			continue;
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -6826,13 +6830,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 			sds->busiest_stat = *sgs;
 		}
 
-next_group:
-		/* Now, start updating sd_lb_stats */
-		sds->total_load += sgs->group_load;
-		sds->total_capacity += sgs->group_capacity;
-
-		sg = sg->next;
-	} while (sg != env->sd->groups);
+	} while (sg = sg->next, sg != env->sd->groups);
 
 	if (env->sd->flags & SD_NUMA)
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
-- 
1.9.1


* [PATCH 7/7] sched/fair: Use group_cfs_rq(se) instead of se->my_q
  2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
                   ` (5 preceding siblings ...)
  2016-04-29 19:32 ` [PATCH 6/7] sched/fair: Reorder code in update_sd_lb_stats() Dietmar Eggemann
@ 2016-04-29 19:32 ` Dietmar Eggemann
  6 siblings, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2016-04-29 19:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

Replace all occurrences of se->my_q used as an rvalue with
group_cfs_rq(se) so that the cfs_rq owned by this se/tg is accessed
consistently.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
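
Not part of the patch: a self-contained sketch with simplified,
hypothetical struct definitions, just to illustrate that the accessor is
a thin wrapper around the my_q pointer, so the two spellings unified
here refer to the same cfs_rq:

  #include <assert.h>
  #include <stddef.h>

  struct cfs_rq { int nr_running; };

  struct sched_entity {
          struct cfs_rq *cfs_rq; /* rq the entity is queued on */
          struct cfs_rq *my_q;   /* rq "owned" by this group entity */
  };

  static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
  {
          return grp->my_q;
  }

  int main(void)
  {
          struct cfs_rq q = { 0 };
          struct sched_entity se = { .cfs_rq = NULL, .my_q = &q };

          /* one object, two spellings */
          assert(group_cfs_rq(&se) == se.my_q);
          return 0;
  }
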
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58da724b6ca4..63ed09154a41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4898,17 +4898,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 	for_each_sched_entity(se) {
 		long w, W;
 
-		tg = se->my_q->tg;
+		tg = group_cfs_rq(se)->tg;
 
 		/*
 		 * W = @wg + \Sum rw_j
 		 */
-		W = wg + calc_tg_weight(tg, se->my_q);
+		W = wg + calc_tg_weight(tg, group_cfs_rq(se));
 
 		/*
 		 * w = rw_i + @wl
 		 */
-		w = cfs_rq_load_avg(se->my_q) + wl;
+		w = cfs_rq_load_avg(group_cfs_rq(se)) + wl;
 
 		/*
 		 * wl = S * s'_i; see (2)
@@ -8528,7 +8528,7 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 		se->cfs_rq = &rq->cfs;
 		se->depth = 0;
 	} else {
-		se->cfs_rq = parent->my_q;
+		se->cfs_rq = group_cfs_rq(parent);
 		se->depth = parent->depth + 1;
 	}
 
-- 
1.9.1


* Re: [PATCH 4/7] sched/fair: Clean up the logic in fix_small_imbalance()
  2016-04-29 19:32 ` [PATCH 4/7] sched/fair: Clean up the logic in fix_small_imbalance() Dietmar Eggemann
@ 2016-05-03 10:12   ` Peter Zijlstra
  2016-05-03 16:53     ` Dietmar Eggemann
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-05-03 10:12 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: linux-kernel, Morten Rasmussen

On Fri, Apr 29, 2016 at 08:32:41PM +0100, Dietmar Eggemann wrote:
> Avoid the need to add scaled_busy_load_per_task on both sides of the if
> condition to determine whether imbalance has to be set to
> busiest->load_per_task or not.
> 
> The imbn variable was introduced with commit 2dd73a4f09be ("[PATCH]
> sched: implement smpnice") and the original if condition was
> 
>   if (max_load - this_load >= busiest_load_per_task * imbn)
> 
> which over time changed into the current version where
> scaled_busy_load_per_task is to be found on both sides of
> the if condition.

This appears to have started with:

  dd41f596cda0 ("sched: cfs core code")

which for unexplained reasons does:

-               if (max_load - this_load >= busiest_load_per_task * imbn) {
+               if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
+                                       busiest_load_per_task * imbn) {


And later patches (by me) change that FUZZ into a variable metric,
because a fixed fuzz like that didn't at all work for the small loads
that result from cgroup tasks.



Now fix_small_imbalance() always hurt my head; it originated in the
original sched_domain balancer from Nick which wasn't smpnice aware; and
lives on until today.

Its purpose is to determine if moving one task over is beneficial.
However over time -- and smpnice started this -- the idea of _one_ task
became quite muddled.

With the fine grained load accounting of today; does it even make sense
to ask this question? IOW. what does fix_small_imbalance() really gain
us -- other than a head-ache?


* Re: [PATCH 3/7] sched/fair: Correct unit of load_above_capacity
  2016-04-29 19:32 ` [PATCH 3/7] sched/fair: Correct unit of load_above_capacity Dietmar Eggemann
@ 2016-05-03 10:52   ` Peter Zijlstra
  2016-05-03 14:56     ` Morten Rasmussen
  2016-05-12 10:31   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-05-03 10:52 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: linux-kernel, Morten Rasmussen

On Fri, Apr 29, 2016 at 08:32:40PM +0100, Dietmar Eggemann wrote:
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> 
> In calculate_imbalance() load_above_capacity currently has the unit
> [load] while it is used as being [load/capacity]. Not only is it wrong it
> also makes it unlikely that load_above_capacity is ever used as the
> subsequent code picks the smaller of load_above_capacity and the avg_load
> difference between the local and the busiest sched_groups.
> 
> This patch ensures that load_above_capacity has the right unit
> [load/capacity].
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc19fee94f39..c066574cff04 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	    local->group_type   == group_overloaded) {
>  		load_above_capacity = busiest->sum_nr_running *
>  					SCHED_LOAD_SCALE;

Given smpnice and cgroups; this starting point seems somewhat random.
Then again, without cgroup and assuming all tasks are nice-0 it is still
true. But I think the cgroup muck is rather universal these days.

The above SCHED_LOAD_SCALE is the 1 -> load metric, right?

> -		if (load_above_capacity > busiest->group_capacity)
> +		if (load_above_capacity > busiest->group_capacity) {
>  			load_above_capacity -= busiest->group_capacity;
> -		else
> +			load_above_capacity *= SCHED_LOAD_SCALE;
> +			load_above_capacity /= busiest->group_capacity;

And this one does load->capacity

> +		} else
>  			load_above_capacity = ~0UL;
>  	}


Is there anything better we can do? I know there's a fair bit of cruft
in; but these assumptions that SCHED_LOAD_SCALE is one task, just bug
the hell out of me.

Would it make sense to make load_balance()->detach_tasks() ensure
busiest->sum_nr_running >= busiest->group_weight or so?


* Re: [PATCH 3/7] sched/fair: Correct unit of load_above_capacity
  2016-05-03 10:52   ` Peter Zijlstra
@ 2016-05-03 14:56     ` Morten Rasmussen
  0 siblings, 0 replies; 21+ messages in thread
From: Morten Rasmussen @ 2016-05-03 14:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dietmar Eggemann, linux-kernel

On Tue, May 03, 2016 at 12:52:33PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 29, 2016 at 08:32:40PM +0100, Dietmar Eggemann wrote:
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > 
> > In calculate_imbalance() load_above_capacity currently has the unit
> > [load] while it is used as being [load/capacity]. Not only is it wrong it
> > also makes it unlikely that load_above_capacity is ever used as the
> > subsequent code picks the smaller of load_above_capacity and the avg_load
> > difference between the local and the busiest sched_groups.
> > 
> > This patch ensures that load_above_capacity has the right unit
> > [load/capacity].
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bc19fee94f39..c066574cff04 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  	    local->group_type   == group_overloaded) {
> >  		load_above_capacity = busiest->sum_nr_running *
> >  					SCHED_LOAD_SCALE;
> 
> Given smpnice and cgroups; this starting point seems somewhat random.
> Then again, without cgroup and assuming all tasks are nice-0 it is still
> true. But I think the cgroup muck is rather universal these days.

It makes sense for non-grouped always-running nice-0 tasks, and not so
much for all other scenarios. We ignore smpnice, PELT, and cgroups :-(

> 
> The above SCHED_LOAD_SCALE is the 1 -> load metric, right?

Yes. For clarity I would probably have used NICE_0_LOAD instead. We are
establishing how much capacity sum_nr_running tasks would consume if they
are all non-grouped always-running nice-0 tasks.

> 
> > -		if (load_above_capacity > busiest->group_capacity)
> > +		if (load_above_capacity > busiest->group_capacity) {
> >  			load_above_capacity -= busiest->group_capacity;

If they would consume more than we have, we return how much more
otherwise LONG_MAX.

> > -		else
> > +			load_above_capacity *= SCHED_LOAD_SCALE;
> > +			load_above_capacity /= busiest->group_capacity;
> 
> And this one does load->capacity

It does load->[load/capacity]

The subsequent code does [load/capacity]->load:

	env->imbalance = min(
-->		max_pull * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;

> 
> > +		} else
> >  			load_above_capacity = ~0UL;
> >  	}
> 
> 
> Is there anything better we can do? I know there's a fair bit of cruft
> in; but these assumptions that SCHED_LOAD_SCALE is one task, just bug
> the hell out of me.
>
> Would it make sense to make load_balance()->detach_tasks() ensure
> busiest->sum_nr_running >= busiest->group_weight or so?

That would make a lot of sense, because that is essentially what the
code does for non-SMT systems. The story is a bit different for SMT
though since:

	busiest->group_capacity < busiest->group_weight * SCHED_LOAD_SCALE.

This could lead us to try pulling more tasks to vacate SMT hw-threads.
However, I'm not convinced if it has any effect as:

	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

is the _minimum_ of the actual load difference and the 'excess' load, so
load_above_capacity can't cause us to pull more load anyway. So I don't
see why it shouldn't be okay.

Something like the below should implement your proposal I think (only
boot tested on arm64):

---8<---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..37fcbec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5777,6 +5777,7 @@ struct lb_env {
 	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
+	long			imbalance_tasks;
 	/* The set of CPUs under consideration for load-balancing */
 	struct cpumask		*cpus;
 
@@ -6018,7 +6019,7 @@ static int detach_tasks(struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (env->imbalance <= 0)
+	if (env->imbalance <= 0 || env->imbalance_tasks == 0)
 		return 0;
 
 	while (!list_empty(tasks)) {
@@ -6072,9 +6073,9 @@ static int detach_tasks(struct lb_env *env)
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * weighted load.
+		 * weighted load or max number of tasks.
 		 */
-		if (env->imbalance <= 0)
+		if (env->imbalance <= 0 || env->imbalance_tasks <= detached)
 			break;
 
 		continue;
@@ -6873,12 +6874,14 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
+	unsigned long max_pull;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	env->imbalance_tasks = ~0UL;
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -6904,23 +6907,17 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
-		else
-			load_above_capacity = ~0UL;
+		if (busiest->sum_nr_running > busiest->group_weight)
+			env->imbalance_tasks =
+				busiest->sum_nr_running - busiest->group_weight;
 	}
 
 	/*
 	 * We're trying to get all the cpus to the average_load, so we don't
 	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded cpu below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group capacity
-	 * (so that we can implement power-savings policies etc). Thus we look
-	 * for the minimum possible imbalance.
+	 * reduce the max loaded cpu below the average load.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
+	max_pull = busiest->avg_load - sds->avg_load;
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
-- 
1.9.1


* Re: [PATCH 4/7] sched/fair: Clean up the logic in fix_small_imbalance()
  2016-05-03 10:12   ` Peter Zijlstra
@ 2016-05-03 16:53     ` Dietmar Eggemann
  0 siblings, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2016-05-03 16:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Morten Rasmussen

On 03/05/16 11:12, Peter Zijlstra wrote:
> On Fri, Apr 29, 2016 at 08:32:41PM +0100, Dietmar Eggemann wrote:
>> Avoid the need to add scaled_busy_load_per_task on both sides of the if
>> condition to determine whether imbalance has to be set to
>> busiest->load_per_task or not.
>>
>> The imbn variable was introduced with commit 2dd73a4f09be ("[PATCH]
>> sched: implement smpnice") and the original if condition was
>>
>>   if (max_load - this_load >= busiest_load_per_task * imbn)
>>
>> which over time changed into the current version where
>> scaled_busy_load_per_task is to be found on both sides of
>> the if condition.
> 
> This appears to have started with:
> 
>   dd41f596cda0 ("sched: cfs core code")
> 
> which for unexplained reasons does:
> 
> -               if (max_load - this_load >= busiest_load_per_task * imbn) {
> +               if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> +                                       busiest_load_per_task * imbn) {
> 
> 
> And later patches (by me) change that FUZZ into a variable metric,
> because a fixed fuzz like that didn't at all work for the small loads
> that result from cgroup tasks.
> 
> 
> 
> Now fix_small_imbalance() always hurt my head; it originated in the
> original sched_domain balancer from Nick which wasn't smpnice aware; and
> lives on until today.

I see, all this code is already in the history.git kernel.

> 
> Its purpose is to determine if moving one task over is beneficial.
> However over time -- and smpnice started this -- the idea of _one_ task
> became quite muddled.
> 
> With the fine grained load accounting of today; does it even make sense
> to ask this question? IOW. what does fix_small_imbalance() really gain
> us -- other than a head-ache?

So task priority breaks the assumption that 1 task is equivalent to
SCHED_LOAD_SCALE and so does fine grained load accounting.

fix_small_imbalance() is called twice from calculate_imbalance(). If we
got rid of it, I don't know whether we should bail out of load balancing
in case the avg load values don't align nicely (busiest > sd avg >
local) or just continue with load balancing.

In the second case, where the imbalance value is raised (to
busiest->load_per_task), we can probably just continue with load
balancing, hoping that there is a task on the src rq which fits the
smaller imbalance value.


* [tip:sched/core] sched/fair: Remove stale power aware scheduling comments
  2016-04-29 19:32 ` [PATCH 1/7] sched/fair: Remove remaining power aware scheduling comments Dietmar Eggemann
@ 2016-05-05  9:42   ` tip-bot for Dietmar Eggemann
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2016-05-05  9:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, dietmar.eggemann, efault, linux-kernel, mingo,
	peterz, tglx, morten.rasmussen

Commit-ID:  0a9b23ce46cd5d3a360fbefca8ffce441c55046e
Gitweb:     http://git.kernel.org/tip/0a9b23ce46cd5d3a360fbefca8ffce441c55046e
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Fri, 29 Apr 2016 20:32:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 09:41:09 +0200

sched/fair: Remove stale power aware scheduling comments

Commit 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling remnants
and dysfunctional knobs") deleted the power aware scheduling support.

This patch gets rid of the remaining power aware scheduling related
comments in the code as well.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-2-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a00c7c..537d71e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7027,9 +7027,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * We're trying to get all the cpus to the average_load, so we don't
 	 * want to push ourselves above the average load, nor do we wish to
 	 * reduce the max loaded cpu below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group capacity
-	 * (so that we can implement power-savings policies etc). Thus we look
-	 * for the minimum possible imbalance.
+	 * we also don't want to reduce the group load below the group
+	 * capacity. Thus we look for the minimum possible imbalance.
 	 */
 	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
 
@@ -7053,10 +7052,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
- * if there is an imbalance. If there isn't an imbalance, and
- * the user has opted for power-savings, it returns a group whose
- * CPUs can be put to idle by rebalancing those tasks elsewhere, if
- * such a group exists.
+ * if there is an imbalance.
  *
  * Also calculates the amount of weighted load which should be moved
  * to restore balance.
@@ -7064,9 +7060,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
  * @env: The load balancing environment.
  *
  * Return:	- The busiest group if imbalance exists.
- *		- If no imbalance and user has opted for power-savings balance,
- *		   return the least loaded group whose CPUs can be
- *		   put to idle by rebalancing its tasks onto our group.
  */
 static struct sched_group *find_busiest_group(struct lb_env *env)
 {


* [tip:sched/core] sched/fair: Fix comment in calculate_imbalance()
  2016-04-29 19:32 ` [PATCH 2/7] sched/fair: Fix comment in calculate_imbalance() Dietmar Eggemann
@ 2016-05-05  9:42   ` tip-bot for Dietmar Eggemann
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2016-05-05  9:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dietmar.eggemann, linux-kernel, morten.rasmussen, torvalds, hpa,
	tglx, peterz, efault, mingo

Commit-ID:  885e542ce827f5f102fe9628d63c6430c8b7ab2c
Gitweb:     http://git.kernel.org/tip/885e542ce827f5f102fe9628d63c6430c8b7ab2c
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Fri, 29 Apr 2016 20:32:39 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 09:41:10 +0200

sched/fair: Fix comment in calculate_imbalance()

The comment in calculate_imbalance() was introduced in commit:

 2dd73a4f09be ("[PATCH] sched: implement smpnice")

which described the logic as it was then, but a later commit:

  b18855500fc4 ("sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()")

.. complicated this logic some more so that the comment does not match anymore.

Update the comment to match the code.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-3-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 537d71e..51f7a4b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7000,9 +7000,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	}
 
 	/*
-	 * In the presence of smp nice balancing, certain scenarios can have
-	 * max load less than avg load(as we skip the groups at or below
-	 * its cpu_capacity, while calculating max_load..)
+	 * Avg load of busiest sg can be less and avg load of local sg can
+	 * be greater than avg load across all sgs of sd because avg load
+	 * factors in sg capacity and sgs with smaller group_type are
+	 * skipped when updating the busiest sg:
 	 */
 	if (busiest->avg_load <= sds->avg_load ||
 	    local->avg_load >= sds->avg_load) {


* [tip:sched/core] sched/fair: Correct unit of load_above_capacity
  2016-04-29 19:32 ` [PATCH 3/7] sched/fair: Correct unit of load_above_capacity Dietmar Eggemann
  2016-05-03 10:52   ` Peter Zijlstra
@ 2016-05-12 10:31   ` tip-bot for Morten Rasmussen
  2016-05-12 21:48     ` Yuyang Du
  1 sibling, 1 reply; 21+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-05-12 10:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: efault, peterz, torvalds, linux-kernel, dietmar.eggemann, mingo,
	hpa, morten.rasmussen, tglx

Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 12 May 2016 09:55:33 +0200

sched/fair: Correct unit of load_above_capacity

In calculate_imbalance() load_above_capacity currently has the unit
[capacity] while it is used as being [load/capacity]. Not only is it
wrong it also makes it unlikely that load_above_capacity is ever used
as the subsequent code picks the smaller of load_above_capacity and
the avg_load

This patch ensures that load_above_capacity has the right unit
[load/capacity].

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[ Changed changelog to note it was in capacity unit; +rebase. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2338105..218f8e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
+		if (load_above_capacity > busiest->group_capacity) {
 			load_above_capacity -= busiest->group_capacity;
-		else
+			load_above_capacity *= NICE_0_LOAD;
+			load_above_capacity /= busiest->group_capacity;
+		} else
 			load_above_capacity = ~0UL;
 	}
 


* Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity
  2016-05-12 10:31   ` [tip:sched/core] " tip-bot for Morten Rasmussen
@ 2016-05-12 21:48     ` Yuyang Du
  2016-05-13  8:22       ` Vincent Guittot
  0 siblings, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-05-12 21:48 UTC (permalink / raw)
  To: peterz, linux-kernel, efault, mingo, morten.rasmussen,
	dietmar.eggemann, vincent.guittot

On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Thu, 12 May 2016 09:55:33 +0200
> 
> sched/fair: Correct unit of load_above_capacity
> 
> In calculate_imbalance() load_above_capacity currently has the unit
> [capacity] while it is used as being [load/capacity]. Not only is it
> wrong it also makes it unlikely that load_above_capacity is ever used
> as the subsequent code picks the smaller of load_above_capacity and
> the avg_load
> 
> This patch ensures that load_above_capacity has the right unit
> [load/capacity].
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> [ Changed changelog to note it was in capacity unit; +rebase. ]
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/fair.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2338105..218f8e8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	if (busiest->group_type == group_overloaded &&
>  	    local->group_type   == group_overloaded) {
>  		load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
> -		if (load_above_capacity > busiest->group_capacity)
> +		if (load_above_capacity > busiest->group_capacity) {
>  			load_above_capacity -= busiest->group_capacity;
> -		else
> +			load_above_capacity *= NICE_0_LOAD;
> +			load_above_capacity /= busiest->group_capacity;
> +		} else
>  			load_above_capacity = ~0UL;
>  	}
  
Hi Morten,

I got the feeling this might be wrong, the NICE_0_LOAD should be scaled down.
But I hope I am wrong.

Vincent, could you take a look?


* Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity
  2016-05-12 21:48     ` Yuyang Du
@ 2016-05-13  8:22       ` Vincent Guittot
  2016-05-19 15:36         ` Morten Rasmussen
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2016-05-13  8:22 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, linux-kernel, Mike Galbraith, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann

On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
> On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
>> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
>> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> Committer:  Ingo Molnar <mingo@kernel.org>
>> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>>
>> sched/fair: Correct unit of load_above_capacity
>>
>> In calculate_imbalance() load_above_capacity currently has the unit
>> [capacity] while it is used as being [load/capacity]. Not only is it
>> wrong it also makes it unlikely that load_above_capacity is ever used
>> as the subsequent code picks the smaller of load_above_capacity and
>> the avg_load
>>
>> This patch ensures that load_above_capacity has the right unit
>> [load/capacity].

load_above_capacity has the unit [load] and is then compared with other load

>>
>> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-kernel@vger.kernel.org
>> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  kernel/sched/fair.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2338105..218f8e8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>       if (busiest->group_type == group_overloaded &&
>>           local->group_type   == group_overloaded) {
>>               load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> -             if (load_above_capacity > busiest->group_capacity)
>> +             if (load_above_capacity > busiest->group_capacity) {
>>                       load_above_capacity -= busiest->group_capacity;
>> -             else
>> +                     load_above_capacity *= NICE_0_LOAD;
>> +                     load_above_capacity /= busiest->group_capacity;

If I follow the sequence correctly,
load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
- busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
so
load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD

so the unit is [load]

Let's take the example of a group made of 2 cores with 2 threads per
core, where the capacity of each core is 1.5*SCHED_CAPACITY_SCALE, so
the capacity of the group is 3*SCHED_CAPACITY_SCALE.
If we have 6 tasks in this group:
load_above_capacity = 1*NICE_0_LOAD, which means we want to remove no
more than 1 task from the group and leave 5 tasks in the group, whereas
we don't expect to have more than 4 tasks as we have 4 CPUs, and
probably fewer because the compute capacity of each CPU is less than
the default one.

So I would have expected
load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
load_above_capacity = 3*NICE_0_LOAD, which means we want to remove no
more than 3 tasks and leave 3 tasks in the group.
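
For what it's worth, a standalone check of the two expressions above
with these numbers, illustrative only; it assumes NICE_0_LOAD ==
SCHED_CAPACITY_SCALE == 1024 for simplicity:

  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE    1024UL
  #define NICE_0_LOAD             1024UL

  int main(void)
  {
          unsigned long sum_nr_running = 6;
          unsigned long group_capacity = 3 * SCHED_CAPACITY_SCALE;

          /* as computed by the patch (the subtraction is positive here) */
          unsigned long patch = (sum_nr_running * SCHED_CAPACITY_SCALE - group_capacity)
                                  * NICE_0_LOAD / group_capacity;

          /* the alternative proposed above */
          unsigned long alt = sum_nr_running * NICE_0_LOAD
                                  - group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE;

          printf("patch: %lu NICE_0_LOAD\n", patch / NICE_0_LOAD); /* 1 */
          printf("alt:   %lu NICE_0_LOAD\n", alt / NICE_0_LOAD);   /* 3 */
          return 0;
  }

This prints 1 and 3 respectively, matching the 1*NICE_0_LOAD and
3*NICE_0_LOAD results above.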

Regards,
Vincent


>> +             } else
>>                       load_above_capacity = ~0UL;
>>       }
>
> Hi Morten,
>
> I got the feeling this might be wrong, the NICE_0_LOAD should be scaled down.
> But I hope I am wrong.
>
> Vincent, could you take a look?


* Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity
  2016-05-13  8:22       ` Vincent Guittot
@ 2016-05-19 15:36         ` Morten Rasmussen
  2016-05-20  8:17           ` Vincent Guittot
  2016-05-23 20:24           ` Yuyang Du
  0 siblings, 2 replies; 21+ messages in thread
From: Morten Rasmussen @ 2016-05-19 15:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Peter Zijlstra, linux-kernel, Mike Galbraith,
	Ingo Molnar, Dietmar Eggemann, Wanpeng Li

On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
> >> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
> >> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
> >> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
> >> Committer:  Ingo Molnar <mingo@kernel.org>
> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
> >>
> >> sched/fair: Correct unit of load_above_capacity
> >>
> >> In calculate_imbalance() load_above_capacity currently has the unit
> >> [capacity] while it is used as being [load/capacity]. Not only is it
> >> wrong it also makes it unlikely that load_above_capacity is ever used
> >> as the subsequent code picks the smaller of load_above_capacity and
> >> the avg_load
> >>
> >> This patch ensures that load_above_capacity has the right unit
> >> [load/capacity].
> 
> load_above_capacity has the unit [load] and is then compared with other load

I'm not sure if talk about the same code, I'm referring to:

max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

a bit further down. Here the first option has unit [load/capacity], and
the subsequent code multiplies the result with [capacity] to get back to
[load]:

/* How much load to actually move to equalise the imbalance */
env->imbalance = min(
        max_pull * busiest->group_capacity,
        (sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;

That lead me to the conclusion that load_above_capacity would have to be
'per capacity', i.e. [load/capacity], as well for the code to make
sense.

> 
> >>
> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >> Cc: Mike Galbraith <efault@gmx.de>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: linux-kernel@vger.kernel.org
> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> >> ---
> >>  kernel/sched/fair.c | 6 ++++--
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2338105..218f8e8 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >>       if (busiest->group_type == group_overloaded &&
> >>           local->group_type   == group_overloaded) {
> >>               load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
> >> -             if (load_above_capacity > busiest->group_capacity)
> >> +             if (load_above_capacity > busiest->group_capacity) {
> >>                       load_above_capacity -= busiest->group_capacity;
> >> -             else
> >> +                     load_above_capacity *= NICE_0_LOAD;
> >> +                     load_above_capacity /= busiest->group_capacity;
> 
> If I follow correctly the sequence,
> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
> so
> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
> 
> so the unit is [load]

I'm with you on the equation, but not on the unit, and I find it hard to
convince myself what the unit really is :-(

I think it is down to the fact that we are comparing [load] directly
with [capacity] here, basically saying [load] == [capacity]. Most other
places we scale [load] by [capacity], we don't compare the two or
otherwise imply that they are the same unit. Do we have any other
examples of this?

The name load_above_capacity suggests that it should have unit [load],
but we actually compute it by subtracting two values, where the latter
clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
sched/fair: Clean up scale confusion) changes the former to be
[capacity].

load_above_capacity is later multiplied by [capacity] to determine the
imbalance which must have unit [load]. So working our way backwards,
load_above_capacity must have unit [load/capacity]. However, if [load] =
[capacity] then what we really have is just group-normalized [load].

As said in my other reply, the code only really makes sense under
special circumstances where [load] == [capacity]. The easy, and possibly
best, way out of this mess would be to replace the code with something
like PeterZ's suggestion that I tried to implement in the patch included
in my other reply.

> Lets take the example of a group made of 2 cores with 2 threads per
> core and the capacity of each core is 1,5* SCHED_CAPACITY_SCALE so the
> capacity of the group is 3*SCHED_CAPACITY_SCALE.
> If we have 6 tasks on this group :
> load_above capacity = 1 *NICE_0_LOAD which means we want to remove no
> more than 1 tasks from the group and let 5 tasks in the group whereas
> we don't expect to have more than 4 tasks as we have 4 CPUs and
> probably less because the compute capacity of each CPU is less than
> the default one
> 
> So i would have expected
> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
> load_above capacity = 3*NICE_0_LOAD which means we want to remove no
> more than 3 tasks and let 3 tasks in the group

And this is exactly what you get with this patch :-) load_above_capacity
(through max_pull) is multiplied by the group capacity to compute the
actual amount of [load] to remove:

env->imbalance 	= load_above_capacity * busiest->group_capacity /
		 	SCHED_CAPACITY_SCALE

		= 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
			SCHED_CAPACITY_SCALE

		= 3*NICE_0_LOAD

I don't think we disagree on how it should work :-) Without the capacity
scaling in this patch you get: 

env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
			3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE

		= 9*SCHED_CAPACITY_SCALE

Coming back to Yuyang's question, I think it should be NICE_0_LOAD to
ensure that the resulting imbalance has the proper unit [load].

I'm happy to scrap this patch and replace the whole thing with something
else that makes more sense, but IMHO it is needed if the current mess
should make any sense at all.
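
Putting the whole example in one place, a standalone sketch (plain C with
the constants hard-coded to 1024 purely to keep the arithmetic readable;
this is not the kernel code itself):

	static unsigned long example_imbalance(void)
	{
		unsigned long sum_nr_running = 6;	 /* tasks in the group */
		unsigned long group_capacity = 3 * 1024; /* 2 cores at 1.5x capacity */
		unsigned long load_above_capacity = sum_nr_running * 1024;

		if (load_above_capacity > group_capacity) {
			load_above_capacity -= group_capacity;	/* 3*1024, [capacity] */
			load_above_capacity *= 1024;		/* NICE_0_LOAD */
			load_above_capacity /= group_capacity;	/* 1*NICE_0_LOAD */
		}

		/* max_pull * group_capacity / SCHED_CAPACITY_SCALE */
		return load_above_capacity * group_capacity / 1024; /* 3*NICE_0_LOAD */
	}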

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity
  2016-05-19 15:36         ` Morten Rasmussen
@ 2016-05-20  8:17           ` Vincent Guittot
  2016-05-23 20:24           ` Yuyang Du
  1 sibling, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-05-20  8:17 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Yuyang Du, Peter Zijlstra, linux-kernel, Mike Galbraith,
	Ingo Molnar, Dietmar Eggemann, Wanpeng Li

Hi Morten,

On 19 May 2016 at 17:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
>> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@intel.com> wrote:
>> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> >> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
>> >> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> >> Author:     Morten Rasmussen <morten.rasmussen@arm.com>
>> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> >> Committer:  Ingo Molnar <mingo@kernel.org>
>> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>> >>
>> >> sched/fair: Correct unit of load_above_capacity
>> >>
>> >> In calculate_imbalance() load_above_capacity currently has the unit
>> >> [capacity] while it is used as being [load/capacity]. Not only is it
>> >> wrong, it also makes it unlikely that load_above_capacity is ever used
>> >> as the subsequent code picks the smaller of load_above_capacity and
>> >> the avg_load.
>> >>
>> >> This patch ensures that load_above_capacity has the right unit
>> >> [load/capacity].
>>
>> load_above_capacity has the unit [load] and is then compared with other load
>
> I'm not sure if we're talking about the same code, I'm referring to:
>
> max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
>
> a bit further down. Here the first option has unit [load/capacity], and
> the subsequent code multiplies the result with [capacity] to get back to
> [load]:

My understanding is that it multiplies and divides by capacity,
so we select the minimum between
 max_pull * busiest->group_capacity / SCHED_CAPACITY_SCALE
and
 (sds->avg_load - local->avg_load) * local->group_capacity /
 SCHED_CAPACITY_SCALE

so the unit of the imbalance is the same as max_pull's, because we
multiply and divide by [capacity]. The imbalance's unit is [load], so
max_pull also has unit [load], and max_pull has the same unit as
load_above_capacity.

>
> /* How much load to actually move to equalise the imbalance */
> env->imbalance = min(
>         max_pull * busiest->group_capacity,
>         (sds->avg_load - local->avg_load) * local->group_capacity
> ) / SCHED_CAPACITY_SCALE;
>
> That led me to the conclusion that load_above_capacity would have to be
> 'per capacity', i.e. [load/capacity], as well for the code to make
> sense.
>
>>
>> >>
>> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> >> Cc: Mike Galbraith <efault@gmx.de>
>> >> Cc: Peter Zijlstra <peterz@infradead.org>
>> >> Cc: Thomas Gleixner <tglx@linutronix.de>
>> >> Cc: linux-kernel@vger.kernel.org
>> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
>> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> >> ---
>> >>  kernel/sched/fair.c | 6 ++++--
>> >>  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 2338105..218f8e8 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> >>       if (busiest->group_type == group_overloaded &&
>> >>           local->group_type   == group_overloaded) {
>> >>               load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> >> -             if (load_above_capacity > busiest->group_capacity)
>> >> +             if (load_above_capacity > busiest->group_capacity) {
>> >>                       load_above_capacity -= busiest->group_capacity;
>> >> -             else
>> >> +                     load_above_capacity *= NICE_0_LOAD;
>> >> +                     load_above_capacity /= busiest->group_capacity;
>>
>> If I follow the sequence correctly,
>> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
>> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
>> so
>> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
>> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>>
>> so the unit is [load]
>
> I'm with you for the equation, but not for the unit, and I find it hard
> to convince myself what the unit really is :-(

the sole NICE_0_LOAD in the equation above tends to confirm that it's
only [load]
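
Writing the units out term by term (a back-of-the-envelope check, not
kernel code):

  load_above_capacity
	= sum_nr_running * SCHED_CAPACITY_SCALE * NICE_0_LOAD / group_capacity
	  - NICE_0_LOAD
	= [1] * [capacity] * [load] / [capacity] - [load]
	= [load] - [load]
	= [load]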

>
> I think it is down to the fact that we are comparing [load] directly
> with [capacity] here, basically saying [load] == [capacity]. In most
> other places we scale [load] by [capacity]; we don't compare the two or
> otherwise imply that they are the same unit. Do we have any other
> examples of this?

IIUC, the goal of this part of calculate_imbalance() is that we assume
one task with NICE_0_LOAD should fit on one core with
SCHED_CAPACITY_SCALE capacity, and we want to ensure that we will not
make a CPU idle
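
A minimal sketch of that assumption (constants set to the usual 1024
fixed-point scale; the helper below is made up for illustration, it is
not a kernel function):

	#define EX_CAPACITY_SCALE	1024	/* one CPU of default capacity */
	#define EX_NICE_0_LOAD		1024	/* one nice-0 task of load */

	/* load we are happy to leave in the group: one nice-0 task per
	 * EX_CAPACITY_SCALE worth of capacity, so that no CPU goes idle */
	static unsigned long ex_load_that_fits(unsigned long group_capacity)
	{
		return group_capacity * EX_NICE_0_LOAD / EX_CAPACITY_SCALE;
	}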

>
> The name load_above_capacity suggests that it should have unit [load],
> but we actually compute it by subtracting two values, where the latter
> clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
> sched/fair: Clean up scale confusion) changes the former to be
> [capacity].

I have made a comment on this.
The original equation was
load_above_capacity -= busiest->group_capacity
which comes from the optimization of
load_above_capacity -= busiest->group_capacity * SCHED_LOAD_SCALE /
SCHED_CAPACITY_SCALE

The assumption behind the optimization was that SCHED_LOAD_SCALE ==
SCHED_CAPACITY_SCALE and SCHED_LOAD_SCALE == NICE_0_LOAD, but that's not
always true if we enable the extra precision for 64-bit systems
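
Spelled out, the two forms look roughly like this (a sketch, not a quote
of any particular kernel version):

	/* unoptimized: convert group_capacity from [capacity] to [load]
	 * before subtracting it from a [load] quantity */
	load_above_capacity -= busiest->group_capacity * SCHED_LOAD_SCALE /
			       SCHED_CAPACITY_SCALE;

	/* optimized form, only equivalent while SCHED_LOAD_SCALE ==
	 * SCHED_CAPACITY_SCALE */
	load_above_capacity -= busiest->group_capacity;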

>
> load_above_capacity is later multiplied by [capacity] to determine the
> imbalance which must have unit [load]. So working our way backwards,

it's multiplied by xxx->group_capacity and divided by SCHED_CAPACITY_SCALE;

> load_above_capacity must have unit [load/capacity]. However, if [load] =
> [capacity] then what we really have is just group-normalized [load].
>
> As said in my other reply, the code only really makes sense under the
> special circumstance where [load] == [capacity]. The easy, and possibly
> best, way out of this mess would be to replace the code with something
> like PeterZ's suggestion that I tried to implement in the patch included
> in my other reply.

I'm fine with replacing this part with something simpler

>
>> Let's take the example of a group made of 2 cores with 2 threads per
>> core and the capacity of each core is 1.5 * SCHED_CAPACITY_SCALE, so the
>> capacity of the group is 3*SCHED_CAPACITY_SCALE.
>> If we have 6 tasks in this group:
>> load_above_capacity = 1*NICE_0_LOAD which means we want to remove no
>> more than 1 task from the group and leave 5 tasks in the group, whereas
>> we don't expect to have more than 4 tasks as we have 4 CPUs and
>> probably less because the compute capacity of each CPU is less than
>> the default one
>>
>> So I would have expected
>> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
>> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
>> load_above_capacity = 3*NICE_0_LOAD which means we want to remove no
>> more than 3 tasks and leave 3 tasks in the group
>
> And this is exactly what you get with this patch :-) load_above_capacity
> (through max_pull) is multiplied by the group capacity to compute the
> actual amount of [load] to remove:

I forgot about the additional weighting of the load by the group's
capacity/SCHED_CAPACITY_SCALE

>
> env->imbalance  = load_above_capacity * busiest->group_capacity /
>                         SCHED_CAPACITY_SCALE
>
>                 = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
>                         SCHED_CAPACITY_SCALE
>
>                 = 3*NICE_0_LOAD
>
> I don't think we disagree on how it should work :-) Without the capacity
> scaling in this patch you get:
>
> env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
>                         3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE
>
>                 = 9*SCHED_CAPACITY_SCALE
>
> Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
> ensure that the resulting imbalance has the proper unit [load].

yes, I agree it should be NICE_0_LOAD

>
> I'm happy to scrap this patch and replace the whole thing with something
> else that makes more sense, but IMHO it is needed if the current mess
> should make any sense at all.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity
  2016-05-19 15:36         ` Morten Rasmussen
  2016-05-20  8:17           ` Vincent Guittot
@ 2016-05-23 20:24           ` Yuyang Du
  2016-05-30 22:35             ` Yuyang Du
  1 sibling, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-05-23 20:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Peter Zijlstra, linux-kernel, Mike Galbraith,
	Ingo Molnar, Dietmar Eggemann, Wanpeng Li

On Thu, May 19, 2016 at 04:36:38PM +0100, Morten Rasmussen wrote:
> 
> And this is exactly what you get with this patch :-) load_above_capacity
> (through max_pull) is multiplied by the group capacity to compute the
> actual amount of [load] to remove:
> 
> env->imbalance 	= load_above_capacity * busiest->group_capacity /
> 		 	SCHED_CAPACITY_SCALE
> 
> 		= 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
> 			SCHED_CAPACITY_SCALE
> 
> 		= 3*NICE_0_LOAD
> 
> I don't think we disagree on how it should work :-) Without the capacity
> scaling in this patch you get: 
> 
> env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
> 			3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE
> 
> 		= 9*SCHED_CAPACITY_SCALE
> 
> Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
> ensure that the resulting imbalance has the proper unit [load].

Sorry, I'm still confused. After this patch, the unit is indeed [load], i.e.
the same load as the weight visible to the user. However, you literally
compared it with sg_lb_stats's avg_load and load_per_task, which have the
unit of load_avg, which is scale_load_down(NICE_0_LOAD). Am I missing
something?
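
For reference, a rough sketch of the scales involved, assuming a 64-bit
kernel with the extra load resolution enabled (numbers are illustrative;
the real definitions live in kernel/sched/sched.h):

	/* user-visible nice-0 weight */
	unsigned long nice_0_weight = 1024;
	/* NICE_0_LOAD carries an extra fixed-point shift in this config */
	unsigned long nice_0_load = 1024UL << 10;		/* 1048576 */
	/* load_avg is tracked at the lower resolution, roughly
	 * scale_load_down(NICE_0_LOAD) */
	unsigned long nice_0_load_avg = nice_0_load >> 10;	/* 1024 again */

so, if that reading is right, an imbalance expressed in NICE_0_LOAD units
would be a factor of 1024 larger than the avg_load/load_per_task numbers
it is compared against.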

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity
  2016-05-23 20:24           ` Yuyang Du
@ 2016-05-30 22:35             ` Yuyang Du
  0 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-30 22:35 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Peter Zijlstra, linux-kernel, Mike Galbraith,
	Ingo Molnar, Dietmar Eggemann, Wanpeng Li

On Tue, May 24, 2016 at 04:24:01AM +0800, Yuyang Du wrote:
> On Thu, May 19, 2016 at 04:36:38PM +0100, Morten Rasmussen wrote:
> > 
> > And this is exactly what you get with this patch :-) load_above_capacity
> > (through max_pull) is multiplied by the group capacity to compute the
> > actual amount of [load] to remove:
> > 
> > env->imbalance 	= load_above_capacity * busiest->group_capacity /
> > 		 	SCHED_CAPACITY_SCALE
> > 
> > 		= 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
> > 			SCHED_CAPACITY_SCALE
> > 
> > 		= 3*NICE_0_LOAD
> > 
> > I don't think we disagree on how it should work :-) Without the capacity
> > scaling in this patch you get: 
> > 
> > env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
> > 			3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE
> > 
> > 		= 9*SCHED_CAPACITY_SCALE
> > 
> > Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
> > ensure that the resulting imbalance has the proper unit [load].
> 
> Sorry, I'm still confused. After this patch, the unit is indeed [load], i.e.
> the same load as the weight visible to the user. However, you literally
> compared it with sg_lb_stats's avg_load and load_per_task, which have the
> unit of load_avg, which is scale_load_down(NICE_0_LOAD). Am I missing
> something?

Hello?

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-05-31  6:32 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-29 19:32 [PATCH 0/7] sched/fair: fixes and cleanups Dietmar Eggemann
2016-04-29 19:32 ` [PATCH 1/7] sched/fair: Remove remaining power aware scheduling comments Dietmar Eggemann
2016-05-05  9:42   ` [tip:sched/core] sched/fair: Remove stale " tip-bot for Dietmar Eggemann
2016-04-29 19:32 ` [PATCH 2/7] sched/fair: Fix comment in calculate_imbalance() Dietmar Eggemann
2016-05-05  9:42   ` [tip:sched/core] " tip-bot for Dietmar Eggemann
2016-04-29 19:32 ` [PATCH 3/7] sched/fair: Correct unit of load_above_capacity Dietmar Eggemann
2016-05-03 10:52   ` Peter Zijlstra
2016-05-03 14:56     ` Morten Rasmussen
2016-05-12 10:31   ` [tip:sched/core] " tip-bot for Morten Rasmussen
2016-05-12 21:48     ` Yuyang Du
2016-05-13  8:22       ` Vincent Guittot
2016-05-19 15:36         ` Morten Rasmussen
2016-05-20  8:17           ` Vincent Guittot
2016-05-23 20:24           ` Yuyang Du
2016-05-30 22:35             ` Yuyang Du
2016-04-29 19:32 ` [PATCH 4/7] sched/fair: Clean up the logic in fix_small_imbalance() Dietmar Eggemann
2016-05-03 10:12   ` Peter Zijlstra
2016-05-03 16:53     ` Dietmar Eggemann
2016-04-29 19:32 ` [PATCH 5/7] sched/fair: Remove cpu_avg_load_per_task() Dietmar Eggemann
2016-04-29 19:32 ` [PATCH 6/7] sched/fair: Reorder code in update_sd_lb_stats() Dietmar Eggemann
2016-04-29 19:32 ` [PATCH 7/7] sched/fair: Use group_cfs_rq(se) instead of se->my_q Dietmar Eggemann
