linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/8] sched/fair: rework the CFS load balance
@ 2019-09-19  7:33 Vincent Guittot
  2019-09-19  7:33 ` [PATCH v3 01/10] sched/fair: clean up asym packing Vincent Guittot
                   ` (11 more replies)
  0 siblings, 12 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

Several wrong task placements have been reported with the current load
balance algorithm, but their fixes are not always straightforward and
end up using biased values to force migrations. A cleanup and rework
of the load balance will help handle such use cases and enable fine
grained tuning of the scheduler behavior for other cases.

Patch 1 has already been sent separately and only consolidates the asym
policy in one place, which helps the review of the changes in load_balance.

Patch 2 renames sum_nr_running to sum_h_nr_running in the stats.

Patch 3 removes meaningless imbalance computation to make review of
patch 4 easier.

Patch 4 reworks the load_balance algorithm and fixes some wrong task
placements while trying to stay conservative.

Patch 5 adds the sum of nr_running to monitor non-CFS tasks and takes that
into account when pulling tasks.

Patch 6 replaces runnable_load with load now that the signal is only used
when groups are overloaded.

Patch 7 improves the spread of tasks at the 1st scheduling level.

Patch 8 uses utilization instead of load in all steps of the misfit task
path.

Patch 9 replaces runnable_load_avg by load_avg in the wake up path.

Patch 10 optimizes find_idlest_group(), which was using both runnable_load
and load. This has not been squashed with the previous patch to ease the
review.

Some benchmark results based on 8 iterations of each test:
- small arm64 dual quad-core system

           tip/sched/core        w/ this patchset    improvement
schedpipe      54981 +/-0.36%        55459 +/-0.31%   (+0.97%)

hackbench
1 groups       0.906 +/-2.34%        0.906 +/-2.88%   (+0.06%)

- large arm64 2-node / 224-core system

           tip/sched/core        w/ this patchset    improvement
schedpipe     125323 +/-0.98%       125624 +/-0.71%   (+0.24%)

hackbench -l (256000/#grp) -g #grp
1 groups      15.360 +/-1.76%       14.206 +/-1.40%   (+8.69%)
4 groups       5.822 +/-1.02%        5.508 +/-6.45%   (+5.38%)
16 groups      3.103 +/-0.80%        3.244 +/-0.77%   (-4.52%)
32 groups      2.892 +/-1.23%        2.850 +/-1.81%   (+1.47%)
64 groups      2.825 +/-1.51%        2.725 +/-1.51%   (+3.54%)
128 groups     3.149 +/-8.46%        3.053 +/-13.15%  (+3.06%)
256 groups     3.511 +/-8.49%        3.019 +/-1.71%  (+14.03%)

dbench
1 groups     329.677 +/-0.46%      329.771 +/-0.11%   (+0.03%)
4 groups     931.499 +/-0.79%      947.118 +/-0.94%   (+1.68%)
16 groups   1924.210 +/-0.89%     1947.849 +/-0.76%   (+1.23%)
32 groups   2350.646 +/-5.75%     2351.549 +/-6.33%   (+0.04%)
64 groups   2201.524 +/-3.35%     2192.749 +/-5.84%   (-0.40%)
128 groups  2206.858 +/-2.50%     2376.265 +/-7.44%   (+7.68%)
256 groups  1263.520 +/-3.34%     1633.143 +/-13.02% (+29.25%)

tip/sched/core sha1:
  0413d7f33e60 ('sched/uclamp: Always use 'enum uclamp_id' for clamp_id values')

Changes since v2:
- fix typos and reorder code
- some minor code fixes
- optimize find_idlest_group()

Not covered in this patchset:
- update find_idlest_group() to be more aligned with load_balance(). I didn't
  want to delay this version because of this update, which is not ready yet
- better detection of the overloaded and fully busy states, especially for
  cases when nr_running > number of CPUs.

Vincent Guittot (10):
  sched/fair: clean up asym packing
  sched/fair: rename sum_nr_running to sum_h_nr_running
  sched/fair: remove meaningless imbalance calculation
  sched/fair: rework load_balance
  sched/fair: use rq->nr_running when balancing load
  sched/fair: use load instead of runnable load in load_balance
  sched/fair: evenly spread tasks when not overloaded
  sched/fair: use utilization to select misfit task
  sched/fair: use load instead of runnable load in wakeup path
  sched/fair: optimize find_idlest_group

 kernel/sched/fair.c | 805 +++++++++++++++++++++++++++-------------------------
 1 file changed, 417 insertions(+), 388 deletions(-)

-- 
2.7.4



* [PATCH v3 01/10] sched/fair: clean up asym packing
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-09-27 23:57   ` Rik van Riel
  2019-09-19  7:33 ` [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

Clean up asym packing to follow the default load balance behavior:
- classify the group by creating a group_asym_packing field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it.

We no longer need to test the same conditions twice to detect asym packing,
and the imbalance calculation is consolidated in calculate_imbalance().

There are no functional changes.
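
For review convenience, here is a condensed sketch of how the asym packing
decision now flows through the existing helpers (simplified from the diff
below, not a literal copy):

	/* update_sd_pick_busiest(): tag the candidate group */
	if (sgs->sum_nr_running &&
	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu))
		sgs->group_asym_packing = 1;

	/* find_busiest_group(): force the balance instead of bypassing it */
	if (busiest->group_asym_packing)
		goto force_balance;

	/* calculate_imbalance(): compute the imbalance in one place */
	if (busiest->group_asym_packing) {
		env->imbalance = busiest->group_load;
		return;
	}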

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 63 ++++++++++++++---------------------------------------
 1 file changed, 16 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2c..3175fea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7685,6 +7685,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	int group_no_capacity;
+	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -8139,9 +8140,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * ASYM_PACKING needs to move all the work to the highest
 	 * prority CPUs in the group, therefore mark all groups
 	 * of lower priority than ourself as busy.
+	 *
+	 * This is primarily intended to used at the sibling level.  Some
+	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
+	 * case of POWER7, it can move to lower SMT modes only when higher
+	 * threads are idle.  When in lower SMT modes, the threads will
+	 * perform better since they share less core resources.  Hence when we
+	 * have idle threads, we want them to be the higher ones.
 	 */
 	if (sgs->sum_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
+		sgs->group_asym_packing = 1;
 		if (!sds->busiest)
 			return true;
 
@@ -8283,51 +8292,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 }
 
 /**
- * check_asym_packing - Check to see if the group is packed into the
- *			sched domain.
- *
- * This is primarily intended to used at the sibling level.  Some
- * cores like POWER7 prefer to use lower numbered SMT threads.  In the
- * case of POWER7, it can move to lower SMT modes only when higher
- * threads are idle.  When in lower SMT modes, the threads will
- * perform better since they share less core resources.  Hence when we
- * have idle threads, we want them to be the higher ones.
- *
- * This packing function is run on idle threads.  It checks to see if
- * the busiest CPU in this domain (core in the P7 case) has a higher
- * CPU number than the packing function is being run on.  Here we are
- * assuming lower CPU number will be equivalent to lower a SMT thread
- * number.
- *
- * Return: 1 when packing is required and a task should be moved to
- * this CPU.  The amount of the imbalance is returned in env->imbalance.
- *
- * @env: The load balancing environment.
- * @sds: Statistics of the sched_domain which is to be packed
- */
-static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	int busiest_cpu;
-
-	if (!(env->sd->flags & SD_ASYM_PACKING))
-		return 0;
-
-	if (env->idle == CPU_NOT_IDLE)
-		return 0;
-
-	if (!sds->busiest)
-		return 0;
-
-	busiest_cpu = sds->busiest->asym_prefer_cpu;
-	if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
-		return 0;
-
-	env->imbalance = sds->busiest_stat.group_load;
-
-	return 1;
-}
-
-/**
  * fix_small_imbalance - Calculate the minor imbalance that exists
  *			amongst the groups of a sched_domain, during
  *			load balancing.
@@ -8411,6 +8375,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	if (busiest->group_asym_packing) {
+		env->imbalance = busiest->group_load;
+		return;
+	}
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -8515,8 +8484,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	busiest = &sds.busiest_stat;
 
 	/* ASYM feature bypasses nice load balance check */
-	if (check_asym_packing(env, &sds))
-		return sds.busiest;
+	if (busiest->group_asym_packing)
+		goto force_balance;
 
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || busiest->sum_nr_running == 0)
-- 
2.7.4



* [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
  2019-09-19  7:33 ` [PATCH v3 01/10] sched/fair: clean up asym packing Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-09-27 23:59   ` Rik van Riel
  2019-10-01 17:11   ` Valentin Schneider
  2019-09-19  7:33 ` [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation Vincent Guittot
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

Rename sum_nr_running to sum_h_nr_running because it effectively tracks
cfs->h_nr_running, so we can use sum_nr_running to track rq->nr_running
when needed.

There are no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3175fea..02ab6b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7680,7 +7680,7 @@ struct sg_lb_stats {
 	unsigned long load_per_task;
 	unsigned long group_capacity;
 	unsigned long group_util; /* Total utilization of the group */
-	unsigned int sum_nr_running; /* Nr tasks running in the group */
+	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
@@ -7725,7 +7725,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.total_capacity = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
-			.sum_nr_running = 0,
+			.sum_h_nr_running = 0,
 			.group_type = group_other,
 		},
 	};
@@ -7916,7 +7916,7 @@ static inline int sg_imbalanced(struct sched_group *group)
 static inline bool
 group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running < sgs->group_weight)
+	if (sgs->sum_h_nr_running < sgs->group_weight)
 		return true;
 
 	if ((sgs->group_capacity * 100) >
@@ -7937,7 +7937,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running <= sgs->group_weight)
+	if (sgs->sum_h_nr_running <= sgs->group_weight)
 		return false;
 
 	if ((sgs->group_capacity * 100) <
@@ -8029,7 +8029,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += cpu_runnable_load(rq);
 		sgs->group_util += cpu_util(i);
-		sgs->sum_nr_running += rq->cfs.h_nr_running;
+		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
 		if (nr_running > 1)
@@ -8059,8 +8059,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
-	if (sgs->sum_nr_running)
-		sgs->load_per_task = sgs->group_load / sgs->sum_nr_running;
+	if (sgs->sum_h_nr_running)
+		sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
 
 	sgs->group_weight = group->group_weight;
 
@@ -8117,7 +8117,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * capable CPUs may harm throughput. Maximize throughput,
 	 * power/energy consequences are not considered.
 	 */
-	if (sgs->sum_nr_running <= sgs->group_weight &&
+	if (sgs->sum_h_nr_running <= sgs->group_weight &&
 	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
@@ -8148,7 +8148,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * perform better since they share less core resources.  Hence when we
 	 * have idle threads, we want them to be the higher ones.
 	 */
-	if (sgs->sum_nr_running &&
+	if (sgs->sum_h_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
 		sgs->group_asym_packing = 1;
 		if (!sds->busiest)
@@ -8166,9 +8166,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 #ifdef CONFIG_NUMA_BALANCING
 static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running > sgs->nr_numa_running)
+	if (sgs->sum_h_nr_running > sgs->nr_numa_running)
 		return regular;
-	if (sgs->sum_nr_running > sgs->nr_preferred_running)
+	if (sgs->sum_h_nr_running > sgs->nr_preferred_running)
 		return remote;
 	return all;
 }
@@ -8243,7 +8243,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		 */
 		if (prefer_sibling && sds->local &&
 		    group_has_capacity(env, local) &&
-		    (sgs->sum_nr_running > local->sum_nr_running + 1)) {
+		    (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
 			sgs->group_no_capacity = 1;
 			sgs->group_type = group_classify(sg, sgs);
 		}
@@ -8255,7 +8255,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 next_group:
 		/* Now, start updating sd_lb_stats */
-		sds->total_running += sgs->sum_nr_running;
+		sds->total_running += sgs->sum_h_nr_running;
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
@@ -8309,7 +8309,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (!local->sum_nr_running)
+	if (!local->sum_h_nr_running)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)
 		imbn = 1;
@@ -8407,7 +8407,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
+		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
 		if (load_above_capacity > busiest->group_capacity) {
 			load_above_capacity -= busiest->group_capacity;
 			load_above_capacity *= scale_load_down(NICE_0_LOAD);
@@ -8488,7 +8488,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || busiest->sum_nr_running == 0)
+	if (!sds.busiest || busiest->sum_h_nr_running == 0)
 		goto out_balanced;
 
 	/* XXX broken for overlapping NUMA groups */
-- 
2.7.4



* [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
  2019-09-19  7:33 ` [PATCH v3 01/10] sched/fair: clean up asym packing Vincent Guittot
  2019-09-19  7:33 ` [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-09-28  0:05   ` Rik van Riel
  2019-10-01 17:12   ` Valentin Schneider
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

Clean up load_balance and remove meaningless calculations and fields before
adding the new algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 105 +---------------------------------------------------
 1 file changed, 1 insertion(+), 104 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02ab6b5..017aad0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5390,18 +5390,6 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
-static unsigned long cpu_avg_load_per_task(int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
-	unsigned long load_avg = cpu_runnable_load(rq);
-
-	if (nr_running)
-		return load_avg / nr_running;
-
-	return 0;
-}
-
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -7677,7 +7665,6 @@ static unsigned long task_h_load(struct task_struct *p)
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long load_per_task;
 	unsigned long group_capacity;
 	unsigned long group_util; /* Total utilization of the group */
 	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
@@ -8059,9 +8046,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
-	if (sgs->sum_h_nr_running)
-		sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
-
 	sgs->group_weight = group->group_weight;
 
 	sgs->group_no_capacity = group_is_overloaded(env, sgs);
@@ -8292,76 +8276,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 }
 
 /**
- * fix_small_imbalance - Calculate the minor imbalance that exists
- *			amongst the groups of a sched_domain, during
- *			load balancing.
- * @env: The load balancing environment.
- * @sds: Statistics of the sched_domain whose imbalance is to be calculated.
- */
-static inline
-void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	unsigned long tmp, capa_now = 0, capa_move = 0;
-	unsigned int imbn = 2;
-	unsigned long scaled_busy_load_per_task;
-	struct sg_lb_stats *local, *busiest;
-
-	local = &sds->local_stat;
-	busiest = &sds->busiest_stat;
-
-	if (!local->sum_h_nr_running)
-		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
-	else if (busiest->load_per_task > local->load_per_task)
-		imbn = 1;
-
-	scaled_busy_load_per_task =
-		(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
-		busiest->group_capacity;
-
-	if (busiest->avg_load + scaled_busy_load_per_task >=
-	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
-		env->imbalance = busiest->load_per_task;
-		return;
-	}
-
-	/*
-	 * OK, we don't have enough imbalance to justify moving tasks,
-	 * however we may be able to increase total CPU capacity used by
-	 * moving them.
-	 */
-
-	capa_now += busiest->group_capacity *
-			min(busiest->load_per_task, busiest->avg_load);
-	capa_now += local->group_capacity *
-			min(local->load_per_task, local->avg_load);
-	capa_now /= SCHED_CAPACITY_SCALE;
-
-	/* Amount of load we'd subtract */
-	if (busiest->avg_load > scaled_busy_load_per_task) {
-		capa_move += busiest->group_capacity *
-			    min(busiest->load_per_task,
-				busiest->avg_load - scaled_busy_load_per_task);
-	}
-
-	/* Amount of load we'd add */
-	if (busiest->avg_load * busiest->group_capacity <
-	    busiest->load_per_task * SCHED_CAPACITY_SCALE) {
-		tmp = (busiest->avg_load * busiest->group_capacity) /
-		      local->group_capacity;
-	} else {
-		tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) /
-		      local->group_capacity;
-	}
-	capa_move += local->group_capacity *
-		    min(local->load_per_task, local->avg_load + tmp);
-	capa_move /= SCHED_CAPACITY_SCALE;
-
-	/* Move if we gain throughput */
-	if (capa_move > capa_now)
-		env->imbalance = busiest->load_per_task;
-}
-
-/**
  * calculate_imbalance - Calculate the amount of imbalance present within the
  *			 groups of a given sched_domain during load balance.
  * @env: load balance environment
@@ -8380,15 +8294,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
-	if (busiest->group_type == group_imbalanced) {
-		/*
-		 * In the group_imb case we cannot rely on group-wide averages
-		 * to ensure CPU-load equilibrium, look at wider averages. XXX
-		 */
-		busiest->load_per_task =
-			min(busiest->load_per_task, sds->avg_load);
-	}
-
 	/*
 	 * Avg load of busiest sg can be less and avg load of local sg can
 	 * be greater than avg load across all sgs of sd because avg load
@@ -8399,7 +8304,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	    (busiest->avg_load <= sds->avg_load ||
 	     local->avg_load >= sds->avg_load)) {
 		env->imbalance = 0;
-		return fix_small_imbalance(env, sds);
+		return;
 	}
 
 	/*
@@ -8437,14 +8342,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 				       busiest->group_misfit_task_load);
 	}
 
-	/*
-	 * if *imbalance is less than the average load per runnable task
-	 * there is no guarantee that any tasks will be moved so we'll have
-	 * a think about bumping its value to force at least one task to be
-	 * moved
-	 */
-	if (env->imbalance < busiest->load_per_task)
-		return fix_small_imbalance(env, sds);
 }
 
 /******* find_busiest_group() helpers end here *********************/
-- 
2.7.4



* [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (2 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-09-30  1:12   ` Rik van Riel
                     ` (5 more replies)
  2019-09-19  7:33 ` [PATCH v3 05/10] sched/fair: use rq->nr_running when balancing load Vincent Guittot
                   ` (7 subsequent siblings)
  11 siblings, 6 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

The load_balance algorithm contains some heuristics which have become
meaningless since the rework of the scheduler's metrics, such as the
introduction of PELT.

Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios. For instance, in the presence of idle CPUs,
we should simply try to get at least one task per CPU, whereas the current
load-based algorithm can actually leave idle CPUs alone simply because the
load is somewhat balanced. The current algorithm ends up creating virtual
and meaningless values like avg_load_per_task, or tweaking the state of a
group to make it look overloaded when it is not, in order to try to migrate
tasks.

load_balance should better qualify the imbalance of the group and clearly
define what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of
imbalance. We now have :
	group_has_spare
	group_fully_busy
	group_misfit_task
	group_asym_capacity
	group_imbalanced
	group_overloaded

Based on the type of sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before, but also
some utilization, a number of tasks or a type of task:
	migrate_task
	migrate_util
	migrate_load
	migrate_misfit

This new load_balance algorithm fixes several pending wrong task
placements:
- the 1 task per CPU case on asymmetric systems
- the case of a CFS task preempted by another scheduling class
- the case of tasks not evenly spread across groups with spare capacity

Also the load balance decisions have been consolidated in the 3 functions
below after removing the few bypasses and hacks of the current code:
- update_sd_pick_busiest() selects the busiest sched_group.
- find_busiest_group() checks if there is an imbalance between the local and
  the busiest group.
- calculate_imbalance() decides what has to be moved.

Finally, the now unused field total_running of struct sd_lb_stats has been
removed.
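
As an illustration, here is a condensed sketch of the new decision flow
(pseudo-code only, summarizing the diff below; see the real
update_sd_pick_busiest() and calculate_imbalance() for the details):

	/*
	 * update_sd_pick_busiest(): classification first, metric second
	 *
	 *	if (sgs->group_type > busiest->group_type) -> new busiest
	 *	if (sgs->group_type < busiest->group_type) -> keep current
	 *	else tie-break per type (avg_load, misfit load, idle_cpus, ...)
	 *
	 * calculate_imbalance(): pick what to migrate and how much
	 *
	 *	group_misfit_task  -> migrate_misfit: busiest->group_misfit_task_load
	 *	group_asym_packing -> migrate_load:   busiest->group_load
	 *	group_imbalanced   -> migrate_task:   1 task
	 *	local has spare    -> migrate_util or migrate_task
	 *	both overloaded    -> migrate_load:   min load delta vs sd average
	 */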

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 380 insertions(+), 205 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 017aad0..d33379c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+/*
+ * group_type describes the group of CPUs at the moment of the load balance.
+ * The enum is ordered by pulling priority, with the group with lowest priority
+ * first so the groupe_type can be simply compared when selecting the busiest
+ * group. see update_sd_pick_busiest().
+ */
 enum group_type {
-	group_other = 0,
+	group_has_spare = 0,
+	group_fully_busy,
 	group_misfit_task,
+	group_asym_packing,
 	group_imbalanced,
-	group_overloaded,
+	group_overloaded
+};
+
+enum migration_type {
+	migrate_load = 0,
+	migrate_util,
+	migrate_task,
+	migrate_misfit
 };
 
 #define LBF_ALL_PINNED	0x01
@@ -7115,7 +7130,7 @@ struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
-	enum group_type		src_grp_type;
+	enum migration_type	balance_type;
 	struct list_head	tasks;
 };
 
@@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
-	unsigned long load;
+	unsigned long util, load;
 	int detached = 0;
 
 	lockdep_assert_held(&env->src_rq->lock);
@@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		load = task_h_load(p);
+		switch (env->balance_type) {
+		case migrate_load:
+			load = task_h_load(p);
 
-		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
-			goto next;
+			if (sched_feat(LB_MIN) &&
+			    load < 16 && !env->sd->nr_balance_failed)
+				goto next;
 
-		if ((load / 2) > env->imbalance)
-			goto next;
+			if ((load / 2) > env->imbalance)
+				goto next;
+
+			env->imbalance -= load;
+			break;
+
+		case migrate_util:
+			util = task_util_est(p);
+
+			if (util > env->imbalance)
+				goto next;
+
+			env->imbalance -= util;
+			break;
+
+		case migrate_task:
+			/* Migrate task */
+			env->imbalance--;
+			break;
+
+		case migrate_misfit:
+			load = task_h_load(p);
+
+			/*
+			 * utilization of misfit task might decrease a bit
+			 * since it has been recorded. Be conservative in the
+			 * condition.
+			 */
+			if (load < env->imbalance)
+				goto next;
+
+			env->imbalance = 0;
+			break;
+		}
 
 		detach_task(p, env);
 		list_add(&p->se.group_node, &env->tasks);
 
 		detached++;
-		env->imbalance -= load;
 
 #ifdef CONFIG_PREEMPT
 		/*
@@ -7406,7 +7455,7 @@ static int detach_tasks(struct lb_env *env)
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * runnable load.
+		 * load/util/tasks.
 		 */
 		if (env->imbalance <= 0)
 			break;
@@ -7671,7 +7720,6 @@ struct sg_lb_stats {
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_no_capacity;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
@@ -7687,10 +7735,10 @@ struct sg_lb_stats {
 struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
-	unsigned long total_running;
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
+	unsigned int prefer_sibling; /* tasks should go to sibling first */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
@@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 	*sds = (struct sd_lb_stats){
 		.busiest = NULL,
 		.local = NULL,
-		.total_running = 0UL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
 		.busiest_stat = {
-			.avg_load = 0UL,
-			.sum_h_nr_running = 0,
-			.group_type = group_other,
+			.idle_cpus = UINT_MAX,
+			.group_type = group_has_spare,
 		},
 	};
 }
@@ -7955,19 +8001,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 }
 
 static inline enum
-group_type group_classify(struct sched_group *group,
+group_type group_classify(struct lb_env *env,
+			  struct sched_group *group,
 			  struct sg_lb_stats *sgs)
 {
-	if (sgs->group_no_capacity)
+	if (group_is_overloaded(env, sgs))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
+	if (sgs->group_asym_packing)
+		return group_asym_packing;
+
 	if (sgs->group_misfit_task_load)
 		return group_misfit_task;
 
-	return group_other;
+	if (!group_has_capacity(env, sgs))
+		return group_fully_busy;
+
+	return group_has_spare;
 }
 
 static bool update_nohz_stats(struct rq *rq, bool force)
@@ -8004,10 +8057,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 				      struct sg_lb_stats *sgs,
 				      int *sg_status)
 {
-	int i, nr_running;
+	int i, nr_running, local_group;
 
 	memset(sgs, 0, sizeof(*sgs));
 
+	local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
@@ -8032,9 +8087,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i))
+		if (!nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
+			/* Idle cpu can't have misfit task */
+			continue;
+		}
+
+		if (local_group)
+			continue;
 
+		/* Check for a misfit task on the cpu */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
@@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
-	/* Adjust by relative CPU capacity of the group */
+	/* Check if dst cpu is idle and preferred to this group */
+	if (env->sd->flags & SD_ASYM_PACKING &&
+	    env->idle != CPU_NOT_IDLE &&
+	    sgs->sum_h_nr_running &&
+	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+		sgs->group_asym_packing = 1;
+	}
+
 	sgs->group_capacity = group->sgc->capacity;
-	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
 	sgs->group_weight = group->group_weight;
 
-	sgs->group_no_capacity = group_is_overloaded(env, sgs);
-	sgs->group_type = group_classify(group, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
+
+	/* Computing avg_load makes sense only when group is overloaded */
+	if (sgs->group_type == group_overloaded)
+		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
+				sgs->group_capacity;
 }
 
 /**
@@ -8072,6 +8144,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+	/* Make sure that there is at least one task to pull */
+	if (!sgs->sum_h_nr_running)
+		return false;
+
 	/*
 	 * Don't try to pull misfit tasks we can't help.
 	 * We can use max_capacity here as reduction in capacity on some
@@ -8080,7 +8156,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if (sgs->group_type == group_misfit_task &&
 	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
-	     !group_has_capacity(env, &sds->local_stat)))
+	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
 	if (sgs->group_type > busiest->group_type)
@@ -8089,62 +8165,80 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
-		return false;
-
-	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
-		goto asym_packing;
-
 	/*
-	 * Candidate sg has no more than one task per CPU and
-	 * has higher per-CPU capacity. Migrating tasks to less
-	 * capable CPUs may harm throughput. Maximize throughput,
-	 * power/energy consequences are not considered.
+	 * The candidate and the current busiest group are the same type of
+	 * group. Let check which one is the busiest according to the type.
 	 */
-	if (sgs->sum_h_nr_running <= sgs->group_weight &&
-	    group_smaller_min_cpu_capacity(sds->local, sg))
-		return false;
 
-	/*
-	 * If we have more than one misfit sg go with the biggest misfit.
-	 */
-	if (sgs->group_type == group_misfit_task &&
-	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+	switch (sgs->group_type) {
+	case group_overloaded:
+		/* Select the overloaded group with highest avg_load. */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_imbalanced:
+		/*
+		 * Select the 1st imbalanced group as we don't have any way to
+		 * choose one more than another.
+		 */
 		return false;
 
-asym_packing:
-	/* This is the busiest node in its class. */
-	if (!(env->sd->flags & SD_ASYM_PACKING))
-		return true;
+	case group_asym_packing:
+		/* Prefer to move from lowest priority CPU's work */
+		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
+			return false;
+		break;
 
-	/* No ASYM_PACKING if target CPU is already busy */
-	if (env->idle == CPU_NOT_IDLE)
-		return true;
-	/*
-	 * ASYM_PACKING needs to move all the work to the highest
-	 * prority CPUs in the group, therefore mark all groups
-	 * of lower priority than ourself as busy.
-	 *
-	 * This is primarily intended to used at the sibling level.  Some
-	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
-	 * case of POWER7, it can move to lower SMT modes only when higher
-	 * threads are idle.  When in lower SMT modes, the threads will
-	 * perform better since they share less core resources.  Hence when we
-	 * have idle threads, we want them to be the higher ones.
-	 */
-	if (sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
-		sgs->group_asym_packing = 1;
-		if (!sds->busiest)
-			return true;
+	case group_misfit_task:
+		/*
+		 * If we have more than one misfit sg go with the biggest
+		 * misfit.
+		 */
+		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+			return false;
+		break;
 
-		/* Prefer to move from lowest priority CPU's work */
-		if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
-				      sg->asym_prefer_cpu))
-			return true;
+	case group_fully_busy:
+		/*
+		 * Select the fully busy group with highest avg_load. In
+		 * theory, there is no need to pull task from such kind of
+		 * group because tasks have all compute capacity that they need
+		 * but we can still improve the overall throughput by reducing
+		 * contention when accessing shared HW resources.
+		 *
+		 * XXX for now avg_load is not computed and always 0 so we
+		 * select the 1st one.
+		 */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_has_spare:
+		/*
+		 * Select not overloaded group with lowest number of
+		 * idle cpus. We could also compare the spare capacity
+		 * which is more stable but it can end up that the
+		 * group has less spare capacity but finally more idle
+		 * cpus which means less opportunity to pull tasks.
+		 */
+		if (sgs->idle_cpus >= busiest->idle_cpus)
+			return false;
+		break;
 	}
 
-	return false;
+	/*
+	 * Candidate sg has no more than one task per CPU and has higher
+	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+	 * throughput. Maximize throughput, power/energy consequences are not
+	 * considered.
+	 */
+	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+	    (sgs->group_type <= group_fully_busy) &&
+	    (group_smaller_min_cpu_capacity(sds->local, sg)))
+		return false;
+
+	return true;
 }
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -8182,13 +8276,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  * @env: The load balancing environment.
  * @sds: variable to hold the statistics for this sched_domain.
  */
+
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 	int sg_status = 0;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -8215,22 +8309,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (local_group)
 			goto next_group;
 
-		/*
-		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity so that we'll try
-		 * and move all the excess tasks away. We lower the capacity
-		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks. The extra check prevents the case where
-		 * you always pull from the heaviest group when it is already
-		 * under-utilized (possible with a large weight task outweighs
-		 * the tasks on the system).
-		 */
-		if (prefer_sibling && sds->local &&
-		    group_has_capacity(env, local) &&
-		    (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
-			sgs->group_no_capacity = 1;
-			sgs->group_type = group_classify(sg, sgs);
-		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -8239,13 +8317,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 next_group:
 		/* Now, start updating sd_lb_stats */
-		sds->total_running += sgs->sum_h_nr_running;
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
+	/* Tag domain that child domain prefers tasks go to siblings first */
+	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+
 #ifdef CONFIG_NO_HZ_COMMON
 	if ((env->flags & LBF_NOHZ_AGAIN) &&
 	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
@@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (busiest->group_asym_packing) {
+	if (busiest->group_type == group_misfit_task) {
+		/* Set imbalance to allow misfit task to be balanced. */
+		env->balance_type = migrate_misfit;
+		env->imbalance = busiest->group_misfit_task_load;
+		return;
+	}
+
+	if (busiest->group_type == group_asym_packing) {
+		/*
+		 * In case of asym capacity, we will try to migrate all load to
+		 * the preferred CPU.
+		 */
+		env->balance_type = migrate_load;
 		env->imbalance = busiest->group_load;
 		return;
 	}
 
+	if (busiest->group_type == group_imbalanced) {
+		/*
+		 * In the group_imb case we cannot rely on group-wide averages
+		 * to ensure CPU-load equilibrium, try to move any task to fix
+		 * the imbalance. The next load balance will take care of
+		 * balancing back the system.
+		 */
+		env->balance_type = migrate_task;
+		env->imbalance = 1;
+		return;
+	}
+
 	/*
-	 * Avg load of busiest sg can be less and avg load of local sg can
-	 * be greater than avg load across all sgs of sd because avg load
-	 * factors in sg capacity and sgs with smaller group_type are
-	 * skipped when updating the busiest sg:
+	 * Try to use spare capacity of local group without overloading it or
+	 * emptying busiest
 	 */
-	if (busiest->group_type != group_misfit_task &&
-	    (busiest->avg_load <= sds->avg_load ||
-	     local->avg_load >= sds->avg_load)) {
-		env->imbalance = 0;
+	if (local->group_type == group_has_spare) {
+		if (busiest->group_type > group_fully_busy) {
+			/*
+			 * If busiest is overloaded, try to fill spare
+			 * capacity. This might end up creating spare capacity
+			 * in busiest or busiest still being overloaded but
+			 * there is no simple way to directly compute the
+			 * amount of load to migrate in order to balance the
+			 * system.
+			 */
+			env->balance_type = migrate_util;
+			env->imbalance = max(local->group_capacity, local->group_util) -
+				    local->group_util;
+			return;
+		}
+
+		if (busiest->group_weight == 1 || sds->prefer_sibling) {
+			/*
+			 * When prefer sibling, evenly spread running tasks on
+			 * groups.
+			 */
+			env->balance_type = migrate_task;
+			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+			return;
+		}
+
+		/*
+		 * If there is no overload, we just want to even the number of
+		 * idle cpus.
+		 */
+		env->balance_type = migrate_task;
+		env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
 		return;
 	}
 
 	/*
-	 * If there aren't any idle CPUs, avoid creating some.
+	 * Local is fully busy but have to take more load to relieve the
+	 * busiest group
 	 */
-	if (busiest->group_type == group_overloaded &&
-	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
-		if (load_above_capacity > busiest->group_capacity) {
-			load_above_capacity -= busiest->group_capacity;
-			load_above_capacity *= scale_load_down(NICE_0_LOAD);
-			load_above_capacity /= busiest->group_capacity;
-		} else
-			load_above_capacity = ~0UL;
+	if (local->group_type < group_overloaded) {
+		/*
+		 * Local will become overloaded so the avg_load metrics are
+		 * finally needed.
+		 */
+
+		local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
+				  local->group_capacity;
+
+		sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
+				sds->total_capacity;
 	}
 
 	/*
-	 * We're trying to get all the CPUs to the average_load, so we don't
-	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded CPU below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group
-	 * capacity. Thus we look for the minimum possible imbalance.
+	 * Both group are or will become overloaded and we're trying to get all
+	 * the CPUs to the average_load, so we don't want to push ourselves
+	 * above the average load, nor do we wish to reduce the max loaded CPU
+	 * below the average load. At the same time, we also don't want to
+	 * reduce the group load below the group capacity. Thus we look for
+	 * the minimum possible imbalance.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
-
-	/* How much load to actually move to equalise the imbalance */
+	env->balance_type = migrate_load;
 	env->imbalance = min(
-		max_pull * busiest->group_capacity,
+		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
-
-	/* Boost imbalance to allow misfit task to be balanced. */
-	if (busiest->group_type == group_misfit_task) {
-		env->imbalance = max_t(long, env->imbalance,
-				       busiest->group_misfit_task_load);
-	}
-
 }
 
 /******* find_busiest_group() helpers end here *********************/
 
+/*
+ * Decision matrix according to the local and busiest group state
+ *
+ * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
+ * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
+ * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
+ * misfit_task      force     N/A        N/A    N/A  force      force
+ * asym_capacity    force     force      N/A    N/A  force      force
+ * imbalanced       force     force      N/A    N/A  force      force
+ * overloaded       force     force      N/A    N/A  force      avg_load
+ *
+ * N/A :      Not Applicable because already filtered while updating
+ *            statistics.
+ * balanced : The system is balanced for these 2 groups.
+ * force :    Calculate the imbalance as load migration is probably needed.
+ * avg_load : Only if imbalance is significant enough.
+ * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
+ *            different in groups.
+ */
+
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
  * if there is an imbalance.
@@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
-	/* ASYM feature bypasses nice load balance check */
-	if (busiest->group_asym_packing)
-		goto force_balance;
-
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || busiest->sum_h_nr_running == 0)
+	if (!sds.busiest)
 		goto out_balanced;
 
-	/* XXX broken for overlapping NUMA groups */
-	sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
-						/ sds.total_capacity;
+	/* Misfit tasks should be dealt with regardless of the avg load */
+	if (busiest->group_type == group_misfit_task)
+		goto force_balance;
+
+	/* ASYM feature bypasses nice load balance check */
+	if (busiest->group_type == group_asym_packing)
+		goto force_balance;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -8401,55 +8545,64 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/*
-	 * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
-	 * capacities from resulting in underutilization due to avg_load.
-	 */
-	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
-	    busiest->group_no_capacity)
-		goto force_balance;
-
-	/* Misfit tasks should be dealt with regardless of the avg load */
-	if (busiest->group_type == group_misfit_task)
-		goto force_balance;
-
-	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (local->avg_load >= busiest->avg_load)
+	if (local->group_type > busiest->group_type)
 		goto out_balanced;
 
 	/*
-	 * Don't pull any tasks if this group is already above the domain
-	 * average load.
+	 * When groups are overloaded, use the avg_load to ensure fairness
+	 * between tasks.
 	 */
-	if (local->avg_load >= sds.avg_load)
-		goto out_balanced;
+	if (local->group_type == group_overloaded) {
+		/*
+		 * If the local group is more loaded than the selected
+		 * busiest group don't try and pull any tasks.
+		 */
+		if (local->avg_load >= busiest->avg_load)
+			goto out_balanced;
+
+		/* XXX broken for overlapping NUMA groups */
+		sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
+				sds.total_capacity;
 
-	if (env->idle == CPU_IDLE) {
 		/*
-		 * This CPU is idle. If the busiest group is not overloaded
-		 * and there is no imbalance between this and busiest group
-		 * wrt idle CPUs, it is balanced. The imbalance becomes
-		 * significant if the diff is greater than 1 otherwise we
-		 * might end up to just move the imbalance on another group
+		 * Don't pull any tasks if this group is already above the
+		 * domain average load.
 		 */
-		if ((busiest->group_type != group_overloaded) &&
-				(local->idle_cpus <= (busiest->idle_cpus + 1)))
+		if (local->avg_load >= sds.avg_load)
 			goto out_balanced;
-	} else {
+
 		/*
-		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
-		 * imbalance_pct to be conservative.
+		 * If the busiest group is more loaded, use imbalance_pct to be
+		 * conservative.
 		 */
 		if (100 * busiest->avg_load <=
 				env->sd->imbalance_pct * local->avg_load)
 			goto out_balanced;
 	}
 
+	/* Try to move all excess tasks to child's sibling domain */
+	if (sds.prefer_sibling && local->group_type == group_has_spare &&
+	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+		goto force_balance;
+
+	if (busiest->group_type != group_overloaded &&
+	     (env->idle == CPU_NOT_IDLE ||
+	      local->idle_cpus <= (busiest->idle_cpus + 1)))
+		/*
+		 * If the busiest group is not overloaded
+		 * and there is no imbalance between this and busiest group
+		 * wrt idle CPUs, it is balanced. The imbalance
+		 * becomes significant if the diff is greater than 1 otherwise
+		 * we might end up to just move the imbalance on another
+		 * group.
+		 */
+		goto out_balanced;
+
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
-	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
 	return env->imbalance ? sds.busiest : NULL;
 
@@ -8465,11 +8618,13 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long busiest_load = 0, busiest_capacity = 1;
+	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
+	unsigned int busiest_nr = 0;
 	int i;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
-		unsigned long capacity, load;
+		unsigned long capacity, load, util;
+		unsigned int nr_running;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -8497,20 +8652,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
-		/*
-		 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
-		 * seek the "biggest" misfit task.
-		 */
-		if (env->src_grp_type == group_misfit_task) {
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
-				busiest = rq;
-			}
-
-			continue;
-		}
-
 		capacity = capacity_of(i);
+		nr_running = rq->cfs.h_nr_running;
 
 		/*
 		 * For ASYM_CPUCAPACITY domains, don't pick a CPU that could
@@ -8520,35 +8663,67 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    capacity_of(env->dst_cpu) < capacity &&
-		    rq->nr_running == 1)
+		    nr_running == 1)
 			continue;
 
-		load = cpu_runnable_load(rq);
+		switch (env->balance_type) {
+		case migrate_load:
+			/*
+			 * When comparing with load imbalance, use cpu_runnable_load()
+			 * which is not scaled with the CPU capacity.
+			 */
+			load = cpu_runnable_load(rq);
 
-		/*
-		 * When comparing with imbalance, use cpu_runnable_load()
-		 * which is not scaled with the CPU capacity.
-		 */
+			if (nr_running == 1 && load > env->imbalance &&
+			    !check_cpu_capacity(rq, env->sd))
+				break;
 
-		if (rq->nr_running == 1 && load > env->imbalance &&
-		    !check_cpu_capacity(rq, env->sd))
-			continue;
+			/*
+			 * For the load comparisons with the other CPU's, consider
+			 * the cpu_runnable_load() scaled with the CPU capacity, so
+			 * that the load can be moved away from the CPU that is
+			 * potentially running at a lower capacity.
+			 *
+			 * Thus we're looking for max(load_i / capacity_i), crosswise
+			 * multiplication to rid ourselves of the division works out
+			 * to: load_i * capacity_j > load_j * capacity_i;  where j is
+			 * our previous maximum.
+			 */
+			if (load * busiest_capacity > busiest_load * capacity) {
+				busiest_load = load;
+				busiest_capacity = capacity;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_util:
+			util = cpu_util(cpu_of(rq));
+
+			if (busiest_util < util) {
+				busiest_util = util;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_task:
+			if (busiest_nr < nr_running) {
+				busiest_nr = nr_running;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_misfit:
+			/*
+			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+			 * seek the "biggest" misfit task.
+			 */
+			if (rq->misfit_task_load > busiest_load) {
+				busiest_load = rq->misfit_task_load;
+				busiest = rq;
+			}
+
+			break;
 
-		/*
-		 * For the load comparisons with the other CPU's, consider
-		 * the cpu_runnable_load() scaled with the CPU capacity, so
-		 * that the load can be moved away from the CPU that is
-		 * potentially running at a lower capacity.
-		 *
-		 * Thus we're looking for max(load_i / capacity_i), crosswise
-		 * multiplication to rid ourselves of the division works out
-		 * to: load_i * capacity_j > load_j * capacity_i;  where j is
-		 * our previous maximum.
-		 */
-		if (load * busiest_capacity > busiest_load * capacity) {
-			busiest_load = load;
-			busiest_capacity = capacity;
-			busiest = rq;
 		}
 	}
 
@@ -8594,7 +8769,7 @@ voluntary_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->src_grp_type == group_misfit_task)
+	if (env->balance_type == migrate_misfit)
 		return 1;
 
 	return 0;
-- 
2.7.4



* [PATCH v3 05/10] sched/fair: use rq->nr_running when balancing load
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (3 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-09-19  7:33 ` [PATCH v3 06/10] sched/fair: use load instead of runnable load in load_balance Vincent Guittot
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

The CFS load_balance only takes care of CFS tasks whereas CPUs can also be
used by other scheduling classes. Typically, a CFS task preempted by an RT
or deadline task will not get a chance to be pulled to another CPU because
load_balance doesn't take tasks from other classes into account.
Add the sum of nr_running to the statistics and use it to detect such a
situation.
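
A concrete example of the situation this addresses (illustrative numbers,
not taken from the patch):

	/*
	 * CPU0 runs one RT task and has one preempted CFS task queued,
	 * CPU1 is idle:
	 *
	 *	rq0->nr_running       == 2	(all classes)
	 *	rq0->cfs.h_nr_running == 1	(CFS only)
	 *
	 * Looking only at sum_h_nr_running, the group containing CPU0 can
	 * still be reported as having capacity, so the preempted CFS task is
	 * never pulled. With sum_nr_running, group_has_capacity() and
	 * group_is_overloaded() see the RT task as well, and load_balance
	 * gets a chance to migrate the CFS task to CPU1.
	 */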

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d33379c..7e74836 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7716,6 +7716,7 @@ struct sg_lb_stats {
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long group_capacity;
 	unsigned long group_util; /* Total utilization of the group */
+	unsigned int sum_nr_running; /* Nr of tasks running in the group */
 	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
@@ -7949,7 +7950,7 @@ static inline int sg_imbalanced(struct sched_group *group)
 static inline bool
 group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_h_nr_running < sgs->group_weight)
+	if (sgs->sum_nr_running < sgs->group_weight)
 		return true;
 
 	if ((sgs->group_capacity * 100) >
@@ -7970,7 +7971,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_h_nr_running <= sgs->group_weight)
+	if (sgs->sum_nr_running <= sgs->group_weight)
 		return false;
 
 	if ((sgs->group_capacity * 100) <
@@ -8074,6 +8075,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
+		sgs->sum_nr_running += nr_running;
+
 		if (nr_running > 1)
 			*sg_status |= SG_OVERLOAD;
 
@@ -8423,7 +8426,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			 * groups.
 			 */
 			env->balance_type = migrate_task;
-			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+			env->imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1;
 			return;
 		}
 
@@ -8585,7 +8588,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 
 	/* Try to move all excess tasks to child's sibling domain */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded &&
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v3 06/10] sched/fair: use load instead of runnable load in load_balance
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (4 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 05/10] sched/fair: use rq->nr_running when balancing load Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-09-19  7:33 ` [PATCH v3 07/10] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

runnable load has been introduced to take into account the case where
blocked load biases the load balance decision, which was selecting an
underutilized group with a huge blocked load whereas other groups were
overloaded.

The load is now only used when groups are overloaded. In this case,
it's worth being conservative and taking into account the sleeping
tasks that might wake up on the cpu.
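
As an illustration, a small userspace sketch (not kernel code) of the
difference between the two signals: load_avg keeps the contribution of
blocked (sleeping) tasks, runnable_load_avg does not, which is why the
former is the more conservative choice once groups are overloaded. The
struct and the numbers below are made up for the example.

#include <stdio.h>

struct task { unsigned long load; int runnable; };

static unsigned long runnable_load(const struct task *t, int nr)
{
	unsigned long sum = 0;

	for (int i = 0; i < nr; i++)
		if (t[i].runnable)	/* only tasks on the runqueue */
			sum += t[i].load;
	return sum;
}

static unsigned long load(const struct task *t, int nr)
{
	unsigned long sum = 0;

	for (int i = 0; i < nr; i++)	/* blocked tasks still contribute */
		sum += t[i].load;
	return sum;
}

int main(void)
{
	/* two running tasks plus one sleeping task that may wake up here */
	struct task rq[] = { { 512, 1 }, { 512, 1 }, { 1024, 0 } };

	/* prints runnable_load=1024 load=2048 */
	printf("runnable_load=%lu load=%lu\n",
	       runnable_load(rq, 3), load(rq, 3));
	return 0;
}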

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e74836..15ec38c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5385,6 +5385,11 @@ static unsigned long cpu_runnable_load(struct rq *rq)
 	return cfs_rq_runnable_load_avg(&rq->cfs);
 }
 
+static unsigned long cpu_load(struct rq *rq)
+{
+	return cfs_rq_load_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -8070,7 +8075,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
 			env->flags |= LBF_NOHZ_AGAIN;
 
-		sgs->group_load += cpu_runnable_load(rq);
+		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
@@ -8512,7 +8517,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	init_sd_lb_stats(&sds);
 
 	/*
-	 * Compute the various statistics relavent for load balancing at
+	 * Compute the various statistics relevant for load balancing at
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
@@ -8672,10 +8677,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		switch (env->balance_type) {
 		case migrate_load:
 			/*
-			 * When comparing with load imbalance, use cpu_runnable_load()
+			 * When comparing with load imbalance, use cpu_load()
 			 * which is not scaled with the CPU capacity.
 			 */
-			load = cpu_runnable_load(rq);
+			load = cpu_load(rq);
 
 			if (nr_running == 1 && load > env->imbalance &&
 			    !check_cpu_capacity(rq, env->sd))
@@ -8683,7 +8688,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
 			/*
 			 * For the load comparisons with the other CPU's, consider
-			 * the cpu_runnable_load() scaled with the CPU capacity, so
+			 * the cpu_load() scaled with the CPU capacity, so
 			 * that the load can be moved away from the CPU that is
 			 * potentially running at a lower capacity.
 			 *
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v3 07/10] sched/fair: evenly spread tasks when not overloaded
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (5 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 06/10] sched/fair: use load instead of runnable load in load_balance Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-09-19  7:33 ` [PATCH v3 08/10] sched/fair: use utilization to select misfit task Vincent Guittot
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

When there is only 1 CPU per group, using the number of idle CPUs to evenly
spread tasks doesn't make sense and nr_running is a better metric.
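
As an illustration, a userspace sketch (not kernel code) of the refined
"balanced vs imbalanced" decision for the case where the busiest group is
not overloaded; the helper name and the example numbers are made up:

#include <stdbool.h>
#include <stdio.h>

struct grp { unsigned int weight, idle_cpus, h_nr_running; };

static bool balanced(const struct grp *local, const struct grp *busiest,
		     bool dst_cpu_idle)
{
	if (!dst_cpu_idle)
		return true;		/* let an idle CPU pull instead */

	if (busiest->weight > 1 &&
	    local->idle_cpus <= busiest->idle_cpus + 1)
		return true;		/* idle CPUs are already even */

	if (busiest->h_nr_running == 1)
		return true;		/* nothing is waiting to run */

	return false;
}

int main(void)
{
	/* 1 CPU per group: local is idle, busiest runs 2 tasks */
	struct grp local = { 1, 1, 0 }, busiest = { 1, 0, 2 };

	/* prints 0: with 1 CPU per group we now pull the waiting task */
	printf("balanced: %d\n", balanced(&local, &busiest, true));
	return 0;
}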

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15ec38c..a7c8ee6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,18 +8596,34 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;
 
-	if (busiest->group_type != group_overloaded &&
-	     (env->idle == CPU_NOT_IDLE ||
-	      local->idle_cpus <= (busiest->idle_cpus + 1)))
-		/*
-		 * If the busiest group is not overloaded
-		 * and there is no imbalance between this and busiest group
-		 * wrt idle CPUs, it is balanced. The imbalance
-		 * becomes significant if the diff is greater than 1 otherwise
-		 * we might end up to just move the imbalance on another
-		 * group.
-		 */
-		goto out_balanced;
+	if (busiest->group_type != group_overloaded) {
+		if (env->idle == CPU_NOT_IDLE)
+			/*
+			 * If the busiest group is not overloaded (and as a
+			 * result the local one too) but this cpu is already
+			 * busy, let another idle cpu try to pull task.
+			 */
+			goto out_balanced;
+
+		if (busiest->group_weight > 1 &&
+		    local->idle_cpus <= (busiest->idle_cpus + 1))
+			/*
+			 * If the busiest group is not overloaded
+			 * and there is no imbalance between this and busiest
+			 * group wrt idle CPUs, it is balanced. The imbalance
+			 * becomes significant if the diff is greater than 1
+			 * otherwise we might end up to just move the imbalance
+			 * on another group. Of course this applies only if
+			 * there is more than 1 CPU per group.
+			 */
+			goto out_balanced;
+
+		if (busiest->sum_h_nr_running == 1)
+			/*
+			 * busiest doesn't have any tasks waiting to run
+			 */
+			goto out_balanced;
+	}
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v3 08/10] sched/fair: use utilization to select misfit task
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (6 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 07/10] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-10-01 17:12   ` Valentin Schneider
  2019-09-19  7:33 ` [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path Vincent Guittot
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

Utilization is used to detect a misfit task but the load is then used to
select the task on the CPU, which can lead to selecting a small task with
a high weight instead of the task that triggered the misfit migration.
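
As an illustration, a userspace sketch (not kernel code) of why picking by
utilization matters: the 1280/1024 margin (roughly "utilization must stay
below 80% of the CPU capacity") and all the numbers below are assumptions
of this example, not values taken from this patch.

#include <stdio.h>

struct task { const char *name; unsigned long util, load; };

static int fits_capacity(unsigned long util, unsigned long capacity)
{
	/* assumed ~20% headroom margin */
	return util * 1280 < capacity * 1024;
}

int main(void)
{
	unsigned long little_capacity = 446;	/* example little CPU */
	struct task tasks[] = {
		/* high weight but low utilization: not a misfit */
		{ "small-but-niced", 100, 2048 },
		/* exceeds the little CPU: the real misfit task */
		{ "misfit",          600, 1024 },
	};

	for (int i = 0; i < 2; i++)
		printf("%s fits: %d\n", tasks[i].name,
		       fits_capacity(tasks[i].util, little_capacity));
	return 0;
}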

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7c8ee6..acca869 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7429,14 +7429,8 @@ static int detach_tasks(struct lb_env *env)
 			break;
 
 		case migrate_misfit:
-			load = task_h_load(p);
-
-			/*
-			 * utilization of misfit task might decrease a bit
-			 * since it has been recorded. Be conservative in the
-			 * condition.
-			 */
-			if (load < env->imbalance)
+			/* This is not a misfit task */
+			if (task_fits_capacity(p, capacity_of(env->src_cpu)))
 				goto next;
 
 			env->imbalance = 0;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (7 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 08/10] sched/fair: use utilization to select misfit task Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-10-07 15:14   ` Rik van Riel
  2019-09-19  7:33 ` [PATCH v3 10/10] sched/fair: optimize find_idlest_group Vincent Guittot
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

runnable load has been introduced to take into account the case where
blocked load biases the wake up path, which may end up selecting an
overloaded CPU with a large number of runnable tasks instead of an
underutilized CPU with a huge blocked load.

The wake up path now starts by looking for idle CPUs before comparing
runnable load, so it's worth aligning the wake up path with load_balance.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acca869..39a37ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5485,7 +5485,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
-	this_eff_load = cpu_runnable_load(cpu_rq(this_cpu));
+	this_eff_load = cpu_load(cpu_rq(this_cpu));
 
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
@@ -5503,7 +5503,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		this_eff_load *= 100;
 	this_eff_load *= capacity_of(prev_cpu);
 
-	prev_eff_load = cpu_runnable_load(cpu_rq(prev_cpu));
+	prev_eff_load = cpu_load(cpu_rq(prev_cpu));
 	prev_eff_load -= task_load;
 	if (sched_feat(WA_BIAS))
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
@@ -5591,7 +5591,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_span(group)) {
-			load = cpu_runnable_load(cpu_rq(i));
+			load = cpu_load(cpu_rq(i));
 			runnable_load += load;
 
 			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
@@ -5732,7 +5732,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 				continue;
 			}
 
-			load = cpu_runnable_load(cpu_rq(i));
+			load = cpu_load(cpu_rq(i));
 			if (load < min_load) {
 				min_load = load;
 				least_loaded_cpu = i;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v3 10/10] sched/fair: optimize find_idlest_group
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (8 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path Vincent Guittot
@ 2019-09-19  7:33 ` Vincent Guittot
  2019-10-08 14:32 ` [PATCH v3 0/8] sched/fair: rework the CFS load balance Phil Auld
  2019-10-16  7:21 ` Parth Shah
  11 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-19  7:33 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, Vincent Guittot

find_idlest_group() currently reads each CPU's load_avg in 2 different ways.
Consolidate the function to read and use load_avg only once and simplify
the algorithm to only look for the group with the lowest load_avg.
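
As an illustration, a userspace sketch (not kernel code) of the simplified
selection: read the load once per CPU, scale it by the group capacity, keep
the lowest, and only leave the local group when the best remote group beats
it by more than the imbalance margin. Structures and numbers are made up;
the 128 margin stands in for the NICE_0_LOAD based value used in the code.

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

struct group {
	unsigned long loads[4];		/* per-CPU load */
	int nr;				/* number of CPUs */
	unsigned long capacity;
	int local;
};

static unsigned long group_load(const struct group *g)
{
	unsigned long load = 0;

	for (int i = 0; i < g->nr; i++)
		load += g->loads[i];

	/* adjust by relative CPU capacity of the group */
	return load * SCHED_CAPACITY_SCALE / g->capacity;
}

int main(void)
{
	struct group groups[] = {
		{ { 300, 200 }, 2, 2048, 1 },	/* local group  */
		{ { 100, 150 }, 2, 2048, 0 },	/* remote group */
	};
	unsigned long imbalance = 128;
	unsigned long this_load = 0, min_load = ~0UL;
	int idlest = -1;

	for (int i = 0; i < 2; i++) {
		unsigned long load = group_load(&groups[i]);

		if (groups[i].local)
			this_load = load;
		else if (load < min_load) {
			min_load = load;
			idlest = i;
		}
	}

	/* remote is not better by more than the margin: stay local */
	if (min_load + imbalance >= this_load)
		idlest = -1;

	printf("idlest group: %d\n", idlest);
	return 0;
}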

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 52 +++++++++++-----------------------------------------
 1 file changed, 11 insertions(+), 41 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39a37ae..1fac444 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5560,16 +5560,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	struct sched_group *most_spare_sg = NULL;
-	unsigned long min_runnable_load = ULONG_MAX;
-	unsigned long this_runnable_load = ULONG_MAX;
-	unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
+	unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX;
 	unsigned long most_spare = 0, this_spare = 0;
 	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
 	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
 				(sd->imbalance_pct-100) / 100;
 
 	do {
-		unsigned long load, avg_load, runnable_load;
+		unsigned long load;
 		unsigned long spare_cap, max_spare_cap;
 		int local_group;
 		int i;
@@ -5586,15 +5584,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		 * Tally up the load of all CPUs in the group and find
 		 * the group containing the CPU with most spare capacity.
 		 */
-		avg_load = 0;
-		runnable_load = 0;
+		load = 0;
 		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_span(group)) {
-			load = cpu_load(cpu_rq(i));
-			runnable_load += load;
-
-			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
+			load += cpu_load(cpu_rq(i));
 
 			spare_cap = capacity_spare_without(i, p);
 
@@ -5603,31 +5597,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU capacity of the group */
-		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
-					group->sgc->capacity;
-		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+		load = (load * SCHED_CAPACITY_SCALE) /
 					group->sgc->capacity;
 
 		if (local_group) {
-			this_runnable_load = runnable_load;
-			this_avg_load = avg_load;
+			this_load = load;
 			this_spare = max_spare_cap;
 		} else {
-			if (min_runnable_load > (runnable_load + imbalance)) {
-				/*
-				 * The runnable load is significantly smaller
-				 * so we can pick this new CPU:
-				 */
-				min_runnable_load = runnable_load;
-				min_avg_load = avg_load;
-				idlest = group;
-			} else if ((runnable_load < (min_runnable_load + imbalance)) &&
-				   (100*min_avg_load > imbalance_scale*avg_load)) {
-				/*
-				 * The runnable loads are close so take the
-				 * blocked load into account through avg_load:
-				 */
-				min_avg_load = avg_load;
+			if (load < min_load) {
+				min_load = load;
 				idlest = group;
 			}
 
@@ -5668,18 +5646,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	 * local domain to be very lightly loaded relative to the remote
 	 * domains but "imbalance" skews the comparison making remote CPUs
 	 * look much more favourable. When considering cross-domain, add
-	 * imbalance to the runnable load on the remote node and consider
-	 * staying local.
+	 * imbalance to the load on the remote node and consider staying
+	 * local.
 	 */
-	if ((sd->flags & SD_NUMA) &&
-	    min_runnable_load + imbalance >= this_runnable_load)
-		return NULL;
-
-	if (min_runnable_load > (this_runnable_load + imbalance))
-		return NULL;
-
-	if ((this_runnable_load < (min_runnable_load + imbalance)) &&
-	     (100*this_avg_load < imbalance_scale*min_avg_load))
+	if (min_load + imbalance >= this_load)
 		return NULL;
 
 	return idlest;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 01/10] sched/fair: clean up asym packing
  2019-09-19  7:33 ` [PATCH v3 01/10] sched/fair: clean up asym packing Vincent Guittot
@ 2019-09-27 23:57   ` Rik van Riel
  0 siblings, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2019-09-27 23:57 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton

On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> Clean up asym packing to follow the default load balance behavior:
> - classify the group by creating a group_asym_packing field.
> - calculate the imbalance in calculate_imbalance() instead of
> bypassing it.
> 
> We don't need to test twice same conditions anymore to detect asym
> packing
> and we consolidate the calculation of imbalance in
> calculate_imbalance().
> 
> There is no functional changes.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running
  2019-09-19  7:33 ` [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
@ 2019-09-27 23:59   ` Rik van Riel
  2019-10-01 17:11   ` Valentin Schneider
  1 sibling, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2019-09-27 23:59 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton

On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> Rename sum_nr_running to sum_h_nr_running because it effectively
> tracks
> cfs->h_nr_running so we can use sum_nr_running to track rq-
> >nr_running
> when needed.
> 
> There is no functional changes.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
Acked-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation
  2019-09-19  7:33 ` [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation Vincent Guittot
@ 2019-09-28  0:05   ` Rik van Riel
  2019-10-01 17:12   ` Valentin Schneider
  1 sibling, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2019-09-28  0:05 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton

On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> clean up load_balance and remove meaningless calculation and fields
> before
> adding new algorithm.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Yay.

Acked-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
@ 2019-09-30  1:12   ` Rik van Riel
  2019-09-30  7:44     ` Vincent Guittot
  2019-09-30 16:24   ` Dietmar Eggemann
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 57+ messages in thread
From: Rik van Riel @ 2019-09-30  1:12 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton

On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> 
> Also the load balance decisions have been consolidated in the 3
> functions
> below after removing the few bypasses and hacks of the current code:
> - update_sd_pick_busiest() select the busiest sched_group.
> - find_busiest_group() checks if there is an imbalance between local
> and
>   busiest group.
> - calculate_imbalance() decides what have to be moved.

I really like the direction this series is going.

However, I suppose I should run these patches for
a few days with some of our test workloads before
I send out an ack for this patch :)

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-30  1:12   ` Rik van Riel
@ 2019-09-30  7:44     ` Vincent Guittot
  0 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-09-30  7:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Hillf Danton

On Mon, 30 Sep 2019 at 03:13, Rik van Riel <riel@surriel.com> wrote:
>
> On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> >
> > Also the load balance decisions have been consolidated in the 3
> > functions
> > below after removing the few bypasses and hacks of the current code:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imbalance between local
> > and
> >   busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> I really like the direction this series is going.

Thanks

>
> However, I suppose I should run these patches for
> a few days with some of our test workloads before
> I send out an ack for this patch :)

Yes, more tests on different platforms are welcome

>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
  2019-09-30  1:12   ` Rik van Riel
@ 2019-09-30 16:24   ` Dietmar Eggemann
  2019-10-01  8:14     ` Vincent Guittot
  2019-10-01  8:15   ` Dietmar Eggemann
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 57+ messages in thread
From: Dietmar Eggemann @ 2019-09-30 16:24 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	Morten.Rasmussen, hdanton

Hi Vincent,

On 19/09/2019 09:33, Vincent Guittot wrote:

these are just some comments & questions based on a code study. Haven't
run any tests with it yet.

[...]

> The type of sched_group has been extended to better reflect the type of
> imbalance. We now have :
>         group_has_spare
>         group_fully_busy
>         group_misfit_task
>         group_asym_capacity

s/group_asym_capacity/group_asym_packing

>         group_imbalanced
>         group_overloaded
>
> Based on the type fo sched_group, load_balance now sets what it wants to

s/fo/for

> move in order to fix the imbalance. It can be some load as before but also
> some utilization, a number of task or a type of task:
>         migrate_task
>         migrate_util
>         migrate_load
>         migrate_misfit
>
> This new load_balance algorithm fixes several pending wrong tasks
> placement:
> - the 1 task per CPU case with asymmetrics system

s/asymmetrics/asymmetric

This stands for ASYM_CPUCAPACITY and ASYM_PACKING, right?

[...]

>   #define LBF_ALL_PINNED       0x01
> @@ -7115,7 +7130,7 @@ struct lb_env {
>         unsigned int            loop_max;
>  
>         enum fbq_type           fbq_type;
> -     enum group_type         src_grp_type;
> +     enum migration_type     balance_type;

Minor thing:

Why not
     enum migration_type migration_type;
or
     enum balance_type balance_type;

We do the same for other enums like fbq_type or group_type.

>         struct list_head        tasks;
>   };
>  

The detach_tasks() comment still mentions only runnable load.

> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>   {
>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>         struct task_struct *p;
> -     unsigned long load;
> +     unsigned long util, load;

Minor: Order by length or reduce scope to while loop ?

>         int detached = 0;
>  
>         lockdep_assert_held(&env->src_rq->lock);
> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
>                 if (!can_migrate_task(p, env))
>                         goto next;
>  
> -             load = task_h_load(p);
> +             switch (env->balance_type) {
> +             case migrate_load:
> +                     load = task_h_load(p);
>  
> -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> -                     goto next;
> +                     if (sched_feat(LB_MIN) &&
> +                         load < 16 && !env->sd->nr_balance_failed)
> +                             goto next;
>  
> -             if ((load / 2) > env->imbalance)
> -                     goto next;
> +                     if ((load / 2) > env->imbalance)
> +                             goto next;
> +
> +                     env->imbalance -= load;
> +                     break;
> +
> +             case migrate_util:
> +                     util = task_util_est(p);
> +
> +                     if (util > env->imbalance)
> +                             goto next;
> +
> +                     env->imbalance -= util;
> +                     break;
> +
> +             case migrate_task:
> +                     /* Migrate task */

Minor: IMHO, this comment is not necessary.

> +                     env->imbalance--;
> +                     break;
> +
> +             case migrate_misfit:
> +                     load = task_h_load(p);
> +
> +                     /*
> +                      * utilization of misfit task might decrease a bit

This patch still uses load. IMHO this comments becomes true only with 08/10.

[...]

> @@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>         *sds = (struct sd_lb_stats){
>                 .busiest = NULL,
>                 .local = NULL,
> -             .total_running = 0UL,
>                 .total_load = 0UL,
>                 .total_capacity = 0UL,
>                 .busiest_stat = {
> -                     .avg_load = 0UL,

There is a sentence in the comment above explaining why avg_load has to
be cleared. And IMHO local group isn't cleared anymore but set/initialized.

> -                     .sum_h_nr_running = 0,
> -                     .group_type = group_other,
> +                     .idle_cpus = UINT_MAX,
> +                     .group_type = group_has_spare,
>                 },
>         };
>   }

[...]

> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 }
>         }
>  
> -     /* Adjust by relative CPU capacity of the group */
> +     /* Check if dst cpu is idle and preferred to this group */

s/preferred to/preferred by ? or the preferred CPU of this group ?

[...]

> @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>    */
>   static inline void calculate_imbalance(struct lb_env *env, struct
sd_lb_stats *sds)
>   {
> -     unsigned long max_pull, load_above_capacity = ~0UL;
>         struct sg_lb_stats *local, *busiest;
>  
>         local = &sds->local_stat;
>         busiest = &sds->busiest_stat;
>  
> -     if (busiest->group_asym_packing) {
> +     if (busiest->group_type == group_misfit_task) {
> +             /* Set imbalance to allow misfit task to be balanced. */
> +             env->balance_type = migrate_misfit;
> +             env->imbalance = busiest->group_misfit_task_load;
> +             return;
> +     }
> +
> +     if (busiest->group_type == group_asym_packing) {
> +             /*
> +              * In case of asym capacity, we will try to migrate all load to

Does asym capacity stands for asym packing or asym cpu capacity?

> +              * the preferred CPU.
> +              */
> +             env->balance_type = migrate_load;
>                 env->imbalance = busiest->group_load;
>                 return;
>         }
>  
> +     if (busiest->group_type == group_imbalanced) {
> +             /*
> +              * In the group_imb case we cannot rely on group-wide averages
> +              * to ensure CPU-load equilibrium, try to move any task to fix
> +              * the imbalance. The next load balance will take care of
> +              * balancing back the system.

balancing back ?

> +              */
> +             env->balance_type = migrate_task;
> +             env->imbalance = 1;
> +             return;
> +     }
> +
>         /*
> -      * Avg load of busiest sg can be less and avg load of local sg can
> -      * be greater than avg load across all sgs of sd because avg load
> -      * factors in sg capacity and sgs with smaller group_type are
> -      * skipped when updating the busiest sg:
> +      * Try to use spare capacity of local group without overloading it or
> +      * emptying busiest
>          */
> -     if (busiest->group_type != group_misfit_task &&
> -         (busiest->avg_load <= sds->avg_load ||
> -          local->avg_load >= sds->avg_load)) {
> -             env->imbalance = 0;
> +     if (local->group_type == group_has_spare) {
> +             if (busiest->group_type > group_fully_busy) {

So this could be 'busiest->group_type == group_overloaded' here to match
the comment below? Since you handle group_misfit_task,
group_asym_packing, group_imbalanced above and return.

> +                     /*
> +                      * If busiest is overloaded, try to fill spare
> +                      * capacity. This might end up creating spare capacity
> +                      * in busiest or busiest still being overloaded but
> +                      * there is no simple way to directly compute the
> +                      * amount of load to migrate in order to balance the
> +                      * system.
> +                      */
> +                     env->balance_type = migrate_util;
> +                     env->imbalance = max(local->group_capacity, local->group_util) -
> +                                 local->group_util;
> +                     return;
> +             }
> +
> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +                     /*
> +                      * When prefer sibling, evenly spread running tasks on
> +                      * groups.
> +                      */
> +                     env->balance_type = migrate_task;
> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> +                     return;
> +             }
> +
> +             /*
> +              * If there is no overload, we just want to even the number of
> +              * idle cpus.
> +              */
> +             env->balance_type = migrate_task;
> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);

Why do we need a max_t(long, 0, ...) here and not for the 'if
(busiest->group_weight == 1 || sds->prefer_sibling)' case?

>                 return;
>         }
>  
>         /*
> -      * If there aren't any idle CPUs, avoid creating some.
> +      * Local is fully busy but have to take more load to relieve the

s/have/has

> +      * busiest group
>          */

I thought that 'local->group_type == group_imbalanced' is allowed as
well? So 'if (local->group_type < group_overloaded)' (further down)
could include that?

> -     if (busiest->group_type == group_overloaded &&
> -         local->group_type   == group_overloaded) {
> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> -             if (load_above_capacity > busiest->group_capacity) {
> -                     load_above_capacity -= busiest->group_capacity;
> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
> -                     load_above_capacity /= busiest->group_capacity;
> -             } else
> -                     load_above_capacity = ~0UL;
> +     if (local->group_type < group_overloaded) {
> +             /*
> +              * Local will become overloaded so the avg_load metrics are
> +              * finally needed.
> +              */

How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
It would be nice to be able to match all allowed fields of dm to code sections.

[...]

>   /******* find_busiest_group() helpers end here *********************/
>  
> +/*
> + * Decision matrix according to the local and busiest group state

Minor s/state/type ?

> + *
> + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
> + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
> + * misfit_task      force     N/A        N/A    N/A  force      force
> + * asym_capacity    force     force      N/A    N/A  force      force

s/asym_capacity/asym_packing

> + * imbalanced       force     force      N/A    N/A  force      force
> + * overloaded       force     force      N/A    N/A  force      avg_load
> + *
> + * N/A :      Not Applicable because already filtered while updating
> + *            statistics.
> + * balanced : The system is balanced for these 2 groups.
> + * force :    Calculate the imbalance as load migration is probably needed.
> + * avg_load : Only if imbalance is significant enough.
> + * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
> + *            different in groups.
> + */
> +
>   /**
>    * find_busiest_group - Returns the busiest group within the sched_domain
>    * if there is an imbalance.
> @@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>         local = &sds.local_stat;
>         busiest = &sds.busiest_stat;
>  
> -     /* ASYM feature bypasses nice load balance check */
> -     if (busiest->group_asym_packing)
> -             goto force_balance;
> -
>         /* There is no busy sibling group to pull tasks from */
> -     if (!sds.busiest || busiest->sum_h_nr_running == 0)
> +     if (!sds.busiest)
>                 goto out_balanced;
>  
> -     /* XXX broken for overlapping NUMA groups */
> -     sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
> -                                             / sds.total_capacity;
> +     /* Misfit tasks should be dealt with regardless of the avg load */
> +     if (busiest->group_type == group_misfit_task)
> +             goto force_balance;
> +
> +     /* ASYM feature bypasses nice load balance check */

Minor: s/ASYM feature/ASYM_PACKING ... to distinguish clearly from
ASYM_CPUCAPACITY.

> +     if (busiest->group_type == group_asym_packing)
> +             goto force_balance;

[...]


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-30 16:24   ` Dietmar Eggemann
@ 2019-10-01  8:14     ` Vincent Guittot
  2019-10-01 16:52       ` Dietmar Eggemann
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-01  8:14 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> Hi Vincent,
>
> On 19/09/2019 09:33, Vincent Guittot wrote:
>
> these are just some comments & questions based on a code study. Haven't
> run any tests with it yet.
>
> [...]
>
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> >         group_has_spare
> >         group_fully_busy
> >         group_misfit_task
> >         group_asym_capacity
>
> s/group_asym_capacity/group_asym_packing

Yes, I forgot to change the commit message and the comments

>
> >         group_imbalanced
> >         group_overloaded
> >
> > Based on the type fo sched_group, load_balance now sets what it wants to
>
> s/fo/for

s/fo/of/

>
> > move in order to fix the imbalance. It can be some load as before but also
> > some utilization, a number of task or a type of task:
> >         migrate_task
> >         migrate_util
> >         migrate_load
> >         migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymmetrics system
>
> s/asymmetrics/asymmetric

yes

>
> This stands for ASYM_CPUCAPACITY and ASYM_PACKING, right?

yes

>
> [...]
>
> >   #define LBF_ALL_PINNED       0x01
> > @@ -7115,7 +7130,7 @@ struct lb_env {
> >         unsigned int            loop_max;
> >
> >         enum fbq_type           fbq_type;
> > -     enum group_type         src_grp_type;
> > +     enum migration_type     balance_type;
>
> Minor thing:
>
> Why not
>      enum migration_type migration_type;
> or
>      enum balance_type balance_type;
>
> We do the same for other enums like fbq_type or group_type.

yes, I can align

>
> >         struct list_head        tasks;
> >   };
> >
>
> The detach_tasks() comment still mentions only runnable load.

ok

>
> > @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> >   {
> >         struct list_head *tasks = &env->src_rq->cfs_tasks;
> >         struct task_struct *p;
> > -     unsigned long load;
> > +     unsigned long util, load;
>
> Minor: Order by length or reduce scope to while loop ?

I don't get your point here

>
> >         int detached = 0;
> >
> >         lockdep_assert_held(&env->src_rq->lock);
> > @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
> >                 if (!can_migrate_task(p, env))
> >                         goto next;
> >
> > -             load = task_h_load(p);
> > +             switch (env->balance_type) {
> > +             case migrate_load:
> > +                     load = task_h_load(p);
> >
> > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > -                     goto next;
> > +                     if (sched_feat(LB_MIN) &&
> > +                         load < 16 && !env->sd->nr_balance_failed)
> > +                             goto next;
> >
> > -             if ((load / 2) > env->imbalance)
> > -                     goto next;
> > +                     if ((load / 2) > env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance -= load;
> > +                     break;
> > +
> > +             case migrate_util:
> > +                     util = task_util_est(p);
> > +
> > +                     if (util > env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance -= util;
> > +                     break;
> > +
> > +             case migrate_task:
> > +                     /* Migrate task */
>
> Minor: IMHO, this comment is not necessary.

yes

>
> > +                     env->imbalance--;
> > +                     break;
> > +
> > +             case migrate_misfit:
> > +                     load = task_h_load(p);
> > +
> > +                     /*
> > +                      * utilization of misfit task might decrease a bit
>
> This patch still uses load. IMHO this comments becomes true only with 08/10.

even with 8/10 it's not correct and it has been removed.
I'm going to remove it.

>
> [...]
>
> > @@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> >         *sds = (struct sd_lb_stats){
> >                 .busiest = NULL,
> >                 .local = NULL,
> > -             .total_running = 0UL,
> >                 .total_load = 0UL,
> >                 .total_capacity = 0UL,
> >                 .busiest_stat = {
> > -                     .avg_load = 0UL,
>
> There is a sentence in the comment above explaining why avg_load has to
> be cleared. And IMHO local group isn't cleared anymore but set/initialized.

Yes, I have to update it

>
> > -                     .sum_h_nr_running = 0,
> > -                     .group_type = group_other,
> > +                     .idle_cpus = UINT_MAX,
> > +                     .group_type = group_has_spare,
> >                 },
> >         };
> >   }
>
> [...]
>
> > @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >                 }
> >         }
> >
> > -     /* Adjust by relative CPU capacity of the group */
> > +     /* Check if dst cpu is idle and preferred to this group */
>
> s/preferred to/preferred by ? or the preferred CPU of this group ?

dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
this group vs dst_cpu which belongs to another group

>
> [...]
>
> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >    */
> >   static inline void calculate_imbalance(struct lb_env *env, struct
> sd_lb_stats *sds)
> >   {
> > -     unsigned long max_pull, load_above_capacity = ~0UL;
> >         struct sg_lb_stats *local, *busiest;
> >
> >         local = &sds->local_stat;
> >         busiest = &sds->busiest_stat;
> >
> > -     if (busiest->group_asym_packing) {
> > +     if (busiest->group_type == group_misfit_task) {
> > +             /* Set imbalance to allow misfit task to be balanced. */
> > +             env->balance_type = migrate_misfit;
> > +             env->imbalance = busiest->group_misfit_task_load;
> > +             return;
> > +     }
> > +
> > +     if (busiest->group_type == group_asym_packing) {
> > +             /*
> > +              * In case of asym capacity, we will try to migrate all load to
>
> Does asym capacity stands for asym packing or asym cpu capacity?

busiest->group_type == group_asym_packing

will fix it

>
> > +              * the preferred CPU.
> > +              */
> > +             env->balance_type = migrate_load;
> >                 env->imbalance = busiest->group_load;
> >                 return;
> >         }
> >
> > +     if (busiest->group_type == group_imbalanced) {
> > +             /*
> > +              * In the group_imb case we cannot rely on group-wide averages
> > +              * to ensure CPU-load equilibrium, try to move any task to fix
> > +              * the imbalance. The next load balance will take care of
> > +              * balancing back the system.
>
> balancing back ?

In case of imbalance, we don't try to balance the system but only try
to get rid of the pinned tasks problem. The system will still be
unbalanced after the migration and the next load balance will take
care of balancing the system

>
> > +              */
> > +             env->balance_type = migrate_task;
> > +             env->imbalance = 1;
> > +             return;
> > +     }
> > +
> >         /*
> > -      * Avg load of busiest sg can be less and avg load of local sg can
> > -      * be greater than avg load across all sgs of sd because avg load
> > -      * factors in sg capacity and sgs with smaller group_type are
> > -      * skipped when updating the busiest sg:
> > +      * Try to use spare capacity of local group without overloading it or
> > +      * emptying busiest
> >          */
> > -     if (busiest->group_type != group_misfit_task &&
> > -         (busiest->avg_load <= sds->avg_load ||
> > -          local->avg_load >= sds->avg_load)) {
> > -             env->imbalance = 0;
> > +     if (local->group_type == group_has_spare) {
> > +             if (busiest->group_type > group_fully_busy) {
>
> So this could be 'busiest->group_type == group_overloaded' here to match
> the comment below? Since you handle group_misfit_task,
> group_asym_packing, group_imbalanced above and return.

This is just to be more robust in case some new states are added later

>
> > +                     /*
> > +                      * If busiest is overloaded, try to fill spare
> > +                      * capacity. This might end up creating spare capacity
> > +                      * in busiest or busiest still being overloaded but
> > +                      * there is no simple way to directly compute the
> > +                      * amount of load to migrate in order to balance the
> > +                      * system.
> > +                      */
> > +                     env->balance_type = migrate_util;
> > +                     env->imbalance = max(local->group_capacity, local->group_util) -
> > +                                 local->group_util;
> > +                     return;
> > +             }
> > +
> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > +                     /*
> > +                      * When prefer sibling, evenly spread running tasks on
> > +                      * groups.
> > +                      */
> > +                     env->balance_type = migrate_task;
> > +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> > +                     return;
> > +             }
> > +
> > +             /*
> > +              * If there is no overload, we just want to even the number of
> > +              * idle cpus.
> > +              */
> > +             env->balance_type = migrate_task;
> > +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>
> Why do we need a max_t(long, 0, ...) here and not for the 'if
> (busiest->group_weight == 1 || sds->prefer_sibling)' case?

For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

either we have sds->prefer_sibling && busiest->sum_nr_running >
local->sum_nr_running + 1

or busiest->group_weight == 1 and env->idle != CPU_NOT_IDLE so local
cpu is idle or newly idle

>
> >                 return;
> >         }
> >
> >         /*
> > -      * If there aren't any idle CPUs, avoid creating some.
> > +      * Local is fully busy but have to take more load to relieve the
>
> s/have/has
>
> > +      * busiest group
> >          */
>
> I thought that 'local->group_type == group_imbalanced' is allowed as
> well? So 'if (local->group_type < group_overloaded)' (further down)
> could include that?

yes.
Imbalance state is not very useful for local group but only reflects
that the group is not overloaded so either fully busy or has spare
capacity.
In this case we assume the worst : fully_busy

>
> > -     if (busiest->group_type == group_overloaded &&
> > -         local->group_type   == group_overloaded) {
> > -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> > -             if (load_above_capacity > busiest->group_capacity) {
> > -                     load_above_capacity -= busiest->group_capacity;
> > -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
> > -                     load_above_capacity /= busiest->group_capacity;
> > -             } else
> > -                     load_above_capacity = ~0UL;
> > +     if (local->group_type < group_overloaded) {
> > +             /*
> > +              * Local will become overloaded so the avg_load metrics are
> > +              * finally needed.
> > +              */
>
> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> It would be nice to be able to match all allowed fields of dm to code sections.

decision_matrix describes how we decide between balanced and unbalanced.
In case of dm[overload, overload], we use the avg_load to decide if it
is balanced or not.
In case of dm[fully_busy, overload], the groups are unbalanced because
fully_busy < overload and we force the balance. Then
calculate_imbalance() uses the avg_load to decide how much will be
moved.

An entry of "force" means that we force the balance and compute the
imbalance later; avg_load may be used to calculate the imbalance.
An entry of "avg_load" means that we compare the avg_load to decide
whether we need to balance load between the groups.
An entry of "nr_idle" means that we compare the number of idle cpus to
decide whether we need to balance. In fact this is no longer strictly
true with patch 7 because we also take h_nr_running into account when
the group weight is 1.

>
> [...]
>
> >   /******* find_busiest_group() helpers end here *********************/
> >
> > +/*
> > + * Decision matrix according to the local and busiest group state
>
> Minor s/state/type ?

ok

>
> > + *
> > + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> > + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
> > + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
> > + * misfit_task      force     N/A        N/A    N/A  force      force
> > + * asym_capacity    force     force      N/A    N/A  force      force
>
> s/asym_capacity/asym_packing

yes

>
> > + * imbalanced       force     force      N/A    N/A  force      force
> > + * overloaded       force     force      N/A    N/A  force      avg_load
> > + *
> > + * N/A :      Not Applicable because already filtered while updating
> > + *            statistics.
> > + * balanced : The system is balanced for these 2 groups.
> > + * force :    Calculate the imbalance as load migration is probably needed.
> > + * avg_load : Only if imbalance is significant enough.
> > + * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
> > + *            different in groups.
> > + */
> > +
> >   /**
> >    * find_busiest_group - Returns the busiest group within the sched_domain
> >    * if there is an imbalance.
> > @@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >         local = &sds.local_stat;
> >         busiest = &sds.busiest_stat;
> >
> > -     /* ASYM feature bypasses nice load balance check */
> > -     if (busiest->group_asym_packing)
> > -             goto force_balance;
> > -
> >         /* There is no busy sibling group to pull tasks from */
> > -     if (!sds.busiest || busiest->sum_h_nr_running == 0)
> > +     if (!sds.busiest)
> >                 goto out_balanced;
> >
> > -     /* XXX broken for overlapping NUMA groups */
> > -     sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
> > -                                             / sds.total_capacity;
> > +     /* Misfit tasks should be dealt with regardless of the avg load */
> > +     if (busiest->group_type == group_misfit_task)
> > +             goto force_balance;
> > +
> > +     /* ASYM feature bypasses nice load balance check */
>
> Minor: s/ASYM feature/ASYM_PACKING ... to distinguish clearly from
> ASYM_CPUCAPACITY.

yes

>
> > +     if (busiest->group_type == group_asym_packing)
> > +             goto force_balance;
>
> [...]
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
  2019-09-30  1:12   ` Rik van Riel
  2019-09-30 16:24   ` Dietmar Eggemann
@ 2019-10-01  8:15   ` Dietmar Eggemann
  2019-10-01  9:14     ` Vincent Guittot
  2019-10-01 17:47   ` Valentin Schneider
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 57+ messages in thread
From: Dietmar Eggemann @ 2019-10-01  8:15 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	Morten.Rasmussen, hdanton

On 19/09/2019 09:33, Vincent Guittot wrote:


[...]

> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		}
>  	}
>  
> -	/* Adjust by relative CPU capacity of the group */
> +	/* Check if dst cpu is idle and preferred to this group */
> +	if (env->sd->flags & SD_ASYM_PACKING &&
> +	    env->idle != CPU_NOT_IDLE &&
> +	    sgs->sum_h_nr_running &&
> +	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> +		sgs->group_asym_packing = 1;
> +	}
> +

Since the asym_packing check is done per-sg rather than per-CPU (like misfit_task), can you
not check for asym_packing in group_classify() directly (like for overloaded) and
so get rid of struct sg_lb_stats::group_asym_packing?

Something like (only compile tested).

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..b2848d6f8a2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7692,7 +7692,6 @@ struct sg_lb_stats {
        unsigned int idle_cpus;
        unsigned int group_weight;
        enum group_type group_type;
-       unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
        unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
        unsigned int nr_numa_running;
@@ -7952,6 +7951,20 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
        return false;
 }
 
+static inline bool
+group_has_asym_packing(struct lb_env *env, struct sched_group *sg,
+                      struct sg_lb_stats *sgs)
+{
+       if (env->sd->flags & SD_ASYM_PACKING &&
+           env->idle != CPU_NOT_IDLE &&
+           sgs->sum_h_nr_running &&
+           sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
+               return true;
+       }
+
+       return false;
+}
+
 /*
  * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
  * per-CPU capacity than sched_group ref.
@@ -7983,7 +7996,7 @@ group_type group_classify(struct lb_env *env,
        if (sg_imbalanced(group))
                return group_imbalanced;
 
-       if (sgs->group_asym_packing)
+       if (group_has_asym_packing(env, group, sgs))
                return group_asym_packing;
 
        if (sgs->group_misfit_task_load)
@@ -8076,14 +8089,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
                }
        }
 
-       /* Check if dst cpu is idle and preferred to this group */
-       if (env->sd->flags & SD_ASYM_PACKING &&
-           env->idle != CPU_NOT_IDLE &&
-           sgs->sum_h_nr_running &&
-           sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-               sgs->group_asym_packing = 1;
-       }
-
        sgs->group_capacity = group->sgc->capacity;

[...]

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-01  8:15   ` Dietmar Eggemann
@ 2019-10-01  9:14     ` Vincent Guittot
  2019-10-01 16:57       ` Dietmar Eggemann
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-01  9:14 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

group_asym_packing

On Tue, 1 Oct 2019 at 10:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 19/09/2019 09:33, Vincent Guittot wrote:
>
>
> [...]
>
> > @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >               }
> >       }
> >
> > -     /* Adjust by relative CPU capacity of the group */
> > +     /* Check if dst cpu is idle and preferred to this group */
> > +     if (env->sd->flags & SD_ASYM_PACKING &&
> > +         env->idle != CPU_NOT_IDLE &&
> > +         sgs->sum_h_nr_running &&
> > +         sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> > +             sgs->group_asym_packing = 1;
> > +     }
> > +
>
> Since asym_packing check is done per-sg rather per-CPU (like misfit_task), can you
> not check for asym_packing in group_classify() directly (like for overloaded) and
> so get rid of struct sg_lb_stats::group_asym_packing?

asym_packing uses a lot of fields of env to set group_asym_packing, but
env is not statistics and I'd like to remove env from the
group_classify() arguments so that it only uses statistics.
For now, env is an argument of group_classify() only for the
imbalance_pct field, and I have replaced env by imbalance_pct in a
patch that is under preparation.
With this change, I can use group_classify() outside load_balance().
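
For illustration, a rough sketch of that direction (this is an assumption
about the patch under preparation, not its actual content):
group_classify() keyed on imbalance_pct so it only needs the group and
its statistics.

static inline enum group_type
group_classify(unsigned int imbalance_pct, struct sched_group *group,
	       struct sg_lb_stats *sgs)
{
	/* Ordering mirrors the existing classification, most severe first. */
	if (group_is_overloaded(imbalance_pct, sgs))
		return group_overloaded;

	if (sg_imbalanced(group))
		return group_imbalanced;

	if (sgs->group_asym_packing)
		return group_asym_packing;

	if (sgs->group_misfit_task_load)
		return group_misfit_task;

	if (!group_has_capacity(imbalance_pct, sgs))
		return group_fully_busy;

	return group_has_spare;
}

group_is_overloaded() and group_has_capacity() would have to take
imbalance_pct the same way for this to build; that is part of the
assumption.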

>
> Something like (only compile tested).
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0c3aa1dc290..b2848d6f8a2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7692,7 +7692,6 @@ struct sg_lb_stats {
>         unsigned int idle_cpus;
>         unsigned int group_weight;
>         enum group_type group_type;
> -       unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
>         unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
>         unsigned int nr_numa_running;
> @@ -7952,6 +7951,20 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
>         return false;
>  }
>
> +static inline bool
> +group_has_asym_packing(struct lb_env *env, struct sched_group *sg,
> +                      struct sg_lb_stats *sgs)
> +{
> +       if (env->sd->flags & SD_ASYM_PACKING &&
> +           env->idle != CPU_NOT_IDLE &&
> +           sgs->sum_h_nr_running &&
> +           sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
>  /*
>   * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
>   * per-CPU capacity than sched_group ref.
> @@ -7983,7 +7996,7 @@ group_type group_classify(struct lb_env *env,
>         if (sg_imbalanced(group))
>                 return group_imbalanced;
>
> -       if (sgs->group_asym_packing)
> +       if (group_has_asym_packing(env, group, sgs))
>                 return group_asym_packing;
>
>         if (sgs->group_misfit_task_load)
> @@ -8076,14 +8089,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 }
>         }
>
> -       /* Check if dst cpu is idle and preferred to this group */
> -       if (env->sd->flags & SD_ASYM_PACKING &&
> -           env->idle != CPU_NOT_IDLE &&
> -           sgs->sum_h_nr_running &&
> -           sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> -               sgs->group_asym_packing = 1;
> -       }
> -
>         sgs->group_capacity = group->sgc->capacity;
>
> [...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-01  8:14     ` Vincent Guittot
@ 2019-10-01 16:52       ` Dietmar Eggemann
  2019-10-02  6:44         ` Vincent Guittot
  2019-10-02  8:23         ` Vincent Guittot
  0 siblings, 2 replies; 57+ messages in thread
From: Dietmar Eggemann @ 2019-10-01 16:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

On 01/10/2019 10:14, Vincent Guittot wrote:
> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> Hi Vincent,
>>
>> On 19/09/2019 09:33, Vincent Guittot wrote:

[...]

>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>>   {
>>>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>>>         struct task_struct *p;
>>> -     unsigned long load;
>>> +     unsigned long util, load;
>>
>> Minor: Order by length or reduce scope to while loop ?
> 
> I don't get your point here

Nothing dramatic here! Just

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..a08f342ead89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
 static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
-       struct task_struct *p;
        unsigned long load, util;
+       struct task_struct *p;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);

or

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..4d1864d43ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
        struct task_struct *p;
-       unsigned long load, util;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);
@@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)
                return 0;

        while (!list_empty(tasks)) {
+               unsigned long load, util;
+
                /*

[...]

>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>                 }
>>>         }
>>>
>>> -     /* Adjust by relative CPU capacity of the group */
>>> +     /* Check if dst cpu is idle and preferred to this group */
>>
>> s/preferred to/preferred by ? or the preferred CPU of this group ?
> 
> dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
> this group vs dst_cpu which belongs to another group

Ah, in the sense of 'preferred over'. Got it now!

[...]

>>> +     if (busiest->group_type == group_imbalanced) {
>>> +             /*
>>> +              * In the group_imb case we cannot rely on group-wide averages
>>> +              * to ensure CPU-load equilibrium, try to move any task to fix
>>> +              * the imbalance. The next load balance will take care of
>>> +              * balancing back the system.
>>
>> balancing back ?
> 
> In case of imbalance, we don't try to balance the system but only try
> to get rid of the pinned tasks problem. The system will still be
> unbalanced after the migration and the next load balance will take
> care of balancing the system

OK.

[...]

>>>         /*
>>> -      * Avg load of busiest sg can be less and avg load of local sg can
>>> -      * be greater than avg load across all sgs of sd because avg load
>>> -      * factors in sg capacity and sgs with smaller group_type are
>>> -      * skipped when updating the busiest sg:
>>> +      * Try to use spare capacity of local group without overloading it or
>>> +      * emptying busiest
>>>          */
>>> -     if (busiest->group_type != group_misfit_task &&
>>> -         (busiest->avg_load <= sds->avg_load ||
>>> -          local->avg_load >= sds->avg_load)) {
>>> -             env->imbalance = 0;
>>> +     if (local->group_type == group_has_spare) {
>>> +             if (busiest->group_type > group_fully_busy) {
>>
>> So this could be 'busiest->group_type == group_overloaded' here to match
>> the comment below? Since you handle group_misfit_task,
>> group_asym_packing, group_imbalanced above and return.
> 
> This is just to be more robust in case some new states are added later

OK, although I doubt that additional states can be added easily w/o
carefully auditing the entire lb code ;-)

[...]

>>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
>>> +                     /*
>>> +                      * When prefer sibling, evenly spread running tasks on
>>> +                      * groups.
>>> +                      */
>>> +                     env->balance_type = migrate_task;
>>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>> +                     return;
>>> +             }
>>> +
>>> +             /*
>>> +              * If there is no overload, we just want to even the number of
>>> +              * idle cpus.
>>> +              */
>>> +             env->balance_type = migrate_task;
>>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>>
>> Why do we need a max_t(long, 0, ...) here and not for the 'if
>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> 
> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> 
> either we have sds->prefer_sibling && busiest->sum_nr_running >
> local->sum_nr_running + 1

I see, this corresponds to

/* Try to move all excess tasks to child's sibling domain */
       if (sds.prefer_sibling && local->group_type == group_has_spare &&
           busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
               goto force_balance;

in find_busiest_group, I assume.

Haven't been able to recreate this yet on my arm64 platform since there
is no prefer_sibling and, in case local and busiest have
group_type=group_has_spare, they bail out in

         if (busiest->group_type != group_overloaded &&
              (env->idle == CPU_NOT_IDLE ||
               local->idle_cpus <= (busiest->idle_cpus + 1)))
                 goto out_balanced;


[...]

>>> -     if (busiest->group_type == group_overloaded &&
>>> -         local->group_type   == group_overloaded) {
>>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
>>> -             if (load_above_capacity > busiest->group_capacity) {
>>> -                     load_above_capacity -= busiest->group_capacity;
>>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
>>> -                     load_above_capacity /= busiest->group_capacity;
>>> -             } else
>>> -                     load_above_capacity = ~0UL;
>>> +     if (local->group_type < group_overloaded) {
>>> +             /*
>>> +              * Local will become overloaded so the avg_load metrics are
>>> +              * finally needed.
>>> +              */
>>
>> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
>> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
>> It would be nice to be able to match all allowed fields of dm to code sections.
> 
> decision_matrix describes how it decides between balanced or unbalanced.
> In case of dm[overload, overload], we use the avg_load to decide if it
> is balanced or not

OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
for 'sgs->group_type == group_overloaded'.

> In case of dm[fully_busy, overload], the groups are unbalanced because
> fully_busy < overload and we force the balance. Then
> calculate_imbalance() uses the avg_load to decide how much will be
> moved

And in this case 'local->group_type < group_overloaded' in
calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
calculated before using them in env->imbalance = min(...).

OK, got it now.
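
For reference, a hedged reconstruction of the code path being discussed
(approximate shape inferred from the patch and this thread; the exact
lines in v3 may differ):

	if (local->group_type < group_overloaded) {
		/*
		 * Local will become overloaded so the avg_load metrics
		 * are finally needed.
		 */
		local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
				  local->group_capacity;
		sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
				sds->total_capacity;
	}

	env->imbalance = min(
		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;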

> dm[overload, overload]=force means that we force the balance and we
> will compute later the imbalance. avg_load may be used to calculate
> the imbalance
> dm[overload, overload]=avg_load means that we compare the avg_load to
> decide whether we need to balance load between groups
> dm[overload, overload]=nr_idle means that we compare the number of
> idle cpus to decide whether we need to balance.  In fact this is no
> more true with patch 7 because we also take into account the number of
> nr_h_running when weight =1

This becomes clearer now ... slowly.

[...]

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-01  9:14     ` Vincent Guittot
@ 2019-10-01 16:57       ` Dietmar Eggemann
  0 siblings, 0 replies; 57+ messages in thread
From: Dietmar Eggemann @ 2019-10-01 16:57 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton



On 01/10/2019 11:14, Vincent Guittot wrote:
> group_asym_packing
> 
> On Tue, 1 Oct 2019 at 10:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 19/09/2019 09:33, Vincent Guittot wrote:
>>
>>
>> [...]
>>
>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>               }
>>>       }
>>>
>>> -     /* Adjust by relative CPU capacity of the group */
>>> +     /* Check if dst cpu is idle and preferred to this group */
>>> +     if (env->sd->flags & SD_ASYM_PACKING &&
>>> +         env->idle != CPU_NOT_IDLE &&
>>> +         sgs->sum_h_nr_running &&
>>> +         sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
>>> +             sgs->group_asym_packing = 1;
>>> +     }
>>> +
>>
>> Since asym_packing check is done per-sg rather per-CPU (like misfit_task), can you
>> not check for asym_packing in group_classify() directly (like for overloaded) and
>> so get rid of struct sg_lb_stats::group_asym_packing?
> 
> asym_packing uses a lot of fields of env to set group_asym_packing but
> env is not statistics and i'd like to remove env from
> group_classify() argument so it will use only statistics.
> At now, env is an arg of group_classify only for imbalance_pct field
> and I have replaced env by imabalnce_pct in a patch that is under
> preparation.
> With this change, I can use group_classify outside load_balance()

OK, I see. This relates to the 'update find_idlest_group() to be more
aligned with load_balance()' point mentioned in the cover letter I
assume. To make sure we can use load instead of runnable_load there as well.

[...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running
  2019-09-19  7:33 ` [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
  2019-09-27 23:59   ` Rik van Riel
@ 2019-10-01 17:11   ` Valentin Schneider
  1 sibling, 0 replies; 57+ messages in thread
From: Valentin Schneider @ 2019-10-01 17:11 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton

On 19/09/2019 08:33, Vincent Guittot wrote:
> Rename sum_nr_running to sum_h_nr_running because it effectively tracks
> cfs->h_nr_running so we can use sum_nr_running to track rq->nr_running
> when needed.
> 
> There is no functional changes.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation
  2019-09-19  7:33 ` [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation Vincent Guittot
  2019-09-28  0:05   ` Rik van Riel
@ 2019-10-01 17:12   ` Valentin Schneider
  2019-10-02  6:28     ` Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Valentin Schneider @ 2019-10-01 17:12 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton

On 19/09/2019 08:33, Vincent Guittot wrote:
> clean up load_balance and remove meaningless calculation and fields before
> adding new algorithm.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

We'll probably want to squash the removal of fix_small_imbalance() in the
actual rework (patch 04) to not make this a regression bisect honeypot.
Other than that:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 08/10] sched/fair: use utilization to select misfit task
  2019-09-19  7:33 ` [PATCH v3 08/10] sched/fair: use utilization to select misfit task Vincent Guittot
@ 2019-10-01 17:12   ` Valentin Schneider
  0 siblings, 0 replies; 57+ messages in thread
From: Valentin Schneider @ 2019-10-01 17:12 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton

On 19/09/2019 08:33, Vincent Guittot wrote:
> utilization is used to detect a misfit task but the load is then used to
> select the task on the CPU which can lead to select a small task with
> high weight instead of the task that triggered the misfit migration.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Acked-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a7c8ee6..acca869 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7429,14 +7429,8 @@ static int detach_tasks(struct lb_env *env)
>  			break;
>  
>  		case migrate_misfit:
> -			load = task_h_load(p);
> -
> -			/*
> -			 * utilization of misfit task might decrease a bit
> -			 * since it has been recorded. Be conservative in the
> -			 * condition.
> -			 */
> -			if (load < env->imbalance)
> +			/* This is not a misfit task */
> +			if (task_fits_capacity(p, capacity_of(env->src_cpu)))
>  				goto next;
>  
>  			env->imbalance = 0;
> 

Following my comment in [1], if you can't squash that in patch 04 then
perhaps you could add that to this change:

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1fac444a4831..d09ce304161d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8343,7 +8343,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_misfit_task) {
 		/* Set imbalance to allow misfit task to be balanced. */
 		env->balance_type = migrate_misfit;
-		env->imbalance = busiest->group_misfit_task_load;
+		env->imbalance = 1;
 		return;
 	}
 
---

Reason being we don't care about the load (anymore), we just want a nonzero
imbalance value, so might as well assign something static.

[1]: https://lore.kernel.org/r/74bb33d7-3ba4-aabe-c7a2-3865d5759281@arm.com

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
                     ` (2 preceding siblings ...)
  2019-10-01  8:15   ` Dietmar Eggemann
@ 2019-10-01 17:47   ` Valentin Schneider
  2019-10-02  8:30     ` Vincent Guittot
  2019-10-08 17:55   ` Peter Zijlstra
  2019-10-16  7:21   ` Parth Shah
  5 siblings, 1 reply; 57+ messages in thread
From: Valentin Schneider @ 2019-10-01 17:47 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton

On 19/09/2019 08:33, Vincent Guittot wrote:

[...]

> @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   */
>  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>  {
> -	unsigned long max_pull, load_above_capacity = ~0UL;
>  	struct sg_lb_stats *local, *busiest;
>  
>  	local = &sds->local_stat;
>  	busiest = &sds->busiest_stat;
>  
> -	if (busiest->group_asym_packing) {
> +	if (busiest->group_type == group_misfit_task) {
> +		/* Set imbalance to allow misfit task to be balanced. */
> +		env->balance_type = migrate_misfit;
> +		env->imbalance = busiest->group_misfit_task_load;
> +		return;
> +	}
> +
> +	if (busiest->group_type == group_asym_packing) {
> +		/*
> +		 * In case of asym capacity, we will try to migrate all load to
> +		 * the preferred CPU.
> +		 */
> +		env->balance_type = migrate_load;
>  		env->imbalance = busiest->group_load;
>  		return;
>  	}
>  
> +	if (busiest->group_type == group_imbalanced) {
> +		/*
> +		 * In the group_imb case we cannot rely on group-wide averages
> +		 * to ensure CPU-load equilibrium, try to move any task to fix
> +		 * the imbalance. The next load balance will take care of
> +		 * balancing back the system.
> +		 */
> +		env->balance_type = migrate_task;
> +		env->imbalance = 1;
> +		return;
> +	}
> +
>  	/*
> -	 * Avg load of busiest sg can be less and avg load of local sg can
> -	 * be greater than avg load across all sgs of sd because avg load
> -	 * factors in sg capacity and sgs with smaller group_type are
> -	 * skipped when updating the busiest sg:
> +	 * Try to use spare capacity of local group without overloading it or
> +	 * emptying busiest
>  	 */
> -	if (busiest->group_type != group_misfit_task &&
> -	    (busiest->avg_load <= sds->avg_load ||
> -	     local->avg_load >= sds->avg_load)) {
> -		env->imbalance = 0;
> +	if (local->group_type == group_has_spare) {
> +		if (busiest->group_type > group_fully_busy) {
> +			/*
> +			 * If busiest is overloaded, try to fill spare
> +			 * capacity. This might end up creating spare capacity
> +			 * in busiest or busiest still being overloaded but
> +			 * there is no simple way to directly compute the
> +			 * amount of load to migrate in order to balance the
> +			 * system.
> +			 */
> +			env->balance_type = migrate_util;
> +			env->imbalance = max(local->group_capacity, local->group_util) -
> +				    local->group_util;
> +			return;
> +		}
> +
> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +			/*
> +			 * When prefer sibling, evenly spread running tasks on
> +			 * groups.
> +			 */
> +			env->balance_type = migrate_task;
> +			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

Isn't that one somewhat risky?

Say both groups are classified group_has_spare and we do prefer_sibling.
We'd select busiest as the one with the maximum number of busy CPUs, but it
could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
pinned tasks or wakeup failed to properly spread stuff).

The thing should be unsigned so at least we save ourselves from right
shifting a negative value, but we still end up with a ginormous imbalance
(which we then store into env.imbalance which *is* signed... Urgh).
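
To make the failure mode concrete, a worked example with made-up numbers
(the sum_h_nr_running stats are unsigned int):

	/* e.g. busiest->sum_h_nr_running = 2, local->sum_h_nr_running = 4 */
	2u - 4u        = 4294967294	/* wraps to UINT_MAX - 1          */
	(2u - 4u) >> 1 = 2147483647	/* stored in the signed imbalance */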

[...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation
  2019-10-01 17:12   ` Valentin Schneider
@ 2019-10-02  6:28     ` Vincent Guittot
  0 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-10-02  6:28 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Tue, 1 Oct 2019 at 19:12, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 19/09/2019 08:33, Vincent Guittot wrote:
> > clean up load_balance and remove meaningless calculation and fields before
> > adding new algorithm.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> We'll probably want to squash the removal of fix_small_imbalance() in the
> actual rework (patch 04) to not make this a regression bisect honeypot.

Yes, that's the plan. Peter asked to split the patch to ease the review.

> Other than that:
>
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-01 16:52       ` Dietmar Eggemann
@ 2019-10-02  6:44         ` Vincent Guittot
  2019-10-02  9:21           ` Dietmar Eggemann
  2019-10-02  8:23         ` Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-02  6:44 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 01/10/2019 10:14, Vincent Guittot wrote:
> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 19/09/2019 09:33, Vincent Guittot wrote:
>
> [...]
>
> >>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> >>>   {
> >>>         struct list_head *tasks = &env->src_rq->cfs_tasks;
> >>>         struct task_struct *p;
> >>> -     unsigned long load;
> >>> +     unsigned long util, load;
> >>
> >> Minor: Order by length or reduce scope to while loop ?
> >
> > I don't get your point here
>
> Nothing dramatic here! Just
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0c3aa1dc290..a08f342ead89 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
>  static int detach_tasks(struct lb_env *env)
>  {
>         struct list_head *tasks = &env->src_rq->cfs_tasks;
> -       struct task_struct *p;
>         unsigned long load, util;
> +       struct task_struct *p;

Hmm... I still don't get this.
We usually gather pointers instead of interleaving them with other variables.

>         int detached = 0;
>
>         lockdep_assert_held(&env->src_rq->lock);
>
> or
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0c3aa1dc290..4d1864d43ed7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)
>  {
>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>         struct task_struct *p;
> -       unsigned long load, util;
>         int detached = 0;
>
>         lockdep_assert_held(&env->src_rq->lock);
> @@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)
>                 return 0;
>
>         while (!list_empty(tasks)) {
> +               unsigned long load, util;
> +
>                 /*
>
> [...]
>
> >>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>>                 }
> >>>         }
> >>>
> >>> -     /* Adjust by relative CPU capacity of the group */
> >>> +     /* Check if dst cpu is idle and preferred to this group */
> >>
> >> s/preferred to/preferred by ? or the preferred CPU of this group ?
> >
> > dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
> > this group vs dst_cpu which belongs to another group
>
> Ah, in the sense of 'preferred over'. Got it now!
>
> [...]
>
> >>> +     if (busiest->group_type == group_imbalanced) {
> >>> +             /*
> >>> +              * In the group_imb case we cannot rely on group-wide averages
> >>> +              * to ensure CPU-load equilibrium, try to move any task to fix
> >>> +              * the imbalance. The next load balance will take care of
> >>> +              * balancing back the system.
> >>
> >> balancing back ?
> >
> > In case of imbalance, we don't try to balance the system but only try
> > to get rid of the pinned tasks problem. The system will still be
> > unbalanced after the migration and the next load balance will take
> > care of balancing the system
>
> OK.
>
> [...]
>
> >>>         /*
> >>> -      * Avg load of busiest sg can be less and avg load of local sg can
> >>> -      * be greater than avg load across all sgs of sd because avg load
> >>> -      * factors in sg capacity and sgs with smaller group_type are
> >>> -      * skipped when updating the busiest sg:
> >>> +      * Try to use spare capacity of local group without overloading it or
> >>> +      * emptying busiest
> >>>          */
> >>> -     if (busiest->group_type != group_misfit_task &&
> >>> -         (busiest->avg_load <= sds->avg_load ||
> >>> -          local->avg_load >= sds->avg_load)) {
> >>> -             env->imbalance = 0;
> >>> +     if (local->group_type == group_has_spare) {
> >>> +             if (busiest->group_type > group_fully_busy) {
> >>
> >> So this could be 'busiest->group_type == group_overloaded' here to match
> >> the comment below? Since you handle group_misfit_task,
> >> group_asym_packing, group_imbalanced above and return.
> >
> > This is just to be more robust in case some new states are added later
>
> OK, although I doubt that additional states can be added easily w/o
> carefully auditing the entire lb code ;-)
>
> [...]
>
> >>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >>> +                     /*
> >>> +                      * When prefer sibling, evenly spread running tasks on
> >>> +                      * groups.
> >>> +                      */
> >>> +                     env->balance_type = migrate_task;
> >>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >>> +                     return;
> >>> +             }
> >>> +
> >>> +             /*
> >>> +              * If there is no overload, we just want to even the number of
> >>> +              * idle cpus.
> >>> +              */
> >>> +             env->balance_type = migrate_task;
> >>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> >>
> >> Why do we need a max_t(long, 0, ...) here and not for the 'if
> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> >
> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >
> > either we have sds->prefer_sibling && busiest->sum_nr_running >
> > local->sum_nr_running + 1
>
> I see, this corresponds to
>
> /* Try to move all excess tasks to child's sibling domain */
>        if (sds.prefer_sibling && local->group_type == group_has_spare &&
>            busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
>                goto force_balance;
>
> in find_busiest_group, I assume.

yes, it is

>
> Haven't been able to recreate this yet on my arm64 platform since there
> is no prefer_sibling and in case local and busiest have

You probably have a b.L platform for which the flag is cleared, whereas
the hikey (dual quad-core arm64) takes advantage of prefer_sibling
at DIE level to spread tasks.

> group_type=group_has_spare they bailout in
>
>          if (busiest->group_type != group_overloaded &&
>               (env->idle == CPU_NOT_IDLE ||
>                local->idle_cpus <= (busiest->idle_cpus + 1)))
>                  goto out_balanced;
>
>
> [...]
>
> >>> -     if (busiest->group_type == group_overloaded &&
> >>> -         local->group_type   == group_overloaded) {
> >>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> >>> -             if (load_above_capacity > busiest->group_capacity) {
> >>> -                     load_above_capacity -= busiest->group_capacity;
> >>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
> >>> -                     load_above_capacity /= busiest->group_capacity;
> >>> -             } else
> >>> -                     load_above_capacity = ~0UL;
> >>> +     if (local->group_type < group_overloaded) {
> >>> +             /*
> >>> +              * Local will become overloaded so the avg_load metrics are
> >>> +              * finally needed.
> >>> +              */
> >>
> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> >> It would be nice to be able to match all allowed fields of dm to code sections.
> >
> > decision_matrix describes how it decides between balanced or unbalanced.
> > In case of dm[overload, overload], we use the avg_load to decide if it
> > is balanced or not
>
> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
> for 'sgs->group_type == group_overloaded'.
>
> > In case of dm[fully_busy, overload], the groups are unbalanced because
> > fully_busy < overload and we force the balance. Then
> > calculate_imbalance() uses the avg_load to decide how much will be
> > moved
>
> And in this case 'local->group_type < group_overloaded' in
> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
> calculated before using them in env->imbalance = min(...).
>
> OK, got it now.
>
> > dm[overload, overload]=force means that we force the balance and we
> > will compute later the imbalance. avg_load may be used to calculate
> > the imbalance
> > dm[overload, overload]=avg_load means that we compare the avg_load to
> > decide whether we need to balance load between groups
> > dm[overload, overload]=nr_idle means that we compare the number of
> > idle cpus to decide whether we need to balance.  In fact this is no
> > more true with patch 7 because we also take into account the number of
> > nr_h_running when weight =1
>
> This becomes clearer now ... slowly.
>
> [...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-01 16:52       ` Dietmar Eggemann
  2019-10-02  6:44         ` Vincent Guittot
@ 2019-10-02  8:23         ` Vincent Guittot
  2019-10-02  9:24           ` Dietmar Eggemann
  1 sibling, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-02  8:23 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 01/10/2019 10:14, Vincent Guittot wrote:
> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 19/09/2019 09:33, Vincent Guittot wrote:
>
[...]

>
> >>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >>> +                     /*
> >>> +                      * When prefer sibling, evenly spread running tasks on
> >>> +                      * groups.
> >>> +                      */
> >>> +                     env->balance_type = migrate_task;
> >>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >>> +                     return;
> >>> +             }
> >>> +
> >>> +             /*
> >>> +              * If there is no overload, we just want to even the number of
> >>> +              * idle cpus.
> >>> +              */
> >>> +             env->balance_type = migrate_task;
> >>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> >>
> >> Why do we need a max_t(long, 0, ...) here and not for the 'if
> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> >
> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >
> > either we have sds->prefer_sibling && busiest->sum_nr_running >
> > local->sum_nr_running + 1
>
> I see, this corresponds to
>
> /* Try to move all excess tasks to child's sibling domain */
>        if (sds.prefer_sibling && local->group_type == group_has_spare &&
>            busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
>                goto force_balance;
>
> in find_busiest_group, I assume.

yes. But it seems that I missed a case:

- prefer_sibling is set
- busiest->sum_h_nr_running <= local->sum_h_nr_running + 1, so we skip
  the goto force_balance above
- but env->idle != CPU_NOT_IDLE and local->idle_cpus >
  (busiest->idle_cpus + 1), so we also skip the goto out_balanced and
  finally call calculate_imbalance()

In calculate_imbalance() with prefer_sibling set, imbalance =
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

so we probably want something similar to max_t(long, 0,
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)
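
One way to write that wrap-safely (a sketch only, not necessarily what
will be posted): do the subtraction in a signed type before clamping
and shifting, so the unsigned operands cannot wrap first.

		long nr_diff = (long)busiest->sum_h_nr_running -
			       (long)local->sum_h_nr_running;

		env->balance_type = migrate_task;
		env->imbalance = nr_diff > 0 ? nr_diff >> 1 : 0;
		return;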

>
> Haven't been able to recreate this yet on my arm64 platform since there
> is no prefer_sibling and in case local and busiest have
> group_type=group_has_spare they bailout in
>
>          if (busiest->group_type != group_overloaded &&
>               (env->idle == CPU_NOT_IDLE ||
>                local->idle_cpus <= (busiest->idle_cpus + 1)))
>                  goto out_balanced;
>
>
> [...]
>
> >>> -     if (busiest->group_type == group_overloaded &&
> >>> -         local->group_type   == group_overloaded) {
> >>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> >>> -             if (load_above_capacity > busiest->group_capacity) {
> >>> -                     load_above_capacity -= busiest->group_capacity;
> >>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
> >>> -                     load_above_capacity /= busiest->group_capacity;
> >>> -             } else
> >>> -                     load_above_capacity = ~0UL;
> >>> +     if (local->group_type < group_overloaded) {
> >>> +             /*
> >>> +              * Local will become overloaded so the avg_load metrics are
> >>> +              * finally needed.
> >>> +              */
> >>
> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> >> It would be nice to be able to match all allowed fields of dm to code sections.
> >
> > decision_matrix describes how it decides between balanced or unbalanced.
> > In case of dm[overload, overload], we use the avg_load to decide if it
> > is balanced or not
>
> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
> for 'sgs->group_type == group_overloaded'.
>
> > In case of dm[fully_busy, overload], the groups are unbalanced because
> > fully_busy < overload and we force the balance. Then
> > calculate_imbalance() uses the avg_load to decide how much will be
> > moved
>
> And in this case 'local->group_type < group_overloaded' in
> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
> calculated before using them in env->imbalance = min(...).
>
> OK, got it now.
>
> > dm[overload, overload]=force means that we force the balance and we
> > will compute later the imbalance. avg_load may be used to calculate
> > the imbalance
> > dm[overload, overload]=avg_load means that we compare the avg_load to
> > decide whether we need to balance load between groups
> > dm[overload, overload]=nr_idle means that we compare the number of
> > idle cpus to decide whether we need to balance.  In fact this is no
> > more true with patch 7 because we also take into account the number of
> > nr_h_running when weight =1
>
> This becomes clearer now ... slowly.
>
> [...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-01 17:47   ` Valentin Schneider
@ 2019-10-02  8:30     ` Vincent Guittot
  2019-10-02 10:47       ` Valentin Schneider
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-02  8:30 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Tue, 1 Oct 2019 at 19:47, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 19/09/2019 08:33, Vincent Guittot wrote:
>
> [...]
>
> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >   */
> >  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> >  {
> > -     unsigned long max_pull, load_above_capacity = ~0UL;
> >       struct sg_lb_stats *local, *busiest;
> >
> >       local = &sds->local_stat;
> >       busiest = &sds->busiest_stat;
> >
> > -     if (busiest->group_asym_packing) {
> > +     if (busiest->group_type == group_misfit_task) {
> > +             /* Set imbalance to allow misfit task to be balanced. */
> > +             env->balance_type = migrate_misfit;
> > +             env->imbalance = busiest->group_misfit_task_load;
> > +             return;
> > +     }
> > +
> > +     if (busiest->group_type == group_asym_packing) {
> > +             /*
> > +              * In case of asym capacity, we will try to migrate all load to
> > +              * the preferred CPU.
> > +              */
> > +             env->balance_type = migrate_load;
> >               env->imbalance = busiest->group_load;
> >               return;
> >       }
> >
> > +     if (busiest->group_type == group_imbalanced) {
> > +             /*
> > +              * In the group_imb case we cannot rely on group-wide averages
> > +              * to ensure CPU-load equilibrium, try to move any task to fix
> > +              * the imbalance. The next load balance will take care of
> > +              * balancing back the system.
> > +              */
> > +             env->balance_type = migrate_task;
> > +             env->imbalance = 1;
> > +             return;
> > +     }
> > +
> >       /*
> > -      * Avg load of busiest sg can be less and avg load of local sg can
> > -      * be greater than avg load across all sgs of sd because avg load
> > -      * factors in sg capacity and sgs with smaller group_type are
> > -      * skipped when updating the busiest sg:
> > +      * Try to use spare capacity of local group without overloading it or
> > +      * emptying busiest
> >        */
> > -     if (busiest->group_type != group_misfit_task &&
> > -         (busiest->avg_load <= sds->avg_load ||
> > -          local->avg_load >= sds->avg_load)) {
> > -             env->imbalance = 0;
> > +     if (local->group_type == group_has_spare) {
> > +             if (busiest->group_type > group_fully_busy) {
> > +                     /*
> > +                      * If busiest is overloaded, try to fill spare
> > +                      * capacity. This might end up creating spare capacity
> > +                      * in busiest or busiest still being overloaded but
> > +                      * there is no simple way to directly compute the
> > +                      * amount of load to migrate in order to balance the
> > +                      * system.
> > +                      */
> > +                     env->balance_type = migrate_util;
> > +                     env->imbalance = max(local->group_capacity, local->group_util) -
> > +                                 local->group_util;
> > +                     return;
> > +             }
> > +
> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > +                     /*
> > +                      * When prefer sibling, evenly spread running tasks on
> > +                      * groups.
> > +                      */
> > +                     env->balance_type = migrate_task;
> > +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>
> Isn't that one somewhat risky?
>
> Say both groups are classified group_has_spare and we do prefer_sibling.
> We'd select busiest as the one with the maximum number of busy CPUs, but it
> could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
> pinned tasks or wakeup failed to properly spread stuff).
>
> The thing should be unsigned so at least we save ourselves from right
> shifting a negative value, but we still end up with a gygornous imbalance
> (which we then store into env.imbalance which *is* signed... Urgh).

So it's not clear what happens with a right shift on a negative signed
value, and this seems to be compiler dependent, so even
max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1) might be wrong.

I'm going to update it.
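
As a general C note (illustration only, not from the patch): right-shifting
a negative signed value is implementation-defined, which is why any clamp
has to happen before the shift, e.g.:

	long x = -3;
	long a = x >> 1;			/* typically -2 (arithmetic shift), not guaranteed by ISO C */
	long b = max_t(long, 0, x) >> 1;	/* 0: clamp first, then shift a non-negative value */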


>
> [...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-02  6:44         ` Vincent Guittot
@ 2019-10-02  9:21           ` Dietmar Eggemann
  2019-10-08 13:02             ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Dietmar Eggemann @ 2019-10-02  9:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

On 02/10/2019 08:44, Vincent Guittot wrote:
> On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 01/10/2019 10:14, Vincent Guittot wrote:
>>> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> On 19/09/2019 09:33, Vincent Guittot wrote:
>>
>> [...]
>>
>>>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>>>>   {
>>>>>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>>>>>         struct task_struct *p;
>>>>> -     unsigned long load;
>>>>> +     unsigned long util, load;
>>>>
>>>> Minor: Order by length or reduce scope to while loop ?
>>>
>>> I don't get your point here
>>
>> Nothing dramatic here! Just
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d0c3aa1dc290..a08f342ead89 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
>>  static int detach_tasks(struct lb_env *env)
>>  {
>>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>> -       struct task_struct *p;
>>         unsigned long load, util;
>> +       struct task_struct *p;
> 
> hmm... I still don't get this.
> We usually gather pointers instead of interleaving them with other varaiables

I thought we should always order local variable declarations from
longest to shortest line but can't find this rule in coding-style.rst
either.

[...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-02  8:23         ` Vincent Guittot
@ 2019-10-02  9:24           ` Dietmar Eggemann
  0 siblings, 0 replies; 57+ messages in thread
From: Dietmar Eggemann @ 2019-10-02  9:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

On 02/10/2019 10:23, Vincent Guittot wrote:
> On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 01/10/2019 10:14, Vincent Guittot wrote:
>>> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> On 19/09/2019 09:33, Vincent Guittot wrote:
>>
> [...]
> 
>>
>>>>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
>>>>> +                     /*
>>>>> +                      * When prefer sibling, evenly spread running tasks on
>>>>> +                      * groups.
>>>>> +                      */
>>>>> +                     env->balance_type = migrate_task;
>>>>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>>>> +                     return;
>>>>> +             }
>>>>> +
>>>>> +             /*
>>>>> +              * If there is no overload, we just want to even the number of
>>>>> +              * idle cpus.
>>>>> +              */
>>>>> +             env->balance_type = migrate_task;
>>>>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>>>>
>>>> Why do we need a max_t(long, 0, ...) here and not for the 'if
>>>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
>>>
>>> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>>
>>> either we have sds->prefer_sibling && busiest->sum_nr_running >
>>> local->sum_nr_running + 1
>>
>> I see, this corresponds to
>>
>> /* Try to move all excess tasks to child's sibling domain */
>>        if (sds.prefer_sibling && local->group_type == group_has_spare &&
>>            busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
>>                goto force_balance;
>>
>> in find_busiest_group, I assume.
> 
> yes. But it seems that I missed a case:
> 
> prefer_sibling is set
> busiest->sum_h_nr_running <= local->sum_h_nr_running + 1 so we skip
> goto force_balance above
> But env->idle != CPU_NOT_IDLE  and local->idle_cpus >
> (busiest->idle_cpus + 1) so we also skip goto out_balance and finally
> call calculate_imbalance()
> 
> in calculate_imbalance with prefer_sibling set, imbalance =
> (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> 
> so we probably want something similar to max_t(long, 0,
> (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)

Makes sense.

Caught a couple of

[  369.310464] 0-3->4-7 2->5 env->imbalance = 2147483646
[  369.310796] 0-3->4-7 2->4 env->imbalance = 2147483647

in this if condition on h620 running hackbench.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-02  8:30     ` Vincent Guittot
@ 2019-10-02 10:47       ` Valentin Schneider
  2019-10-08 14:16         ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Valentin Schneider @ 2019-10-02 10:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On 02/10/2019 09:30, Vincent Guittot wrote:
>> Isn't that one somewhat risky?
>>
>> Say both groups are classified group_has_spare and we do prefer_sibling.
>> We'd select busiest as the one with the maximum number of busy CPUs, but it
>> could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
>> pinned tasks or wakeup failed to properly spread stuff).
>>
>> The thing should be unsigned so at least we save ourselves from right
>> shifting a negative value, but we still end up with a gygornous imbalance
>> (which we then store into env.imbalance which *is* signed... Urgh).
> 
> so it's not clear what happen with a right shift on negative signed
> value and this seems to be compiler dependent so even
> max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1) might be wrong
> 

Yeah, right shifts on negative signed values are implementation-defined. This
is what I was worried about initially, but I think the expression resulting
from the subtraction is unsigned (both terms are unsigned), so this would
just wrap when busiest < local - but that is still a problem.


((local->idle_cpus - busiest->idle_cpus) >> 1) should be fine because we do
have this check in find_busiest_group() before heading off to
calculate_imbalance():

  if (busiest->group_type != group_overloaded &&
      (env->idle == CPU_NOT_IDLE ||
       local->idle_cpus <= (busiest->idle_cpus + 1)))
	  /* ... */
	  goto out_balanced;

which ensures the subtraction will be at least 2. We're missing something
equivalent for the sum_h_nr_running case.
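
A hypothetical illustration of such a guarantee for the prefer_sibling
case on the find_busiest_group() side (not from the series; the
busiest->group_weight == 1 case would still need its own clamp, e.g. in
calculate_imbalance()):

	/*
	 * With prefer_sibling, the imbalance is derived from
	 * busiest->sum_h_nr_running - local->sum_h_nr_running; make sure
	 * it cannot go "negative" (i.e. wrap) before we get there.
	 */
	if (sds.prefer_sibling && local->group_type == group_has_spare &&
	    busiest->sum_h_nr_running <= local->sum_h_nr_running)
		goto out_balanced;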

> I'm going to update it
> 
> 
>>
>> [...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path
  2019-09-19  7:33 ` [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path Vincent Guittot
@ 2019-10-07 15:14   ` Rik van Riel
  2019-10-07 15:27     ` Vincent Guittot
  0 siblings, 1 reply; 57+ messages in thread
From: Rik van Riel @ 2019-10-07 15:14 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> runnable load has been introduced to take into account the case where
> blocked load biases the wake up path which may end to select an
> overloaded
> CPU with a large number of runnable tasks instead of an underutilized
> CPU with a huge blocked load.
> 
> Tha wake up path now starts to looks for idle CPUs before comparing
> runnable load and it's worth aligning the wake up path with the
> load_balance.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

On a single socket system, patches 9 & 10 have the
result of driving a woken up task (when wake_wide is
true) to the CPU core with the lowest blocked load,
even when there is an idle core the task could run on
right now.

With the whole series applied, I see a 1-2% regression
in CPU use due to that issue.

With only patches 1-8 applied, I see a 1% improvement in
CPU use for that same workload.

Given that it looks like select_idle_sibling and
find_idlest_group_cpu do roughly the same thing, I
wonder if it is enough to simply add an additional
test to find_idlest_group to have it return the
LLC sg, if it is called on the LLC sd on a single
socket system.

That way find_idlest_group_cpu can still find an
idle core like it does today.
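
A minimal sketch of that idea (hypothetical: the flag test and the
local_group name are stand-ins, not code from the series):

	/* hypothetical early-out at the top of find_idlest_group() */
	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
		/*
		 * sd spans a single LLC on a single socket system:
		 * return the local group and let find_idlest_group_cpu()
		 * look for an idle core in it, as it does today.
		 */
		return local_group;
	}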

Does that seem like a reasonable thing?

I can run tests with that :)

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path
  2019-10-07 15:14   ` Rik van Riel
@ 2019-10-07 15:27     ` Vincent Guittot
  2019-10-07 18:06       ` Rik van Riel
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-07 15:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Hillf Danton

On Mon, 7 Oct 2019 at 17:14, Rik van Riel <riel@surriel.com> wrote:
>
> On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> > runnable load has been introduced to take into account the case where
> > blocked load biases the wake up path which may end to select an
> > overloaded
> > CPU with a large number of runnable tasks instead of an underutilized
> > CPU with a huge blocked load.
> >
> > Tha wake up path now starts to looks for idle CPUs before comparing
> > runnable load and it's worth aligning the wake up path with the
> > load_balance.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> On a single socket system, patches 9 & 10 have the
> result of driving a woken up task (when wake_wide is
> true) to the CPU core with the lowest blocked load,
> even when there is an idle core the task could run on
> right now.
>
> With the whole series applied, I see a 1-2% regression
> in CPU use due to that issue.
>
> With only patches 1-8 applied, I see a 1% improvement in
> CPU use for that same workload.

Thanks for testing.
Patches 8-9 have just replaced runnable load by blocked load and then
removed the duplicated metrics in find_idlest_group.
I'm preparing an additional patch that reworks find_idlest_group() to
behave similarly to find_busiest_group(): it gathers statistics as it
already does, then classifies the groups and finally selects the idlest
one. This should fix the problem that you mentioned above, where it
selects the group with the lowest blocked load even though there are
idle CPUs in another group with a high blocked load.
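
In outline (placeholder structure only, not the posted code), the
find_busiest_group()-like shape described above would be:

	/* for each group in the sched_domain: */
	do {
		/* 1) gather per-group statistics (load, util, h_nr_running, idle CPUs) */
		/* 2) classify the group (group_has_spare, group_fully_busy, ...)       */
		/* 3) remember it if it is a better "idlest" candidate than the current */
	} while (group = group->next, group != sd->groups);

	/* finally compare the idlest candidate against the local group and decide */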

>
> Given that it looks like select_idle_sibling and
> find_idlest_group_cpu do roughly the same thing, I
> wonder if it is enough to simply add an additional
> test to find_idlest_group to have it return the
> LLC sg, if it is called on the LLC sd on a single
> socket system.

That makes sense to me.

>
> That way find_idlest_group_cpu can still find an
> idle core like it does today.
>
> Does that seem like a reasonable thing?

That's worth testing

>
> I can run tests with that :)
>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path
  2019-10-07 15:27     ` Vincent Guittot
@ 2019-10-07 18:06       ` Rik van Riel
  0 siblings, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2019-10-07 18:06 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Hillf Danton

[-- Attachment #1: Type: text/plain, Size: 2450 bytes --]

On Mon, 2019-10-07 at 17:27 +0200, Vincent Guittot wrote:
> On Mon, 7 Oct 2019 at 17:14, Rik van Riel <riel@surriel.com> wrote:
> > On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> > > runnable load has been introduced to take into account the case
> > > where
> > > blocked load biases the wake up path which may end to select an
> > > overloaded
> > > CPU with a large number of runnable tasks instead of an
> > > underutilized
> > > CPU with a huge blocked load.
> > > 
> > > Tha wake up path now starts to looks for idle CPUs before
> > > comparing
> > > runnable load and it's worth aligning the wake up path with the
> > > load_balance.
> > > 
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > 
> > On a single socket system, patches 9 & 10 have the
> > result of driving a woken up task (when wake_wide is
> > true) to the CPU core with the lowest blocked load,
> > even when there is an idle core the task could run on
> > right now.
> > 
> > With the whole series applied, I see a 1-2% regression
> > in CPU use due to that issue.
> > 
> > With only patches 1-8 applied, I see a 1% improvement in
> > CPU use for that same workload.
> 
> Thanks for testing.
> Patches 9 and 10 have just replaced runnable load with blocked load and
> then removed the duplicated metrics in find_idlest_group().
> I'm preparing an additional patch that reworks find_idlest_group() to
> behave similarly to find_busiest_group(): it gathers statistics as it
> already does, then classifies the groups and finally selects the
> idlest one. This should fix the problem you mention above, where it
> selects the group with the lowest blocked load even though there are
> idle CPUs in another group with a higher blocked load.

That should do the trick!

> > Given that it looks like select_idle_sibling and
> > find_idlest_group_cpu do roughly the same thing, I
> > wonder if it is enough to simply add an additional
> > test to find_idlest_group to have it return the
> > LLC sg, if it is called on the LLC sd on a single
> > socket system.
> 
> That makes sense to me
> 
> > That way find_idlest_group_cpu can still find an
> > idle core like it does today.
> > 
> > Does that seem like a reasonable thing?
> 
> That's worth testing

I'll give it a try.

Doing the full find_idlest_group heuristic
inside an LLC seems like it would be overkill,
anyway.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-02  9:21           ` Dietmar Eggemann
@ 2019-10-08 13:02             ` Peter Zijlstra
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2019-10-08 13:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Morten Rasmussen, Hillf Danton

On Wed, Oct 02, 2019 at 11:21:20AM +0200, Dietmar Eggemann wrote:

> I thought we should always order local variable declarations from
> longest to shortest line but can't find this rule in coding-style.rst
> either.

You're right though, that is generally encouraged. From last year's
(2018) KS there was the notion of a subsystem handbook, and the tip
tree's submissions thereto can be found there:

  https://lore.kernel.org/lkml/20181107171149.165693799@linutronix.de/

But for some raisin that never actually went anywhere...
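
For anyone not familiar with the convention being referenced: the idea
is "reverse christmas tree" ordering, i.e. local variable declarations
sorted from the longest line to the shortest. A tiny made-up
illustration (dummy types, not kernel code):

/* Dummy types, for illustration of the declaration ordering only. */
struct dummy_sched_domain { int span_weight; };

static int example(struct dummy_sched_domain *sd)
{
	struct dummy_sched_domain *parent_domain = sd;	/* longest line first */
	unsigned long iteration_flags = 0;
	int cpu = 0;					/* shortest line last */

	return parent_domain->span_weight + (int)iteration_flags + cpu;
}

int main(void)
{
	struct dummy_sched_domain d = { .span_weight = 4 };

	return example(&d) == 4 ? 0 : 1;
}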

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-02 10:47       ` Valentin Schneider
@ 2019-10-08 14:16         ` Peter Zijlstra
  2019-10-08 14:34           ` Valentin Schneider
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2019-10-08 14:16 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:

> Yeah, right shift on signed negative values are implementation defined.

Seriously? Even under -fno-strict-overflow? There is a perfectly
sensible operation for signed shift right, this stuff should not be
undefined.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (9 preceding siblings ...)
  2019-09-19  7:33 ` [PATCH v3 10/10] sched/fair: optimize find_idlest_group Vincent Guittot
@ 2019-10-08 14:32 ` Phil Auld
  2019-10-08 15:53   ` Vincent Guittot
  2019-10-16  7:21 ` Parth Shah
  11 siblings, 1 reply; 57+ messages in thread
From: Phil Auld @ 2019-10-08 14:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, peterz, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton

Hi Vincent,

On Thu, Sep 19, 2019 at 09:33:31AM +0200 Vincent Guittot wrote:
> Several wrong task placement have been raised with the current load
> balance algorithm but their fixes are not always straight forward and
> end up with using biased values to force migrations. A cleanup and rework
> of the load balance will help to handle such UCs and enable to fine grain
> the behavior of the scheduler for other cases.
> 
> Patch 1 has already been sent separately and only consolidate asym policy
> in one place and help the review of the changes in load_balance.
> 
> Patch 2 renames the sum of h_nr_running in stats.
> 
> Patch 3 removes meaningless imbalance computation to make review of
> patch 4 easier.
> 
> Patch 4 reworks load_balance algorithm and fixes some wrong task placement
> but try to stay conservative.
> 
> Patch 5 add the sum of nr_running to monitor non cfs tasks and take that
> into account when pulling tasks.
> 
> Patch 6 replaces runnable_load by load now that the signal is only used
> when overloaded.
> 
> Patch 7 improves the spread of tasks at the 1st scheduling level.
> 
> Patch 8 uses utilization instead of load in all steps of misfit task
> path.
> 
> Patch 9 replaces runnable_load_avg by load_avg in the wake up path.
> 
> Patch 10 optimizes find_idlest_group() that was using both runnable_load
> and load. This has not been squashed with previous patch to ease the
> review.
> 
> Some benchmarks results based on 8 iterations of each tests:
> - small arm64 dual quad cores system
> 
>            tip/sched/core        w/ this patchset    improvement
> schedpipe      54981 +/-0.36%        55459 +/-0.31%   (+0.97%)
> 
> hackbench
> 1 groups       0.906 +/-2.34%        0.906 +/-2.88%   (+0.06%)
> 
> - large arm64 2 nodes / 224 cores system
> 
>            tip/sched/core        w/ this patchset    improvement
> schedpipe     125323 +/-0.98%       125624 +/-0.71%   (+0.24%)
> 
> hackbench -l (256000/#grp) -g #grp
> 1 groups      15.360 +/-1.76%       14.206 +/-1.40%   (+8.69%)
> 4 groups       5.822 +/-1.02%        5.508 +/-6.45%   (+5.38%)
> 16 groups      3.103 +/-0.80%        3.244 +/-0.77%   (-4.52%)
> 32 groups      2.892 +/-1.23%        2.850 +/-1.81%   (+1.47%)
> 64 groups      2.825 +/-1.51%        2.725 +/-1.51%   (+3.54%)
> 128 groups     3.149 +/-8.46%        3.053 +/-13.15%  (+3.06%)
> 256 groups     3.511 +/-8.49%        3.019 +/-1.71%  (+14.03%)
> 
> dbench
> 1 groups     329.677 +/-0.46%      329.771 +/-0.11%   (+0.03%)
> 4 groups     931.499 +/-0.79%      947.118 +/-0.94%   (+1.68%)
> 16 groups   1924.210 +/-0.89%     1947.849 +/-0.76%   (+1.23%)
> 32 groups   2350.646 +/-5.75%     2351.549 +/-6.33%   (+0.04%)
> 64 groups   2201.524 +/-3.35%     2192.749 +/-5.84%   (-0.40%)
> 128 groups  2206.858 +/-2.50%     2376.265 +/-7.44%   (+7.68%)
> 256 groups  1263.520 +/-3.34%     1633.143 +/-13.02% (+29.25%)
> 
> tip/sched/core sha1:
>   0413d7f33e60 ('sched/uclamp: Always use 'enum uclamp_id' for clamp_id values')
> 
> Changes since v2:
> - fix typo and reorder code
> - some minor code fixes
> - optimize the find_idles_group()
> 
> Not covered in this patchset:
> - update find_idlest_group() to be more aligned with load_balance(). I didn't
>   want to delay this version because of this update which is not ready yet
> - Better detection of overloaded and fully busy state, especially for cases
>   when nr_running > nr CPUs.
> 
> Vincent Guittot (8):
>   sched/fair: clean up asym packing
>   sched/fair: rename sum_nr_running to sum_h_nr_running
>   sched/fair: remove meaningless imbalance calculation
>   sched/fair: rework load_balance
>   sched/fair: use rq->nr_running when balancing load
>   sched/fair: use load instead of runnable load in load_balance
>   sched/fair: evenly spread tasks when not overloaded
>   sched/fair: use utilization to select misfit task
>   sched/fair: use load instead of runnable load in wakeup path
>   sched/fair: optimize find_idlest_group
> 
>  kernel/sched/fair.c | 805 +++++++++++++++++++++++++++-------------------------
>  1 file changed, 417 insertions(+), 388 deletions(-)
> 
> -- 
> 2.7.4
> 

We've been testing v3 and for the most part everything looks good. The
group imbalance issues are fixed on all of our test systems except one.

The one is an 8-node Intel system with 160 CPUs. I'll put the system
details at the end.

The table below shows the average number of benchmark threads running
on each node through the run, that is, not including the 2 stress jobs.
The end result is a 4x slowdown in the cgroup case versus the
single-cgroup case. The 152 and 156 are the number of LU threads in the
run. In all cases there are 2 stress CPU threads running, either in
their own cgroups (GROUP) or with everything in one cgroup (NORMAL).
The NORMAL case is pretty well balanced, with only a few nodes >= 20
and those only a little over. In the GROUP cases things are not so
good. There are some nodes > 30, for example, and others < 10.


lu.C.x_152_GROUP_1   17.52  16.86  17.90  18.52  20.00  19.00  22.00  20.19
lu.C.x_152_GROUP_2   15.70  15.04  15.65  15.72  23.30  28.98  20.09  17.52
lu.C.x_152_GROUP_3   27.72  32.79  22.89  22.62  11.01  12.90  12.14  9.93
lu.C.x_152_GROUP_4   18.13  18.87  18.40  17.87  18.80  19.93  20.40  19.60
lu.C.x_152_GROUP_5   24.14  26.46  20.92  21.43  14.70  16.05  15.14  13.16
lu.C.x_152_NORMAL_1  21.03  22.43  20.27  19.97  18.37  18.80  16.27  14.87
lu.C.x_152_NORMAL_2  19.24  18.29  18.41  17.41  19.71  19.00  20.29  19.65
lu.C.x_152_NORMAL_3  19.43  20.00  19.05  20.24  18.76  17.38  18.52  18.62
lu.C.x_152_NORMAL_4  17.19  18.25  17.81  18.69  20.44  19.75  20.12  19.75
lu.C.x_152_NORMAL_5  19.25  19.56  19.12  19.56  19.38  19.38  18.12  17.62

lu.C.x_156_GROUP_1   18.62  19.31  18.38  18.77  19.88  21.35  19.35  20.35
lu.C.x_156_GROUP_2   15.58  12.72  14.96  14.83  20.59  19.35  29.75  28.22
lu.C.x_156_GROUP_3   20.05  18.74  19.63  18.32  20.26  20.89  19.53  18.58
lu.C.x_156_GROUP_4   14.77  11.42  13.01  10.09  27.05  33.52  23.16  22.98
lu.C.x_156_GROUP_5   14.94  11.45  12.77  10.52  28.01  33.88  22.37  22.05
lu.C.x_156_NORMAL_1  20.00  20.58  18.47  18.68  19.47  19.74  19.42  19.63
lu.C.x_156_NORMAL_2  18.52  18.48  18.83  18.43  20.57  20.48  20.61  20.09
lu.C.x_156_NORMAL_3  20.27  20.00  20.05  21.18  19.55  19.00  18.59  17.36
lu.C.x_156_NORMAL_4  19.65  19.60  20.25  20.75  19.35  20.10  19.00  17.30
lu.C.x_156_NORMAL_5  19.79  19.67  20.62  22.42  18.42  18.00  17.67  19.42


From what I can see this was better but not perfect in v1.  It was closer and 
so the end results (LU reported times and op/s) were close enough. But looking 
closer at it there are still some issues. (NORMAL is comparable to above)


lu.C.x_152_GROUP_1   18.08  18.17  19.58  19.29  19.25  17.50  21.46  18.67
lu.C.x_152_GROUP_2   17.12  17.48  17.88  17.62  19.57  17.31  23.00  22.02
lu.C.x_152_GROUP_3   17.82  17.97  18.12  18.18  24.55  22.18  16.97  16.21
lu.C.x_152_GROUP_4   18.47  19.08  18.50  18.66  21.45  25.00  15.47  15.37
lu.C.x_152_GROUP_5   20.46  20.71  27.38  24.75  17.06  16.65  12.81  12.19

lu.C.x_156_GROUP_1   18.70  18.80  20.25  19.50  20.45  20.30  19.55  18.45
lu.C.x_156_GROUP_2   19.29  19.90  17.71  18.10  20.76  21.57  19.81  18.86
lu.C.x_156_GROUP_3   25.09  29.19  21.83  21.33  18.67  18.57  11.03  10.29
lu.C.x_156_GROUP_4   18.60  19.10  19.20  18.70  20.30  20.00  19.70  20.40
lu.C.x_156_GROUP_5   18.58  18.9   18.63  18.16  17.32  19.37  23.92  21.08

There is high variance so it may not be anything specific between v1 and v3 here.

The initial fixes I made for this issue did not exhibit this behavior. They 
would have had other issues dealing with overload cases though. In this case 
however there are only 154 or 158 threads on 160 CPUs so not overloaded. 

I'll try to get my hands on this system and poke into it. I just wanted to get 
your thoughts and let you know where we are. 



Thanks,
Phil


System details:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              160
On-line CPU(s) list: 0-159
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           8
NUMA node(s):        8
Vendor ID:           GenuineIntel
CPU family:          6
Model:               47
Model name:          Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
Stepping:            2
CPU MHz:             1063.934
BogoMIPS:            4787.73
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            30720K
NUMA node0 CPU(s):   0-9,80-89
NUMA node1 CPU(s):   10-19,90-99
NUMA node2 CPU(s):   20-29,100-109
NUMA node3 CPU(s):   30-39,110-119
NUMA node4 CPU(s):   40-49,120-129
NUMA node5 CPU(s):   50-59,130-139
NUMA node6 CPU(s):   60-69,140-149
NUMA node7 CPU(s):   70-79,150-159
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 80 81 82 83 84 85 86 87 88 89
node 0 size: 64177 MB
node 0 free: 60866 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 90 91 92 93 94 95 96 97 98 99
node 1 size: 64507 MB
node 1 free: 61167 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 100 101 102 103 104 105 106 107 108
109
node 2 size: 64507 MB
node 2 free: 61250 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39 110 111 112 113 114 115 116 117 118
119
node 3 size: 64507 MB
node 3 free: 61327 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49 120 121 122 123 124 125 126 127 128
129
node 4 size: 64507 MB
node 4 free: 60993 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59 130 131 132 133 134 135 136 137 138
139
node 5 size: 64507 MB
node 5 free: 60892 MB
node 6 cpus: 60 61 62 63 64 65 66 67 68 69 140 141 142 143 144 145 146 147 148
149
node 6 size: 64507 MB
node 6 free: 61139 MB
node 7 cpus: 70 71 72 73 74 75 76 77 78 79 150 151 152 153 154 155 156 157 158
159
node 7 size: 64480 MB
node 7 free: 61188 MB
node distances:
node   0   1   2   3   4   5   6   7  
 0:  10  12  17  17  19  19  19  19  
 1:  12  10  17  17  19  19  19  19  
 2:  17  17  10  12  19  19  19  19  
 3:  17  17  12  10  19  19  19  19  
 4:  19  19  19  19  10  12  17  17  
 5:  19  19  19  19  12  10  17  17  
 6:  19  19  19  19  17  17  10  12  
 7:  19  19  19  19  17  17  12  10



-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 14:16         ` Peter Zijlstra
@ 2019-10-08 14:34           ` Valentin Schneider
  2019-10-08 15:30             ` Vincent Guittot
  2019-10-08 16:33             ` Peter Zijlstra
  0 siblings, 2 replies; 57+ messages in thread
From: Valentin Schneider @ 2019-10-08 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On 08/10/2019 15:16, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
> 
>> Yeah, right shift on signed negative values are implementation defined.
> 
> Seriously? Even under -fno-strict-overflow? There is a perfectly
> sensible operation for signed shift right, this stuff should not be
> undefined.
> 

Mmm good point. I didn't see anything relevant in the description of that
flag. All that my copy of the C99 standard (draft) says at 6.5.7.5 is:

"""
The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
the resulting value is implementation-defined.
"""

Arithmetic shift would make sense, but I think this stems from two's
complement not being imposed: 6.2.6.2.2 says the sign can be represented
with sign + magnitude, two's complement, or ones' complement...

I suppose when you really just want a division you should ask for division
semantics - i.e. use '/'. I'd expect compilers to be smart enough to turn
that into a shift if a power of 2 is involved, and to do something else
if negative values can be involved.
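
As a concrete stand-alone illustration of the difference (assuming the
usual gcc/clang behavior of arithmetic right shift for signed types,
which is exactly the implementation-defined part):

#include <stdio.h>

int main(void)
{
	int x = -3;

	/*
	 * With the arithmetic right shift that gcc/clang implement,
	 * -3 >> 1 is -2 (rounds toward negative infinity), while C99
	 * defines -3 / 2 as -1 (rounds toward zero). So '>>' and '/'
	 * are not interchangeable once negative values can show up.
	 */
	printf("%d %d\n", x >> 1, x / 2);	/* prints: -2 -1 */

	return 0;
}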

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 14:34           ` Valentin Schneider
@ 2019-10-08 15:30             ` Vincent Guittot
  2019-10-08 15:48               ` Valentin Schneider
  2019-10-08 17:39               ` Peter Zijlstra
  2019-10-08 16:33             ` Peter Zijlstra
  1 sibling, 2 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-10-08 15:30 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

Le Tuesday 08 Oct 2019 à 15:34:04 (+0100), Valentin Schneider a écrit :
> On 08/10/2019 15:16, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
> > 
> >> Yeah, right shift on signed negative values are implementation defined.
> > 
> > Seriously? Even under -fno-strict-overflow? There is a perfectly
> > sensible operation for signed shift right, this stuff should not be
> > undefined.
> > 
> 
> Mmm good point. I didn't see anything relevant in the description of that
> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:
> 
> """
> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
> the resulting value is implementation-defined.
> """
> 
> Arithmetic shift would make sense, but I think this stems from twos'
> complement not being imposed: 6.2.6.2.2 says sign can be done with
> sign + magnitude, twos complement or ones' complement...
> 
> I suppose when you really just want a division you should ask for division
> semantics - i.e. use '/'. I'd expect compilers to be smart enough to turn
> that into a shift if a power of 2 is involved, and to do something else
> if negative values can be involved.

This is how I plan to get rid of the problem:
+		if (busiest->group_weight == 1 || sds->prefer_sibling) {
+			unsigned int nr_diff = busiest->sum_h_nr_running;
+			/*
+			 * When prefer sibling, evenly spread running tasks on
+			 * groups.
+			 */
+			env->migration_type = migrate_task;
+			lsub_positive(&nr_diff, local->sum_h_nr_running);
+			env->imbalance = nr_diff >> 1;
+			return;
+		}
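
For readers not familiar with the helper: lsub_positive() subtracts a
value from an lvalue but clamps the result at zero rather than letting
an unsigned quantity wrap. A minimal user-space approximation of that
behavior (not the kernel macro itself) is:

#include <stdio.h>

/* Rough user-space stand-in for lsub_positive(): subtract, clamp at 0. */
#define lsub_positive(ptr, val) do {				\
	__typeof__(*(ptr)) __val = (val);			\
	*(ptr) -= (*(ptr) < __val) ? *(ptr) : __val;		\
} while (0)

int main(void)
{
	unsigned int nr_diff = 2;	/* e.g. busiest->sum_h_nr_running */

	/* Local group has more tasks: clamp to 0 instead of wrapping. */
	lsub_positive(&nr_diff, 4u);
	printf("%u\n", nr_diff >> 1);	/* prints 0, not a huge bogus value */

	return 0;
}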


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 15:30             ` Vincent Guittot
@ 2019-10-08 15:48               ` Valentin Schneider
  2019-10-08 17:39               ` Peter Zijlstra
  1 sibling, 0 replies; 57+ messages in thread
From: Valentin Schneider @ 2019-10-08 15:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On 08/10/2019 16:30, Vincent Guittot wrote:
[...]
> 
> This is how I plan to get ride of the problem:
> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +			unsigned int nr_diff = busiest->sum_h_nr_running;
> +			/*
> +			 * When prefer sibling, evenly spread running tasks on
> +			 * groups.
> +			 */
> +			env->migration_type = migrate_task;
> +			lsub_positive(&nr_diff, local->sum_h_nr_running);
> +			env->imbalance = nr_diff >> 1;
> +			return;
> +		}
> 

I think this wants a 

/* Local could have more tasks than busiest */

atop the lsub, otherwise yeah that ought to work.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance
  2019-10-08 14:32 ` [PATCH v3 0/8] sched/fair: rework the CFS load balance Phil Auld
@ 2019-10-08 15:53   ` Vincent Guittot
  2019-10-09 19:33     ` Phil Auld
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-08 15:53 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

Hi Phil,

On Tue, 8 Oct 2019 at 16:33, Phil Auld <pauld@redhat.com> wrote:
>
> Hi Vincent,
>
> On Thu, Sep 19, 2019 at 09:33:31AM +0200 Vincent Guittot wrote:
> > Several wrong task placement have been raised with the current load
> > balance algorithm but their fixes are not always straight forward and
> > end up with using biased values to force migrations. A cleanup and rework
> > of the load balance will help to handle such UCs and enable to fine grain
> > the behavior of the scheduler for other cases.
> >

[...]

> >
>
> We've been testing v3 and for the most part everything looks good. The
> group imbalance issues are fixed on all of our test systems except one.
>
> The one is an 8-node intel system with 160 cpus. I'll put the system
> details at the end.
>
> This shows the average number of benchmark threads running on each node
> through the run. That is, not including the 2 stress jobs. The end
> results are a 4x slow down in the cgroup case versus not. The 152 and
> 156 are the number of LU threads in the run. In all cases there are 2
> stress CPU threads running either in their own cgroups (GROUP) or
> everything is in one cgroup (NORMAL).  The normal case is pretty well
> balanced with only a few >= 20 and those that are are only a little
> over. In the GROUP cases things are not so good. There are some > 30
> for example, and others < 10.
>
>
> lu.C.x_152_GROUP_1   17.52  16.86  17.90  18.52  20.00  19.00  22.00  20.19
> lu.C.x_152_GROUP_2   15.70  15.04  15.65  15.72  23.30  28.98  20.09  17.52
> lu.C.x_152_GROUP_3   27.72  32.79  22.89  22.62  11.01  12.90  12.14  9.93
> lu.C.x_152_GROUP_4   18.13  18.87  18.40  17.87  18.80  19.93  20.40  19.60
> lu.C.x_152_GROUP_5   24.14  26.46  20.92  21.43  14.70  16.05  15.14  13.16
> lu.C.x_152_NORMAL_1  21.03  22.43  20.27  19.97  18.37  18.80  16.27  14.87
> lu.C.x_152_NORMAL_2  19.24  18.29  18.41  17.41  19.71  19.00  20.29  19.65
> lu.C.x_152_NORMAL_3  19.43  20.00  19.05  20.24  18.76  17.38  18.52  18.62
> lu.C.x_152_NORMAL_4  17.19  18.25  17.81  18.69  20.44  19.75  20.12  19.75
> lu.C.x_152_NORMAL_5  19.25  19.56  19.12  19.56  19.38  19.38  18.12  17.62
>
> lu.C.x_156_GROUP_1   18.62  19.31  18.38  18.77  19.88  21.35  19.35  20.35
> lu.C.x_156_GROUP_2   15.58  12.72  14.96  14.83  20.59  19.35  29.75  28.22
> lu.C.x_156_GROUP_3   20.05  18.74  19.63  18.32  20.26  20.89  19.53  18.58
> lu.C.x_156_GROUP_4   14.77  11.42  13.01  10.09  27.05  33.52  23.16  22.98
> lu.C.x_156_GROUP_5   14.94  11.45  12.77  10.52  28.01  33.88  22.37  22.05
> lu.C.x_156_NORMAL_1  20.00  20.58  18.47  18.68  19.47  19.74  19.42  19.63
> lu.C.x_156_NORMAL_2  18.52  18.48  18.83  18.43  20.57  20.48  20.61  20.09
> lu.C.x_156_NORMAL_3  20.27  20.00  20.05  21.18  19.55  19.00  18.59  17.36
> lu.C.x_156_NORMAL_4  19.65  19.60  20.25  20.75  19.35  20.10  19.00  17.30
> lu.C.x_156_NORMAL_5  19.79  19.67  20.62  22.42  18.42  18.00  17.67  19.42
>
>
> From what I can see this was better but not perfect in v1.  It was closer and
> so the end results (LU reported times and op/s) were close enough. But looking
> closer at it there are still some issues. (NORMAL is comparable to above)
>
>
> lu.C.x_152_GROUP_1   18.08  18.17  19.58  19.29  19.25  17.50  21.46  18.67
> lu.C.x_152_GROUP_2   17.12  17.48  17.88  17.62  19.57  17.31  23.00  22.02
> lu.C.x_152_GROUP_3   17.82  17.97  18.12  18.18  24.55  22.18  16.97  16.21
> lu.C.x_152_GROUP_4   18.47  19.08  18.50  18.66  21.45  25.00  15.47  15.37
> lu.C.x_152_GROUP_5   20.46  20.71  27.38  24.75  17.06  16.65  12.81  12.19
>
> lu.C.x_156_GROUP_1   18.70  18.80  20.25  19.50  20.45  20.30  19.55  18.45
> lu.C.x_156_GROUP_2   19.29  19.90  17.71  18.10  20.76  21.57  19.81  18.86
> lu.C.x_156_GROUP_3   25.09  29.19  21.83  21.33  18.67  18.57  11.03  10.29
> lu.C.x_156_GROUP_4   18.60  19.10  19.20  18.70  20.30  20.00  19.70  20.40
> lu.C.x_156_GROUP_5   18.58  18.9   18.63  18.16  17.32  19.37  23.92  21.08
>
> There is high variance so it may not be anything specific between v1 and v3 here.

While preparing v4, I have noticed that I had probably oversimplified
the end of find_idlest_group() in patch "sched/fair: optimize
find_idlest_group" when it compares local vs the idlest other group.
In particular, there was a NUMA-specific test that I removed in v3 and
re-added in v4.

Then, I'm also preparing a full rework of find_idlest_group() so that
it behaves more closely to load_balance(); I mean: collect statistics,
classify the groups, then select the idlest one.

What is the behavior of the lu.C threads? Are they waking up a lot, and
could they trigger the slow wake path?

>
> The initial fixes I made for this issue did not exhibit this behavior. They
> would have had other issues dealing with overload cases though. In this case
> however there are only 154 or 158 threads on 160 CPUs so not overloaded.
>
> I'll try to get my hands on this system and poke into it. I just wanted to get
> your thoughts and let you know where we are.

Thanks for testing

Vincent

>
>
>
> Thanks,
> Phil
>
>
> System details:
>
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              160
> On-line CPU(s) list: 0-159
> Thread(s) per core:  2
> Core(s) per socket:  10
> Socket(s):           8
> NUMA node(s):        8
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               47
> Model name:          Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
> Stepping:            2
> CPU MHz:             1063.934
> BogoMIPS:            4787.73
> Virtualization:      VT-x
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            30720K
> NUMA node0 CPU(s):   0-9,80-89
> NUMA node1 CPU(s):   10-19,90-99
> NUMA node2 CPU(s):   20-29,100-109
> NUMA node3 CPU(s):   30-39,110-119
> NUMA node4 CPU(s):   40-49,120-129
> NUMA node5 CPU(s):   50-59,130-139
> NUMA node6 CPU(s):   60-69,140-149
> NUMA node7 CPU(s):   70-79,150-159
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
>
> $ numactl --hardware
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 80 81 82 83 84 85 86 87 88 89
> node 0 size: 64177 MB
> node 0 free: 60866 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 90 91 92 93 94 95 96 97 98 99
> node 1 size: 64507 MB
> node 1 free: 61167 MB
> node 2 cpus: 20 21 22 23 24 25 26 27 28 29 100 101 102 103 104 105 106 107 108
> 109
> node 2 size: 64507 MB
> node 2 free: 61250 MB
> node 3 cpus: 30 31 32 33 34 35 36 37 38 39 110 111 112 113 114 115 116 117 118
> 119
> node 3 size: 64507 MB
> node 3 free: 61327 MB
> node 4 cpus: 40 41 42 43 44 45 46 47 48 49 120 121 122 123 124 125 126 127 128
> 129
> node 4 size: 64507 MB
> node 4 free: 60993 MB
> node 5 cpus: 50 51 52 53 54 55 56 57 58 59 130 131 132 133 134 135 136 137 138
> 139
> node 5 size: 64507 MB
> node 5 free: 60892 MB
> node 6 cpus: 60 61 62 63 64 65 66 67 68 69 140 141 142 143 144 145 146 147 148
> 149
> node 6 size: 64507 MB
> node 6 free: 61139 MB
> node 7 cpus: 70 71 72 73 74 75 76 77 78 79 150 151 152 153 154 155 156 157 158
> 159
> node 7 size: 64480 MB
> node 7 free: 61188 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>  0:  10  12  17  17  19  19  19  19
>  1:  12  10  17  17  19  19  19  19
>  2:  17  17  10  12  19  19  19  19
>  3:  17  17  12  10  19  19  19  19
>  4:  19  19  19  19  10  12  17  17
>  5:  19  19  19  19  12  10  17  17
>  6:  19  19  19  19  17  17  10  12
>  7:  19  19  19  19  17  17  12  10
>
>
>
> --

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 14:34           ` Valentin Schneider
  2019-10-08 15:30             ` Vincent Guittot
@ 2019-10-08 16:33             ` Peter Zijlstra
  2019-10-08 16:39               ` Valentin Schneider
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2019-10-08 16:33 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Tue, Oct 08, 2019 at 03:34:04PM +0100, Valentin Schneider wrote:
> On 08/10/2019 15:16, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
> > 
> >> Yeah, right shift on signed negative values are implementation defined.
> > 
> > Seriously? Even under -fno-strict-overflow? There is a perfectly
> > sensible operation for signed shift right, this stuff should not be
> > undefined.
> > 
> 
> Mmm good point. I didn't see anything relevant in the description of that
> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:
> 
> """
> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
> the resulting value is implementation-defined.
> """
> 
> Arithmetic shift would make sense, but I think this stems from twos'
> complement not being imposed: 6.2.6.2.2 says sign can be done with
> sign + magnitude, twos complement or ones' complement...

But -fno-strict-overflow mandates 2s complement for all such signed
issues.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 16:33             ` Peter Zijlstra
@ 2019-10-08 16:39               ` Valentin Schneider
  2019-10-08 17:36                 ` Valentin Schneider
  0 siblings, 1 reply; 57+ messages in thread
From: Valentin Schneider @ 2019-10-08 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On 08/10/2019 17:33, Peter Zijlstra wrote:
> On Tue, Oct 08, 2019 at 03:34:04PM +0100, Valentin Schneider wrote:
>> On 08/10/2019 15:16, Peter Zijlstra wrote:
>>> On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
>>>
>>>> Yeah, right shift on signed negative values are implementation defined.
>>>
>>> Seriously? Even under -fno-strict-overflow? There is a perfectly
>>> sensible operation for signed shift right, this stuff should not be
>>> undefined.
>>>
>>
>> Mmm good point. I didn't see anything relevant in the description of that
>> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:
>>
>> """
>> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
>> the resulting value is implementation-defined.
>> """
>>
>> Arithmetic shift would make sense, but I think this stems from twos'
>> complement not being imposed: 6.2.6.2.2 says sign can be done with
>> sign + magnitude, twos complement or ones' complement...
> 
> But -fno-strict-overflow mandates 2s complement for all such signed
> issues.
> 

So then there really shouldn't be any ambiguity. I have no idea if
-fno-strict-overflow then also lifts the undefinedness of the right shifts,
gotta get my spade and dig some more.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 16:39               ` Valentin Schneider
@ 2019-10-08 17:36                 ` Valentin Schneider
  0 siblings, 0 replies; 57+ messages in thread
From: Valentin Schneider @ 2019-10-08 17:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On 08/10/2019 17:39, Valentin Schneider wrote:
>>
>> But -fno-strict-overflow mandates 2s complement for all such signed
>> issues.
>>
> 
> So then there really shouldn't be any ambiguity. I have no idea if
> -fno-strict-overflow then also lifts the undefinedness of the right shifts,
> gotta get my spade and dig some more.
> 

Bleh, no luck really.

Thinking about it some more you can't overflow by right shifting
(logical/arithmetic), so I dunno if signed right shift counts as "such
signed issues".

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 15:30             ` Vincent Guittot
  2019-10-08 15:48               ` Valentin Schneider
@ 2019-10-08 17:39               ` Peter Zijlstra
  2019-10-08 18:45                 ` Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2019-10-08 17:39 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Valentin Schneider, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Tue, Oct 08, 2019 at 05:30:02PM +0200, Vincent Guittot wrote:

> This is how I plan to get ride of the problem:
> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +			unsigned int nr_diff = busiest->sum_h_nr_running;
> +			/*
> +			 * When prefer sibling, evenly spread running tasks on
> +			 * groups.
> +			 */
> +			env->migration_type = migrate_task;
> +			lsub_positive(&nr_diff, local->sum_h_nr_running);
> +			env->imbalance = nr_diff >> 1;
> +			return;
> +		}

I'm thinking the max_t(long, 0, ...); variant reads a lot simpler and
really _should_ work given that -fno-strict-overflow / -fwrapv mandates
2s complement.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
                     ` (3 preceding siblings ...)
  2019-10-01 17:47   ` Valentin Schneider
@ 2019-10-08 17:55   ` Peter Zijlstra
  2019-10-08 18:47     ` Vincent Guittot
  2019-10-16  7:21   ` Parth Shah
  5 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2019-10-08 17:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton

On Thu, Sep 19, 2019 at 09:33:35AM +0200, Vincent Guittot wrote:
> +	if (busiest->group_type == group_asym_packing) {
> +		/*
> +		 * In case of asym capacity, we will try to migrate all load to
> +		 * the preferred CPU.
> +		 */
> +		env->balance_type = migrate_load;
>  		env->imbalance = busiest->group_load;
>  		return;
>  	}

I was a bit surprised with this; I sorta expected a migrate_task,1 here.

The asym_packing thing has always been a nr_running issue to me. If
there is something to run, we should run on as many siblings/cores as
possible, but preferably the 'highest' ranked siblings/cores.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 17:39               ` Peter Zijlstra
@ 2019-10-08 18:45                 ` Vincent Guittot
  0 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-10-08 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, linux-kernel, Ingo Molnar, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Tue, 8 Oct 2019 at 19:39, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Oct 08, 2019 at 05:30:02PM +0200, Vincent Guittot wrote:
>
> > This is how I plan to get ride of the problem:
> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > +                     unsigned int nr_diff = busiest->sum_h_nr_running;
> > +                     /*
> > +                      * When prefer sibling, evenly spread running tasks on
> > +                      * groups.
> > +                      */
> > +                     env->migration_type = migrate_task;
> > +                     lsub_positive(&nr_diff, local->sum_h_nr_running);
> > +                     env->imbalance = nr_diff >> 1;
> > +                     return;
> > +             }
>
> I'm thinking the max_t(long, 0, ...); variant reads a lot simpler and
> really _should_ work given that -fno-strict-overflow / -fwrapv mandates
> 2s complement.

Another point that I had overlooked is that sum_h_nr_running is an
unsigned int whereas imbalance is a long.

In fact, (long)(unsigned long A - unsigned long B) >> 1 works correctly,
but (long)(unsigned int A - unsigned int B) >> 1 doesn't.
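
A quick stand-alone demonstration of that point on an LP64 system:

#include <stdio.h>

int main(void)
{
	unsigned long a_ul = 2, b_ul = 4;
	unsigned int  a_ui = 2, b_ui = 4;

	/* The 64-bit wrap reinterpreted as long is -2, so >> 1 gives -1. */
	long ok  = (long)(a_ul - b_ul) >> 1;

	/*
	 * The 32-bit wrap gives 4294967294, which is zero-extended to
	 * long before the shift, so the sign is lost and the result is
	 * 2147483647 instead of -1.
	 */
	long bad = (long)(a_ui - b_ui) >> 1;

	printf("%ld %ld\n", ok, bad);	/* prints: -1 2147483647 */

	return 0;
}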

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-08 17:55   ` Peter Zijlstra
@ 2019-10-08 18:47     ` Vincent Guittot
  0 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-10-08 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Tue, 8 Oct 2019 at 19:55, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 19, 2019 at 09:33:35AM +0200, Vincent Guittot wrote:
> > +     if (busiest->group_type == group_asym_packing) {
> > +             /*
> > +              * In case of asym capacity, we will try to migrate all load to
> > +              * the preferred CPU.
> > +              */
> > +             env->balance_type = migrate_load;
> >               env->imbalance = busiest->group_load;
> >               return;
> >       }
>
> I was a bit surprised with this; I sorta expected a migrate_task,1 here.

I have just kept the current mechanism

>
> The asym_packing thing has always been a nr_running issue to me. If
> there is something to run, we should run on as many siblings/cores as
> possible, but preferably the 'highest' ranked siblings/cores.

I can probably set the number of running tasks to migrate instead of the load
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance
  2019-10-08 15:53   ` Vincent Guittot
@ 2019-10-09 19:33     ` Phil Auld
  2019-10-10  8:20       ` Vincent Guittot
  0 siblings, 1 reply; 57+ messages in thread
From: Phil Auld @ 2019-10-09 19:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Tue, Oct 08, 2019 at 05:53:11PM +0200 Vincent Guittot wrote:
> Hi Phil,
>  

...

> While preparing v4, I have noticed that I have probably oversimplified
> the end of find_idlest_group() in patch "sched/fair: optimize
> find_idlest_group" when it compares local vs the idlest other group.
> Especially, there were a NUMA specific test that I removed in v3 and
> re added in v4.
> 
> Then, I'm also preparing a full rework that find_idlest_group() which
> will behave more closely to load_balance; I mean : collect statistics,
> classify group then selects the idlest
> 

Okay, I'll watch for V4 and retest. It really seems to be limited to
the 8-node system. None of the other systems are showing this.


> What is the behavior of lu.C Thread ? are they waking up a lot  ?and
> could trigger the slow wake path ?

Yes, probably a fair bit of waking. It's an iterative equation-solving
code. It's fairly CPU intensive but requires communication for dependent
calculations. That's part of why having them mis-balanced causes such a
large slowdown. I think at times everyone else waits for the slow guys.
 

> 
> >
> > The initial fixes I made for this issue did not exhibit this behavior. They
> > would have had other issues dealing with overload cases though. In this case
> > however there are only 154 or 158 threads on 160 CPUs so not overloaded.
> >
> > I'll try to get my hands on this system and poke into it. I just wanted to get
> > your thoughts and let you know where we are.
> 
> Thanks for testing
>

Sure!


Thanks,
Phil



 
> Vincent
> 
> >
> >
> >
> > Thanks,
> > Phil
> >
> >
> > System details:
> >
> > Architecture:        x86_64
> > CPU op-mode(s):      32-bit, 64-bit
> > Byte Order:          Little Endian
> > CPU(s):              160
> > On-line CPU(s) list: 0-159
> > Thread(s) per core:  2
> > Core(s) per socket:  10
> > Socket(s):           8
> > NUMA node(s):        8
> > Vendor ID:           GenuineIntel
> > CPU family:          6
> > Model:               47
> > Model name:          Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
> > Stepping:            2
> > CPU MHz:             1063.934
> > BogoMIPS:            4787.73
> > Virtualization:      VT-x
> > L1d cache:           32K
> > L1i cache:           32K
> > L2 cache:            256K
> > L3 cache:            30720K
> > NUMA node0 CPU(s):   0-9,80-89
> > NUMA node1 CPU(s):   10-19,90-99
> > NUMA node2 CPU(s):   20-29,100-109
> > NUMA node3 CPU(s):   30-39,110-119
> > NUMA node4 CPU(s):   40-49,120-129
> > NUMA node5 CPU(s):   50-59,130-139
> > NUMA node6 CPU(s):   60-69,140-149
> > NUMA node7 CPU(s):   70-79,150-159
> > Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
> >
> > $ numactl --hardware
> > available: 8 nodes (0-7)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 80 81 82 83 84 85 86 87 88 89
> > node 0 size: 64177 MB
> > node 0 free: 60866 MB
> > node 1 cpus: 10 11 12 13 14 15 16 17 18 19 90 91 92 93 94 95 96 97 98 99
> > node 1 size: 64507 MB
> > node 1 free: 61167 MB
> > node 2 cpus: 20 21 22 23 24 25 26 27 28 29 100 101 102 103 104 105 106 107 108
> > 109
> > node 2 size: 64507 MB
> > node 2 free: 61250 MB
> > node 3 cpus: 30 31 32 33 34 35 36 37 38 39 110 111 112 113 114 115 116 117 118
> > 119
> > node 3 size: 64507 MB
> > node 3 free: 61327 MB
> > node 4 cpus: 40 41 42 43 44 45 46 47 48 49 120 121 122 123 124 125 126 127 128
> > 129
> > node 4 size: 64507 MB
> > node 4 free: 60993 MB
> > node 5 cpus: 50 51 52 53 54 55 56 57 58 59 130 131 132 133 134 135 136 137 138
> > 139
> > node 5 size: 64507 MB
> > node 5 free: 60892 MB
> > node 6 cpus: 60 61 62 63 64 65 66 67 68 69 140 141 142 143 144 145 146 147 148
> > 149
> > node 6 size: 64507 MB
> > node 6 free: 61139 MB
> > node 7 cpus: 70 71 72 73 74 75 76 77 78 79 150 151 152 153 154 155 156 157 158
> > 159
> > node 7 size: 64480 MB
> > node 7 free: 61188 MB
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >  0:  10  12  17  17  19  19  19  19
> >  1:  12  10  17  17  19  19  19  19
> >  2:  17  17  10  12  19  19  19  19
> >  3:  17  17  12  10  19  19  19  19
> >  4:  19  19  19  19  10  12  17  17
> >  5:  19  19  19  19  12  10  17  17
> >  6:  19  19  19  19  17  17  10  12
> >  7:  19  19  19  19  17  17  12  10
> >
> >
> >
> > --

-- 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance
  2019-10-09 19:33     ` Phil Auld
@ 2019-10-10  8:20       ` Vincent Guittot
  0 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-10-10  8:20 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton

On Wed, 9 Oct 2019 at 21:33, Phil Auld <pauld@redhat.com> wrote:
>
> On Tue, Oct 08, 2019 at 05:53:11PM +0200 Vincent Guittot wrote:
> > Hi Phil,
> >
>
> ...
>
> > While preparing v4, I have noticed that I have probably oversimplified
> > the end of find_idlest_group() in patch "sched/fair: optimize
> > find_idlest_group" when it compares local vs the idlest other group.
> > Especially, there were a NUMA specific test that I removed in v3 and
> > re added in v4.
> >
> > Then, I'm also preparing a full rework that find_idlest_group() which
> > will behave more closely to load_balance; I mean : collect statistics,
> > classify group then selects the idlest
> >
>
> Okay, I'll watch for V4 and restest. It really seems to be limited to
> the 8-node system. None of the other systems are showing this.

Thanks for the information. This system might generate statistics right
at the boundary between 2 behaviors, and task placement might misbehave.

>
>
> > What is the behavior of lu.C Thread ? are they waking up a lot  ?and
> > could trigger the slow wake path ?
>
> Yes, probably a fair bit of waking. It's an interative equation solving
> code. It's fairly CPU intensive but requires communication for dependent
> calculations.  That's part of why having them mis-balanced causes such a
> large slow down. I think at times everyone else waits for the slow guys.
>
>
> >
> > >
> > > The initial fixes I made for this issue did not exhibit this behavior. They
> > > would have had other issues dealing with overload cases though. In this case
> > > however there are only 154 or 158 threads on 160 CPUs so not overloaded.
> > >
> > > I'll try to get my hands on this system and poke into it. I just wanted to get
> > > your thoughts and let you know where we are.
> >
> > Thanks for testing
> >
>
> Sure!
>
>
> Thanks,
> Phil
>
>
>
>
> > Vincent
> >
> > >
> > >
> > >
> > > Thanks,
> > > Phil
> > >
> > >
> > > System details:
> > >
> > > Architecture:        x86_64
> > > CPU op-mode(s):      32-bit, 64-bit
> > > Byte Order:          Little Endian
> > > CPU(s):              160
> > > On-line CPU(s) list: 0-159
> > > Thread(s) per core:  2
> > > Core(s) per socket:  10
> > > Socket(s):           8
> > > NUMA node(s):        8
> > > Vendor ID:           GenuineIntel
> > > CPU family:          6
> > > Model:               47
> > > Model name:          Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
> > > Stepping:            2
> > > CPU MHz:             1063.934
> > > BogoMIPS:            4787.73
> > > Virtualization:      VT-x
> > > L1d cache:           32K
> > > L1i cache:           32K
> > > L2 cache:            256K
> > > L3 cache:            30720K
> > > NUMA node0 CPU(s):   0-9,80-89
> > > NUMA node1 CPU(s):   10-19,90-99
> > > NUMA node2 CPU(s):   20-29,100-109
> > > NUMA node3 CPU(s):   30-39,110-119
> > > NUMA node4 CPU(s):   40-49,120-129
> > > NUMA node5 CPU(s):   50-59,130-139
> > > NUMA node6 CPU(s):   60-69,140-149
> > > NUMA node7 CPU(s):   70-79,150-159
> > > Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
> > >
> > > $ numactl --hardware
> > > available: 8 nodes (0-7)
> > > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 80 81 82 83 84 85 86 87 88 89
> > > node 0 size: 64177 MB
> > > node 0 free: 60866 MB
> > > node 1 cpus: 10 11 12 13 14 15 16 17 18 19 90 91 92 93 94 95 96 97 98 99
> > > node 1 size: 64507 MB
> > > node 1 free: 61167 MB
> > > node 2 cpus: 20 21 22 23 24 25 26 27 28 29 100 101 102 103 104 105 106 107 108
> > > 109
> > > node 2 size: 64507 MB
> > > node 2 free: 61250 MB
> > > node 3 cpus: 30 31 32 33 34 35 36 37 38 39 110 111 112 113 114 115 116 117 118
> > > 119
> > > node 3 size: 64507 MB
> > > node 3 free: 61327 MB
> > > node 4 cpus: 40 41 42 43 44 45 46 47 48 49 120 121 122 123 124 125 126 127 128
> > > 129
> > > node 4 size: 64507 MB
> > > node 4 free: 60993 MB
> > > node 5 cpus: 50 51 52 53 54 55 56 57 58 59 130 131 132 133 134 135 136 137 138
> > > 139
> > > node 5 size: 64507 MB
> > > node 5 free: 60892 MB
> > > node 6 cpus: 60 61 62 63 64 65 66 67 68 69 140 141 142 143 144 145 146 147 148
> > > 149
> > > node 6 size: 64507 MB
> > > node 6 free: 61139 MB
> > > node 7 cpus: 70 71 72 73 74 75 76 77 78 79 150 151 152 153 154 155 156 157 158
> > > 159
> > > node 7 size: 64480 MB
> > > node 7 free: 61188 MB
> > > node distances:
> > > node   0   1   2   3   4   5   6   7
> > >  0:  10  12  17  17  19  19  19  19
> > >  1:  12  10  17  17  19  19  19  19
> > >  2:  17  17  10  12  19  19  19  19
> > >  3:  17  17  12  10  19  19  19  19
> > >  4:  19  19  19  19  10  12  17  17
> > >  5:  19  19  19  19  12  10  17  17
> > >  6:  19  19  19  19  17  17  10  12
> > >  7:  19  19  19  19  17  17  12  10
> > >
> > >
> > >
> > > --
>
> --

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance
  2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (10 preceding siblings ...)
  2019-10-08 14:32 ` [PATCH v3 0/8] sched/fair: rework the CFS load balance Phil Auld
@ 2019-10-16  7:21 ` Parth Shah
  2019-10-16 11:51   ` Vincent Guittot
  11 siblings, 1 reply; 57+ messages in thread
From: Parth Shah @ 2019-10-16  7:21 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton



On 9/19/19 1:03 PM, Vincent Guittot wrote:
> Several wrong task placement have been raised with the current load
> balance algorithm but their fixes are not always straight forward and
> end up with using biased values to force migrations. A cleanup and rework
> of the load balance will help to handle such UCs and enable to fine grain
> the behavior of the scheduler for other cases.
> 
> Patch 1 has already been sent separately and only consolidate asym policy
> in one place and help the review of the changes in load_balance.
> 
> Patch 2 renames the sum of h_nr_running in stats.
> 
> Patch 3 removes meaningless imbalance computation to make review of
> patch 4 easier.
> 
> Patch 4 reworks load_balance algorithm and fixes some wrong task placement
> but try to stay conservative.
> 
> Patch 5 add the sum of nr_running to monitor non cfs tasks and take that
> into account when pulling tasks.
> 
> Patch 6 replaces runnable_load by load now that the signal is only used
> when overloaded.
> 
> Patch 7 improves the spread of tasks at the 1st scheduling level.
> 
> Patch 8 uses utilization instead of load in all steps of misfit task
> path.
> 
> Patch 9 replaces runnable_load_avg by load_avg in the wake up path.
> 
> Patch 10 optimizes find_idlest_group() that was using both runnable_load
> and load. This has not been squashed with previous patch to ease the
> review.
> 
> Some benchmarks results based on 8 iterations of each tests:
> - small arm64 dual quad cores system
> 
>            tip/sched/core        w/ this patchset    improvement
> schedpipe      54981 +/-0.36%        55459 +/-0.31%   (+0.97%)
> 
> hackbench
> 1 groups       0.906 +/-2.34%        0.906 +/-2.88%   (+0.06%)
> 
> - large arm64 2 nodes / 224 cores system
> 
>            tip/sched/core        w/ this patchset    improvement
> schedpipe     125323 +/-0.98%       125624 +/-0.71%   (+0.24%)
> 
> hackbench -l (256000/#grp) -g #grp
> 1 groups      15.360 +/-1.76%       14.206 +/-1.40%   (+8.69%)
> 4 groups       5.822 +/-1.02%        5.508 +/-6.45%   (+5.38%)
> 16 groups      3.103 +/-0.80%        3.244 +/-0.77%   (-4.52%)
> 32 groups      2.892 +/-1.23%        2.850 +/-1.81%   (+1.47%)
> 64 groups      2.825 +/-1.51%        2.725 +/-1.51%   (+3.54%)
> 128 groups     3.149 +/-8.46%        3.053 +/-13.15%  (+3.06%)
> 256 groups     3.511 +/-8.49%        3.019 +/-1.71%  (+14.03%)
> 
> dbench
> 1 groups     329.677 +/-0.46%      329.771 +/-0.11%   (+0.03%)
> 4 groups     931.499 +/-0.79%      947.118 +/-0.94%   (+1.68%)
> 16 groups   1924.210 +/-0.89%     1947.849 +/-0.76%   (+1.23%)
> 32 groups   2350.646 +/-5.75%     2351.549 +/-6.33%   (+0.04%)
> 64 groups   2201.524 +/-3.35%     2192.749 +/-5.84%   (-0.40%)
> 128 groups  2206.858 +/-2.50%     2376.265 +/-7.44%   (+7.68%)
> 256 groups  1263.520 +/-3.34%     1633.143 +/-13.02% (+29.25%)
> 
> tip/sched/core sha1:
>   0413d7f33e60 ('sched/uclamp: Always use 'enum uclamp_id' for clamp_id values')
> [...]

I am quietly impressed with this patch series as it makes it easy to
understand the behavior of the load balancer just by looking at the code.

I have tested v3 on IBM POWER9 system with following configuration:
- CPU(s):              176
- Thread(s) per core:  4
- Core(s) per socket:  22
- Socket(s):           2
- Model name:          POWER9, altivec supported
- NUMA node0 CPU(s):   0-87
- NUMA node8 CPU(s):   88-175

I see results on par with the baseline (tip/sched/core) in most of my
testing.

hackbench
=========
hackbench -l (256000/#grp) -g #grp (lower is better):
+--------+--------------------+-------------------+------------------+
| groups |     w/ patches     |     Baseline      | Performance gain |
+--------+--------------------+-------------------+------------------+
|      1 | 14.948 (+/- 0.10)  | 15.13 (+/- 0.47 ) | +1.20            |
|      4 | 5.938 (+/- 0.034)  | 6.085 (+/- 0.07)  | +2.4             |
|      8 | 6.594 (+/- 0.072)  | 6.223 (+/- 0.03)  | -5.9             |
|     16 | 5.916 (+/- 0.05)   | 5.559 (+/- 0.00)  | -6.4             |
|     32 | 5.288 (+/- 0.034)  | 5.23 (+/- 0.01)   | -1.1             |
|     64 | 5.147 (+/- 0.036)  | 5.193 (+/- 0.09)  | +0.8             |
|    128 | 5.368 (+/- 0.0245) | 5.446 (+/- 0.04)  | +1.4             |
|    256 | 5.637 (+/- 0.088)  | 5.596 (+/- 0.07)  | -0.7             |
|    512 | 5.78 (+/- 0.0637)  | 5.934 (+/- 0.06)  | +2.5             |
+--------+--------------------+-------------------+------------------+


dbench
========
dbench <grp> (Throughput: Higher is better):
+---------+---------------------+-----------------------+----------+
| groups  |     w/ patches      |        baseline       |    gain  |
+---------+---------------------+-----------------------+----------+
|       1 |    12.6419(+/-0.58) |    12.6511 (+/-0.277) |   -0.00  |
|       4 |    23.7712(+/-2.22) |    21.8526 (+/-0.844) |   +8.7   |
|       8 |    40.1333(+/-0.85) |    37.0623 (+/-3.283) |   +8.2   |
|      16 |   60.5529(+/-2.35)  |    60.0972 (+/-9.655) |   +0.7   |
|      32 |   98.2194(+/-1.69)  |    87.6701 (+/-10.72) |   +12.0  |
|      64 |   150.733(+/-9.91)  |    109.782 (+/-0.503) |   +37.3  |
|     128 |  173.443(+/-22.4)   |    130.006 (+/-21.84) |   +33.4  |
|     256 |  121.011(+/-15.2)   |    120.603 (+/-11.82) |   +0.3   |
|     512 |  10.9889(+/-0.39)   |    12.5518 (+/-1.030) |   -12    |
+---------+---------------------+-----------------------+----------+



I am happy with the results as they turn out to be beneficial in most
cases. Still, I will be doing testing for different scenarios and
workloads. BTW, do you have any specific test case which might show
different behavior for SMT-4/8 systems with this patch set?


Thanks,
Parth


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
                     ` (4 preceding siblings ...)
  2019-10-08 17:55   ` Peter Zijlstra
@ 2019-10-16  7:21   ` Parth Shah
  2019-10-16 11:56     ` Vincent Guittot
  5 siblings, 1 reply; 57+ messages in thread
From: Parth Shah @ 2019-10-16  7:21 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton



On 9/19/19 1:03 PM, Vincent Guittot wrote:

[...]

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 380 insertions(+), 205 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 017aad0..d33379c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
> 
>  enum fbq_type { regular, remote, all };
> 
> +/*
> + * group_type describes the group of CPUs at the moment of the load balance.
> + * The enum is ordered by pulling priority, with the group with lowest priority
> + * first so the groupe_type can be simply compared when selecting the busiest
> + * group. see update_sd_pick_busiest().
> + */
>  enum group_type {
> -	group_other = 0,
> +	group_has_spare = 0,
> +	group_fully_busy,
>  	group_misfit_task,
> +	group_asym_packing,
>  	group_imbalanced,
> -	group_overloaded,
> +	group_overloaded
> +};
> +
> +enum migration_type {
> +	migrate_load = 0,
> +	migrate_util,
> +	migrate_task,
> +	migrate_misfit
>  };
> 
>  #define LBF_ALL_PINNED	0x01
> @@ -7115,7 +7130,7 @@ struct lb_env {
>  	unsigned int		loop_max;
> 
>  	enum fbq_type		fbq_type;
> -	enum group_type		src_grp_type;
> +	enum migration_type	balance_type;
>  	struct list_head	tasks;
>  };
> 
> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>  {
>  	struct list_head *tasks = &env->src_rq->cfs_tasks;
>  	struct task_struct *p;
> -	unsigned long load;
> +	unsigned long util, load;
>  	int detached = 0;
> 
>  	lockdep_assert_held(&env->src_rq->lock);
> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
>  		if (!can_migrate_task(p, env))
>  			goto next;
> 
> -		load = task_h_load(p);
> +		switch (env->balance_type) {
> +		case migrate_load:
> +			load = task_h_load(p);
> 
> -		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> -			goto next;
> +			if (sched_feat(LB_MIN) &&
> +			    load < 16 && !env->sd->nr_balance_failed)
> +				goto next;
> 
> -		if ((load / 2) > env->imbalance)
> -			goto next;
> +			if ((load / 2) > env->imbalance)
> +				goto next;
> +
> +			env->imbalance -= load;
> +			break;
> +
> +		case migrate_util:
> +			util = task_util_est(p);
> +
> +			if (util > env->imbalance)

Can you please explain what would happen for
`if (util/2 > env->imbalance)`?
Just like when migrating load, shouldn't util not be migrated either if
env->imbalance is close to the utilization of the task being moved?

> +				goto next;
> +
> +			env->imbalance -= util;
> +			break;
> +[ ... ]
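
As an aside on the comment above enum group_type: since the values are
ordered by pulling priority, picking the busiest group largely reduces to an
integer comparison. A minimal stand-alone sketch of that idea (hypothetical
group data, none of the tie-breaking the real update_sd_pick_busiest() does):

#include <stdio.h>

enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded
};

struct sg_stats {
	const char *name;
	enum group_type type;
};

/* The group with the largest (highest-priority) group_type wins. */
static struct sg_stats *pick_busiest(struct sg_stats *groups, int nr)
{
	struct sg_stats *busiest = &groups[0];
	int i;

	for (i = 1; i < nr; i++) {
		if (groups[i].type > busiest->type)
			busiest = &groups[i];
	}
	return busiest;
}

int main(void)
{
	struct sg_stats groups[] = {
		{ "sg0", group_has_spare },
		{ "sg1", group_overloaded },
		{ "sg2", group_fully_busy },
	};

	/* Prints "busiest: sg1". */
	printf("busiest: %s\n", pick_busiest(groups, 3)->name);
	return 0;
}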

Thanks,
Parth


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance
  2019-10-16  7:21 ` Parth Shah
@ 2019-10-16 11:51   ` Vincent Guittot
  0 siblings, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2019-10-16 11:51 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Hillf Danton

Hi Parth,

On Wed, 16 Oct 2019 at 09:21, Parth Shah <parth@linux.ibm.com> wrote:
>
>
>
> On 9/19/19 1:03 PM, Vincent Guittot wrote:
> > Several wrong task placement have been raised with the current load
> > balance algorithm but their fixes are not always straight forward and
> > end up with using biased values to force migrations. A cleanup and rework
> > of the load balance will help to handle such UCs and enable to fine grain
> > the behavior of the scheduler for other cases.
> >
> > Patch 1 has already been sent separately and only consolidate asym policy
> > in one place and help the review of the changes in load_balance.
> >
> > Patch 2 renames the sum of h_nr_running in stats.
> >
> > Patch 3 removes meaningless imbalance computation to make review of
> > patch 4 easier.
> >
> > Patch 4 reworks load_balance algorithm and fixes some wrong task placement
> > but try to stay conservative.
> >
> > Patch 5 add the sum of nr_running to monitor non cfs tasks and take that
> > into account when pulling tasks.
> >
> > Patch 6 replaces runnable_load by load now that the signal is only used
> > when overloaded.
> >
> > Patch 7 improves the spread of tasks at the 1st scheduling level.
> >
> > Patch 8 uses utilization instead of load in all steps of misfit task
> > path.
> >
> > Patch 9 replaces runnable_load_avg by load_avg in the wake up path.
> >
> > Patch 10 optimizes find_idlest_group() that was using both runnable_load
> > and load. This has not been squashed with previous patch to ease the
> > review.
> >
> > Some benchmarks results based on 8 iterations of each tests:
> > - small arm64 dual quad cores system
> >
> >            tip/sched/core        w/ this patchset    improvement
> > schedpipe      54981 +/-0.36%        55459 +/-0.31%   (+0.97%)
> >
> > hackbench
> > 1 groups       0.906 +/-2.34%        0.906 +/-2.88%   (+0.06%)
> >
> > - large arm64 2 nodes / 224 cores system
> >
> >            tip/sched/core        w/ this patchset    improvement
> > schedpipe     125323 +/-0.98%       125624 +/-0.71%   (+0.24%)
> >
> > hackbench -l (256000/#grp) -g #grp
> > 1 groups      15.360 +/-1.76%       14.206 +/-1.40%   (+8.69%)
> > 4 groups       5.822 +/-1.02%        5.508 +/-6.45%   (+5.38%)
> > 16 groups      3.103 +/-0.80%        3.244 +/-0.77%   (-4.52%)
> > 32 groups      2.892 +/-1.23%        2.850 +/-1.81%   (+1.47%)
> > 64 groups      2.825 +/-1.51%        2.725 +/-1.51%   (+3.54%)
> > 128 groups     3.149 +/-8.46%        3.053 +/-13.15%  (+3.06%)
> > 256 groups     3.511 +/-8.49%        3.019 +/-1.71%  (+14.03%)
> >
> > dbench
> > 1 groups     329.677 +/-0.46%      329.771 +/-0.11%   (+0.03%)
> > 4 groups     931.499 +/-0.79%      947.118 +/-0.94%   (+1.68%)
> > 16 groups   1924.210 +/-0.89%     1947.849 +/-0.76%   (+1.23%)
> > 32 groups   2350.646 +/-5.75%     2351.549 +/-6.33%   (+0.04%)
> > 64 groups   2201.524 +/-3.35%     2192.749 +/-5.84%   (-0.40%)
> > 128 groups  2206.858 +/-2.50%     2376.265 +/-7.44%   (+7.68%)
> > 256 groups  1263.520 +/-3.34%     1633.143 +/-13.02% (+29.25%)
> >
> > tip/sched/core sha1:
> >   0413d7f33e60 ('sched/uclamp: Always use 'enum uclamp_id' for clamp_id values')
> > [...]
>
> I am quite impressed with this patch series, as it makes it easy to
> understand the behavior of the load balancer just by looking at the code.

Thanks

>
> I have tested v3 on IBM POWER9 system with following configuration:
> - CPU(s):              176
> - Thread(s) per core:  4
> - Core(s) per socket:  22
> - Socket(s):           2
> - Model name:          POWER9, altivec supported
> - NUMA node0 CPU(s):   0-87
> - NUMA node8 CPU(s):   88-175
>
> I see results on par with the baseline (tip/sched/core) in most of my
> testing.
>
> hackbench
> =========
> hackbench -l (256000/#grp) -g #grp (lower is better):
> +--------+--------------------+-------------------+------------------+
> | groups |     w/ patches     |     Baseline      | Performance gain |
> +--------+--------------------+-------------------+------------------+
> |      1 | 14.948 (+/- 0.10)  | 15.13 (+/- 0.47 ) | +1.20            |
> |      4 | 5.938 (+/- 0.034)  | 6.085 (+/- 0.07)  | +2.4             |
> |      8 | 6.594 (+/- 0.072)  | 6.223 (+/- 0.03)  | -5.9             |
> |     16 | 5.916 (+/- 0.05)   | 5.559 (+/- 0.00)  | -6.4             |
> |     32 | 5.288 (+/- 0.034)  | 5.23 (+/- 0.01)   | -1.1             |
> |     64 | 5.147 (+/- 0.036)  | 5.193 (+/- 0.09)  | +0.8             |
> |    128 | 5.368 (+/- 0.0245) | 5.446 (+/- 0.04)  | +1.4             |
> |    256 | 5.637 (+/- 0.088)  | 5.596 (+/- 0.07)  | -0.7             |
> |    512 | 5.78 (+/- 0.0637)  | 5.934 (+/- 0.06)  | +2.5             |
> +--------+--------------------+-------------------+------------------+
>
>
> dbench
> ========
> dbench <grp> (Throughput: Higher is better):
> +---------+---------------------+-----------------------+----------+
> | groups  |     w/ patches      |        baseline       |    gain  |
> +---------+---------------------+-----------------------+----------+
> |       1 |    12.6419(+/-0.58) |    12.6511 (+/-0.277) |   -0.00  |
> |       4 |    23.7712(+/-2.22) |    21.8526 (+/-0.844) |   +8.7   |
> |       8 |    40.1333(+/-0.85) |    37.0623 (+/-3.283) |   +8.2   |
> |      16 |   60.5529(+/-2.35)  |    60.0972 (+/-9.655) |   +0.7   |
> |      32 |   98.2194(+/-1.69)  |    87.6701 (+/-10.72) |   +12.0  |
> |      64 |   150.733(+/-9.91)  |    109.782 (+/-0.503) |   +37.3  |
> |     128 |  173.443(+/-22.4)   |    130.006 (+/-21.84) |   +33.4  |
> |     256 |  121.011(+/-15.2)   |    120.603 (+/-11.82) |   +0.3   |
> |     512 |  10.9889(+/-0.39)   |    12.5518 (+/-1.030) |   -12    |
> +---------+---------------------+-----------------------+----------+
>
>
>
> I am happy with the results, as it turns out to be beneficial in most cases.
> Still, I will be doing testing for different scenarios and workloads.

Thanks for the test results

> BTW, do you have any specific test case which might show different behavior
> for SMT-4/8 systems with this patch set?

No, I don't have any specific test case in mind, but the patch set should
better spread tasks across cores for mid-loaded use cases with tasks of
different weight, either because of different nice priorities or because of
cgroups. Unlike the current implementation, load_balance doesn't use load in
this case.
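
If it helps, here is a minimal, purely hypothetical user-space sketch of such
a mid-loaded mix (not part of the patch set): it spawns CPU hogs where every
other task runs at a higher nice value, and the per-core spread can then be
watched via /proc/sched_debug or tracing while load balancing happens:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	int nr = argc > 1 ? atoi(argv[1]) : 8;		/* number of CPU hogs */
	int secs = argc > 2 ? atoi(argv[2]) : 30;	/* run time in seconds */
	int i;

	for (i = 0; i < nr; i++) {
		if (fork() == 0) {
			time_t end = time(NULL) + secs;
			volatile unsigned long x = 0;

			/* Every other hog gets a lower weight via nice +10. */
			if (i & 1)
				setpriority(PRIO_PROCESS, 0, 10);

			while (time(NULL) < end)
				x++;
			_exit(0);
		}
	}

	while (wait(NULL) > 0)
		;
	return 0;
}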

Thanks
Vincent

>
>
> Thanks,
> Parth
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-16  7:21   ` Parth Shah
@ 2019-10-16 11:56     ` Vincent Guittot
  2019-10-18  5:34       ` Parth Shah
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2019-10-16 11:56 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Hillf Danton

On Wed, 16 Oct 2019 at 09:21, Parth Shah <parth@linux.ibm.com> wrote:
>
>
>
> On 9/19/19 1:03 PM, Vincent Guittot wrote:
>
> [...]
>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 380 insertions(+), 205 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 017aad0..d33379c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
> >
> >  enum fbq_type { regular, remote, all };
> >
> > +/*
> > + * group_type describes the group of CPUs at the moment of the load balance.
> > + * The enum is ordered by pulling priority, with the group with lowest priority
> > + * first so the group_type can be simply compared when selecting the busiest
> > + * group. see update_sd_pick_busiest().
> > + */
> >  enum group_type {
> > -     group_other = 0,
> > +     group_has_spare = 0,
> > +     group_fully_busy,
> >       group_misfit_task,
> > +     group_asym_packing,
> >       group_imbalanced,
> > -     group_overloaded,
> > +     group_overloaded
> > +};
> > +
> > +enum migration_type {
> > +     migrate_load = 0,
> > +     migrate_util,
> > +     migrate_task,
> > +     migrate_misfit
> >  };
> >
> >  #define LBF_ALL_PINNED       0x01
> > @@ -7115,7 +7130,7 @@ struct lb_env {
> >       unsigned int            loop_max;
> >
> >       enum fbq_type           fbq_type;
> > -     enum group_type         src_grp_type;
> > +     enum migration_type     balance_type;
> >       struct list_head        tasks;
> >  };
> >
> > @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> >  {
> >       struct list_head *tasks = &env->src_rq->cfs_tasks;
> >       struct task_struct *p;
> > -     unsigned long load;
> > +     unsigned long util, load;
> >       int detached = 0;
> >
> >       lockdep_assert_held(&env->src_rq->lock);
> > @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
> >               if (!can_migrate_task(p, env))
> >                       goto next;
> >
> > -             load = task_h_load(p);
> > +             switch (env->balance_type) {
> > +             case migrate_load:
> > +                     load = task_h_load(p);
> >
> > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > -                     goto next;
> > +                     if (sched_feat(LB_MIN) &&
> > +                         load < 16 && !env->sd->nr_balance_failed)
> > +                             goto next;
> >
> > -             if ((load / 2) > env->imbalance)
> > -                     goto next;
> > +                     if ((load / 2) > env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance -= load;
> > +                     break;
> > +
> > +             case migrate_util:
> > +                     util = task_util_est(p);
> > +
> > +                     if (util > env->imbalance)
>
> Can you please explain what would happen for
> `if (util/2 > env->imbalance)`?
> Just like when migrating load, shouldn't util not be migrated either if
> env->imbalance is close to the utilization of the task being moved?

I have chosen util and not util/2 to be conservative, because
migrate_util is used to fill spare capacity.
With `if (util/2 > env->imbalance)`, we could more easily overload the
local group or pick too much utilization from the overloaded group.
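
To illustrate with a toy user-space model of the detach_tasks() loop
(hypothetical numbers, not kernel code): with the `util > imbalance` check the
pulled utilization can never exceed the spare capacity we started from,
whereas with `util/2 > imbalance` a single pull can overshoot it:

#include <stdio.h>

/* Hypothetical utilizations of the tasks on the busiest runqueue. */
static const long tasks[] = { 300, 250, 200 };

static long pull(long spare, int halve)
{
	long imbalance = spare, pulled = 0;
	unsigned int i;

	for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
		long util = tasks[i];

		/* Mimic the migrate_util threshold in detach_tasks(). */
		if ((halve ? util / 2 : util) > imbalance)
			continue;

		imbalance -= util;
		pulled += util;
	}
	return pulled;
}

int main(void)
{
	long spare = 400;	/* spare capacity of the local group */

	/* Prints 300 of 400 vs 500 of 400: util/2 overshoots the spare room. */
	printf("util   check: pulled %ld of %ld\n", pull(spare, 0), spare);
	printf("util/2 check: pulled %ld of %ld\n", pull(spare, 1), spare);
	return 0;
}

The real detach_tasks() of course works on the lb_env and stops once the
imbalance is consumed; the toy only shows why the conservative threshold
avoids over-filling the local group.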

>
> > +                             goto next;
> > +
> > +                     env->imbalance -= util;
> > +                     break;
> > +[ ... ]
>
> Thanks,
> Parth
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 04/10] sched/fair: rework load_balance
  2019-10-16 11:56     ` Vincent Guittot
@ 2019-10-18  5:34       ` Parth Shah
  0 siblings, 0 replies; 57+ messages in thread
From: Parth Shah @ 2019-10-18  5:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Hillf Danton



On 10/16/19 5:26 PM, Vincent Guittot wrote:
> On Wed, 16 Oct 2019 at 09:21, Parth Shah <parth@linux.ibm.com> wrote:
>>
>>
>>
>> On 9/19/19 1:03 PM, Vincent Guittot wrote:
>>
>> [...]
>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>  kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
>>>  1 file changed, 380 insertions(+), 205 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 017aad0..d33379c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>>>
>>>  enum fbq_type { regular, remote, all };
>>>
>>> +/*
>>> + * group_type describes the group of CPUs at the moment of the load balance.
>>> + * The enum is ordered by pulling priority, with the group with lowest priority
>>> + * first so the group_type can be simply compared when selecting the busiest
>>> + * group. see update_sd_pick_busiest().
>>> + */
>>>  enum group_type {
>>> -     group_other = 0,
>>> +     group_has_spare = 0,
>>> +     group_fully_busy,
>>>       group_misfit_task,
>>> +     group_asym_packing,
>>>       group_imbalanced,
>>> -     group_overloaded,
>>> +     group_overloaded
>>> +};
>>> +
>>> +enum migration_type {
>>> +     migrate_load = 0,
>>> +     migrate_util,
>>> +     migrate_task,
>>> +     migrate_misfit
>>>  };
>>>
>>>  #define LBF_ALL_PINNED       0x01
>>> @@ -7115,7 +7130,7 @@ struct lb_env {
>>>       unsigned int            loop_max;
>>>
>>>       enum fbq_type           fbq_type;
>>> -     enum group_type         src_grp_type;
>>> +     enum migration_type     balance_type;
>>>       struct list_head        tasks;
>>>  };
>>>
>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>>  {
>>>       struct list_head *tasks = &env->src_rq->cfs_tasks;
>>>       struct task_struct *p;
>>> -     unsigned long load;
>>> +     unsigned long util, load;
>>>       int detached = 0;
>>>
>>>       lockdep_assert_held(&env->src_rq->lock);
>>> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
>>>               if (!can_migrate_task(p, env))
>>>                       goto next;
>>>
>>> -             load = task_h_load(p);
>>> +             switch (env->balance_type) {
>>> +             case migrate_load:
>>> +                     load = task_h_load(p);
>>>
>>> -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
>>> -                     goto next;
>>> +                     if (sched_feat(LB_MIN) &&
>>> +                         load < 16 && !env->sd->nr_balance_failed)
>>> +                             goto next;
>>>
>>> -             if ((load / 2) > env->imbalance)
>>> -                     goto next;
>>> +                     if ((load / 2) > env->imbalance)
>>> +                             goto next;
>>> +
>>> +                     env->imbalance -= load;
>>> +                     break;
>>> +
>>> +             case migrate_util:
>>> +                     util = task_util_est(p);
>>> +
>>> +                     if (util > env->imbalance)
>>
>> Can you please explain what would happen for
>> `if (util/2 > env->imbalance)`?
>> Just like when migrating load, shouldn't util not be migrated either if
>> env->imbalance is close to the utilization of the task being moved?
> 
> I have chosen util and not util/2 to be conservative, because
> migrate_util is used to fill spare capacity.
> With `if (util/2 > env->imbalance)`, we could more easily overload the
> local group or pick too much utilization from the overloaded group.
> 

Fair enough. I missed the point that, unlike migrate_load, with
migrate_util env->imbalance is just the spare capacity of the local group.

Thanks,
Parth

>>
>>> +                             goto next;
>>> +
>>> +                     env->imbalance -= util;
>>> +                     break;
>>> +[ ... ]
>>
>> Thanks,
>> Parth
>>


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2019-10-18  5:34 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 01/10] sched/fair: clean up asym packing Vincent Guittot
2019-09-27 23:57   ` Rik van Riel
2019-09-19  7:33 ` [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
2019-09-27 23:59   ` Rik van Riel
2019-10-01 17:11   ` Valentin Schneider
2019-09-19  7:33 ` [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation Vincent Guittot
2019-09-28  0:05   ` Rik van Riel
2019-10-01 17:12   ` Valentin Schneider
2019-10-02  6:28     ` Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
2019-09-30  1:12   ` Rik van Riel
2019-09-30  7:44     ` Vincent Guittot
2019-09-30 16:24   ` Dietmar Eggemann
2019-10-01  8:14     ` Vincent Guittot
2019-10-01 16:52       ` Dietmar Eggemann
2019-10-02  6:44         ` Vincent Guittot
2019-10-02  9:21           ` Dietmar Eggemann
2019-10-08 13:02             ` Peter Zijlstra
2019-10-02  8:23         ` Vincent Guittot
2019-10-02  9:24           ` Dietmar Eggemann
2019-10-01  8:15   ` Dietmar Eggemann
2019-10-01  9:14     ` Vincent Guittot
2019-10-01 16:57       ` Dietmar Eggemann
2019-10-01 17:47   ` Valentin Schneider
2019-10-02  8:30     ` Vincent Guittot
2019-10-02 10:47       ` Valentin Schneider
2019-10-08 14:16         ` Peter Zijlstra
2019-10-08 14:34           ` Valentin Schneider
2019-10-08 15:30             ` Vincent Guittot
2019-10-08 15:48               ` Valentin Schneider
2019-10-08 17:39               ` Peter Zijlstra
2019-10-08 18:45                 ` Vincent Guittot
2019-10-08 16:33             ` Peter Zijlstra
2019-10-08 16:39               ` Valentin Schneider
2019-10-08 17:36                 ` Valentin Schneider
2019-10-08 17:55   ` Peter Zijlstra
2019-10-08 18:47     ` Vincent Guittot
2019-10-16  7:21   ` Parth Shah
2019-10-16 11:56     ` Vincent Guittot
2019-10-18  5:34       ` Parth Shah
2019-09-19  7:33 ` [PATCH v3 05/10] sched/fair: use rq->nr_running when balancing load Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 06/10] sched/fair: use load instead of runnable load in load_balance Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 07/10] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 08/10] sched/fair: use utilization to select misfit task Vincent Guittot
2019-10-01 17:12   ` Valentin Schneider
2019-09-19  7:33 ` [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path Vincent Guittot
2019-10-07 15:14   ` Rik van Riel
2019-10-07 15:27     ` Vincent Guittot
2019-10-07 18:06       ` Rik van Riel
2019-09-19  7:33 ` [PATCH v3 10/10] sched/fair: optimize find_idlest_group Vincent Guittot
2019-10-08 14:32 ` [PATCH v3 0/8] sched/fair: rework the CFS load balance Phil Auld
2019-10-08 15:53   ` Vincent Guittot
2019-10-09 19:33     ` Phil Auld
2019-10-10  8:20       ` Vincent Guittot
2019-10-16  7:21 ` Parth Shah
2019-10-16 11:51   ` Vincent Guittot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).