* [PATCH v2 0/8] sched/fair: rework the CFS load balance
@ 2019-08-01 14:40 Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 1/8] sched/fair: clean up asym packing Vincent Guittot
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Several wrong task placements have been reported with the current load
balance algorithm, but their fixes are not always straightforward and
end up using biased values to force migrations. A cleanup and rework of
the load balance will help handle such use cases and enable fine-grained
tuning of the scheduler behavior for other cases.

Patch 1 has already been sent separately; it only consolidates the asym
policy in one place and helps the review of the changes in load_balance.

Patch 2 renames sum_nr_running to sum_h_nr_running in the load balance
stats.

Patch 3 removes meaningless imbalance computation to make the review of
patch 4 easier.

Patch 4 reworks the load_balance algorithm and fixes some wrong task
placements while trying to stay conservative.

Patch 5 adds the sum of nr_running to monitor non-CFS tasks and takes
that into account when pulling tasks.

Patch 6 replaces runnable_load with load now that the metric is only
used when a group is overloaded.

Patch 7 improves the spread of tasks at the 1st scheduling level.

Patch 8 uses utilization instead of load in all steps of misfit task
path.

Some benchmark results based on 8 iterations of each test:
- small arm64 dual quad-core system

           tip/sched/core        w/ this patchset    improvement
schedpipe      54981 +/-0.36%        55459 +/-0.31%   (+0.97%)

hackbench
1 groups       0.906 +/-2.34%        0.906 +/-2.88%   (+0.06%)

- large arm64 2-node / 224-core system

           tip/sched/core        w/ this patchset    improvement
schedpipe     125665 +/-0.61%       125455 +/-0.62%   (-0.17%)

hackbench -l (256000/#grp) -g #grp
1 groups      15.263 +/-3.53%       13.776 +/-3.30%   (+9.74%)
4 groups       5.852 +/-0.57%        5.340 +/-8.03%   (+8.75%)
16 groups      3.097 +/-1.08%        3.246 +/-0.97%   (-4.81%)
32 groups      2.882 +/-1.04%        2.845 +/-1.02%   (+1.29%)
64 groups      2.809 +/-1.30%        2.712 +/-1.17%   (+3.45%)
128 groups     3.129 +/-9.74%        2.813 +/-6.22%   (+9.11%)
256 groups     3.559 +/-11.07%       3.020 +/-1.75%  (+15.15%)

dbench
1 groups     330.897 +/-0.27%      330.612 +/-0.77%   (-0.09%)
4 groups     932.922 +/-0.54%      941.817 +/-1.10%   (+0.95%)
16 groups   1932.346 +/-1.37%     1962.944 +/-0.62%   (+1.58%)
32 groups   2251.079 +/-7.93%     2418.531 +/-0.69%   (+7.44%)
64 groups   2104.114 +/-9.67%     2348.698 +/-11.24% (+11.62%)
128 groups  2093.756 +/-7.26%     2278.156 +/-9.74%   (+8.81%)
256 groups  1216.736 +/-2.46%     1665.774 +/-4.68%  (+36.91%)

tip/sched/core sha1:
  a1dc0446d649 ('sched/core: Silence a warning in sched_init()')

Changes since v1:
- Fixed some bugs
- Used a switch statement
- Renamed env->src_grp_type to env->balance_type
- Split patches into smaller ones
- Added comments

Vincent Guittot (8):
  sched/fair: clean up asym packing
  sched/fair: rename sum_nr_running to sum_h_nr_running
  sched/fair: remove meaningless imbalance calculation
  sched/fair: rework load_balance
  sched/fair: use rq->nr_running when balancing load
  sched/fair: use load instead of runnable load
  sched/fair: evenly spread tasks when not overloaded
  sched/fair: use utilization to select misfit task

 kernel/sched/fair.c  | 769 ++++++++++++++++++++++++++++-----------------------
 kernel/sched/sched.h |   2 +-
 2 files changed, 419 insertions(+), 352 deletions(-)

-- 
2.7.4



* [PATCH v2 1/8] sched/fair: clean up asym packing
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 2/8] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Clean up asym packing to follow the default load balance behavior:
- classify the group by creating a group_asym_capacity field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it.

We no longer need to test the same conditions twice to detect asym
packing, and the imbalance calculation is consolidated in
calculate_imbalance().

There are no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
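
For review, here is a minimal userspace sketch of the intended flow (not
kernel code; the struct and values below are simplified stand-ins for
sg_lb_stats): the busiest-group selection only sets the flag, and
calculate_imbalance() is the single place that turns it into an imbalance.

#include <stdio.h>

/* Simplified stand-in for the kernel's per-group statistics. */
struct sg_stats {
	int group_asym_capacity;	/* flagged while picking the busiest group */
	unsigned long group_load;
};

/* The whole asym-packing imbalance now lives in one place. */
static unsigned long calc_imbalance(const struct sg_stats *busiest)
{
	if (busiest->group_asym_capacity)
		return busiest->group_load; /* move all load to the preferred CPU */

	return 0; /* other group types handled by the existing paths */
}

int main(void)
{
	struct sg_stats busiest = { .group_asym_capacity = 1, .group_load = 1024 };

	printf("imbalance = %lu\n", calc_imbalance(&busiest));
	return 0;
}
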
 kernel/sched/fair.c | 63 ++++++++++++++---------------------------------------
 1 file changed, 16 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb75c0b..b432349 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7743,6 +7743,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	int group_no_capacity;
+	int group_asym_capacity;
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -8197,9 +8198,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * ASYM_PACKING needs to move all the work to the highest
 	 * prority CPUs in the group, therefore mark all groups
 	 * of lower priority than ourself as busy.
+	 *
+	 * This is primarily intended to used at the sibling level.  Some
+	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
+	 * case of POWER7, it can move to lower SMT modes only when higher
+	 * threads are idle.  When in lower SMT modes, the threads will
+	 * perform better since they share less core resources.  Hence when we
+	 * have idle threads, we want them to be the higher ones.
 	 */
 	if (sgs->sum_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
+		sgs->group_asym_capacity = 1;
 		if (!sds->busiest)
 			return true;
 
@@ -8341,51 +8350,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 }
 
 /**
- * check_asym_packing - Check to see if the group is packed into the
- *			sched domain.
- *
- * This is primarily intended to used at the sibling level.  Some
- * cores like POWER7 prefer to use lower numbered SMT threads.  In the
- * case of POWER7, it can move to lower SMT modes only when higher
- * threads are idle.  When in lower SMT modes, the threads will
- * perform better since they share less core resources.  Hence when we
- * have idle threads, we want them to be the higher ones.
- *
- * This packing function is run on idle threads.  It checks to see if
- * the busiest CPU in this domain (core in the P7 case) has a higher
- * CPU number than the packing function is being run on.  Here we are
- * assuming lower CPU number will be equivalent to lower a SMT thread
- * number.
- *
- * Return: 1 when packing is required and a task should be moved to
- * this CPU.  The amount of the imbalance is returned in env->imbalance.
- *
- * @env: The load balancing environment.
- * @sds: Statistics of the sched_domain which is to be packed
- */
-static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	int busiest_cpu;
-
-	if (!(env->sd->flags & SD_ASYM_PACKING))
-		return 0;
-
-	if (env->idle == CPU_NOT_IDLE)
-		return 0;
-
-	if (!sds->busiest)
-		return 0;
-
-	busiest_cpu = sds->busiest->asym_prefer_cpu;
-	if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
-		return 0;
-
-	env->imbalance = sds->busiest_stat.group_load;
-
-	return 1;
-}
-
-/**
  * fix_small_imbalance - Calculate the minor imbalance that exists
  *			amongst the groups of a sched_domain, during
  *			load balancing.
@@ -8469,6 +8433,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	if (busiest->group_asym_capacity) {
+		env->imbalance = busiest->group_load;
+		return;
+	}
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -8573,8 +8542,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	busiest = &sds.busiest_stat;
 
 	/* ASYM feature bypasses nice load balance check */
-	if (check_asym_packing(env, &sds))
-		return sds.busiest;
+	if (busiest->group_asym_capacity)
+		goto force_balance;
 
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || busiest->sum_nr_running == 0)
-- 
2.7.4



* [PATCH v2 2/8] sched/fair: rename sum_nr_running to sum_h_nr_running
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 1/8] sched/fair: clean up asym packing Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 3/8] sched/fair: remove meaningless imbalance calculation Vincent Guittot
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Rename sum_nr_running to sum_h_nr_running because it effectively tracks
cfs->h_nr_running, so sum_nr_running can be used to track rq->nr_running
when needed.

There are no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
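
As a side note for reviewers, a tiny userspace sketch (not kernel code;
the toy_rq type is hypothetical) of the distinction behind the rename:
rq->nr_running counts tasks of every scheduling class, while
cfs.h_nr_running counts only the CFS tasks in the hierarchy, which is
what the group stats actually sum today.

#include <stdio.h>

/* Toy model of a runqueue; the fields only mirror the kernel's meaning. */
struct toy_rq {
	unsigned int nr_running;	/* CFS + RT + DL tasks */
	unsigned int cfs_h_nr_running;	/* CFS tasks only */
};

int main(void)
{
	/* One CFS task plus one RT task on the same CPU. */
	struct toy_rq rq = { .nr_running = 2, .cfs_h_nr_running = 1 };

	/* What the group stats sum today, hence the sum_h_nr_running name ... */
	unsigned int sum_h_nr_running = rq.cfs_h_nr_running;
	/* ... leaving sum_nr_running free for rq->nr_running later in the series. */
	unsigned int sum_nr_running = rq.nr_running;

	printf("sum_h_nr_running=%u sum_nr_running=%u\n",
	       sum_h_nr_running, sum_nr_running);
	return 0;
}
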
 kernel/sched/fair.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b432349..d7f76b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7738,7 +7738,7 @@ struct sg_lb_stats {
 	unsigned long load_per_task;
 	unsigned long group_capacity;
 	unsigned long group_util; /* Total utilization of the group */
-	unsigned int sum_nr_running; /* Nr tasks running in the group */
+	unsigned int sum_h_nr_running; /* Nr tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
@@ -7783,7 +7783,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.total_capacity = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
-			.sum_nr_running = 0,
+			.sum_h_nr_running = 0,
 			.group_type = group_other,
 		},
 	};
@@ -7974,7 +7974,7 @@ static inline int sg_imbalanced(struct sched_group *group)
 static inline bool
 group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running < sgs->group_weight)
+	if (sgs->sum_h_nr_running < sgs->group_weight)
 		return true;
 
 	if ((sgs->group_capacity * 100) >
@@ -7995,7 +7995,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running <= sgs->group_weight)
+	if (sgs->sum_h_nr_running <= sgs->group_weight)
 		return false;
 
 	if ((sgs->group_capacity * 100) <
@@ -8087,7 +8087,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += cpu_runnable_load(rq);
 		sgs->group_util += cpu_util(i);
-		sgs->sum_nr_running += rq->cfs.h_nr_running;
+		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
 		if (nr_running > 1)
@@ -8117,8 +8117,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
-	if (sgs->sum_nr_running)
-		sgs->load_per_task = sgs->group_load / sgs->sum_nr_running;
+	if (sgs->sum_h_nr_running)
+		sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
 
 	sgs->group_weight = group->group_weight;
 
@@ -8175,7 +8175,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * capable CPUs may harm throughput. Maximize throughput,
 	 * power/energy consequences are not considered.
 	 */
-	if (sgs->sum_nr_running <= sgs->group_weight &&
+	if (sgs->sum_h_nr_running <= sgs->group_weight &&
 	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
@@ -8206,7 +8206,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * perform better since they share less core resources.  Hence when we
 	 * have idle threads, we want them to be the higher ones.
 	 */
-	if (sgs->sum_nr_running &&
+	if (sgs->sum_h_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
 		sgs->group_asym_capacity = 1;
 		if (!sds->busiest)
@@ -8224,9 +8224,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 #ifdef CONFIG_NUMA_BALANCING
 static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running > sgs->nr_numa_running)
+	if (sgs->sum_h_nr_running > sgs->nr_numa_running)
 		return regular;
-	if (sgs->sum_nr_running > sgs->nr_preferred_running)
+	if (sgs->sum_h_nr_running > sgs->nr_preferred_running)
 		return remote;
 	return all;
 }
@@ -8301,7 +8301,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		 */
 		if (prefer_sibling && sds->local &&
 		    group_has_capacity(env, local) &&
-		    (sgs->sum_nr_running > local->sum_nr_running + 1)) {
+		    (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
 			sgs->group_no_capacity = 1;
 			sgs->group_type = group_classify(sg, sgs);
 		}
@@ -8313,7 +8313,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 next_group:
 		/* Now, start updating sd_lb_stats */
-		sds->total_running += sgs->sum_nr_running;
+		sds->total_running += sgs->sum_h_nr_running;
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
@@ -8367,7 +8367,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (!local->sum_nr_running)
+	if (!local->sum_h_nr_running)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)
 		imbn = 1;
@@ -8465,7 +8465,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
+		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
 		if (load_above_capacity > busiest->group_capacity) {
 			load_above_capacity -= busiest->group_capacity;
 			load_above_capacity *= scale_load_down(NICE_0_LOAD);
@@ -8546,7 +8546,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || busiest->sum_nr_running == 0)
+	if (!sds.busiest || busiest->sum_h_nr_running == 0)
 		goto out_balanced;
 
 	/* XXX broken for overlapping NUMA groups */
@@ -8868,7 +8868,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	env.src_rq = busiest;
 
 	ld_moved = 0;
-	if (busiest->nr_running > 1) {
+	if (busiest->cfs.h_nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
2.7.4



* [PATCH v2 3/8] sched/fair: remove meaningless imbalance calculation
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 1/8] sched/fair: clean up asym packing Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 2/8] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 4/8] sched/fair: rework load_balance Vincent Guittot
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Clean up load_balance and remove meaningless calculations and fields
before adding the new algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 105 +---------------------------------------------------
 1 file changed, 1 insertion(+), 104 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7f76b0..d7f4a7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5450,18 +5450,6 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
-static unsigned long cpu_avg_load_per_task(int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
-	unsigned long load_avg = cpu_runnable_load(rq);
-
-	if (nr_running)
-		return load_avg / nr_running;
-
-	return 0;
-}
-
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -7735,7 +7723,6 @@ static unsigned long task_h_load(struct task_struct *p)
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long load_per_task;
 	unsigned long group_capacity;
 	unsigned long group_util; /* Total utilization of the group */
 	unsigned int sum_h_nr_running; /* Nr tasks running in the group */
@@ -8117,9 +8104,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
-	if (sgs->sum_h_nr_running)
-		sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
-
 	sgs->group_weight = group->group_weight;
 
 	sgs->group_no_capacity = group_is_overloaded(env, sgs);
@@ -8350,76 +8334,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 }
 
 /**
- * fix_small_imbalance - Calculate the minor imbalance that exists
- *			amongst the groups of a sched_domain, during
- *			load balancing.
- * @env: The load balancing environment.
- * @sds: Statistics of the sched_domain whose imbalance is to be calculated.
- */
-static inline
-void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	unsigned long tmp, capa_now = 0, capa_move = 0;
-	unsigned int imbn = 2;
-	unsigned long scaled_busy_load_per_task;
-	struct sg_lb_stats *local, *busiest;
-
-	local = &sds->local_stat;
-	busiest = &sds->busiest_stat;
-
-	if (!local->sum_h_nr_running)
-		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
-	else if (busiest->load_per_task > local->load_per_task)
-		imbn = 1;
-
-	scaled_busy_load_per_task =
-		(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
-		busiest->group_capacity;
-
-	if (busiest->avg_load + scaled_busy_load_per_task >=
-	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
-		env->imbalance = busiest->load_per_task;
-		return;
-	}
-
-	/*
-	 * OK, we don't have enough imbalance to justify moving tasks,
-	 * however we may be able to increase total CPU capacity used by
-	 * moving them.
-	 */
-
-	capa_now += busiest->group_capacity *
-			min(busiest->load_per_task, busiest->avg_load);
-	capa_now += local->group_capacity *
-			min(local->load_per_task, local->avg_load);
-	capa_now /= SCHED_CAPACITY_SCALE;
-
-	/* Amount of load we'd subtract */
-	if (busiest->avg_load > scaled_busy_load_per_task) {
-		capa_move += busiest->group_capacity *
-			    min(busiest->load_per_task,
-				busiest->avg_load - scaled_busy_load_per_task);
-	}
-
-	/* Amount of load we'd add */
-	if (busiest->avg_load * busiest->group_capacity <
-	    busiest->load_per_task * SCHED_CAPACITY_SCALE) {
-		tmp = (busiest->avg_load * busiest->group_capacity) /
-		      local->group_capacity;
-	} else {
-		tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) /
-		      local->group_capacity;
-	}
-	capa_move += local->group_capacity *
-		    min(local->load_per_task, local->avg_load + tmp);
-	capa_move /= SCHED_CAPACITY_SCALE;
-
-	/* Move if we gain throughput */
-	if (capa_move > capa_now)
-		env->imbalance = busiest->load_per_task;
-}
-
-/**
  * calculate_imbalance - Calculate the amount of imbalance present within the
  *			 groups of a given sched_domain during load balance.
  * @env: load balance environment
@@ -8438,15 +8352,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
-	if (busiest->group_type == group_imbalanced) {
-		/*
-		 * In the group_imb case we cannot rely on group-wide averages
-		 * to ensure CPU-load equilibrium, look at wider averages. XXX
-		 */
-		busiest->load_per_task =
-			min(busiest->load_per_task, sds->avg_load);
-	}
-
 	/*
 	 * Avg load of busiest sg can be less and avg load of local sg can
 	 * be greater than avg load across all sgs of sd because avg load
@@ -8457,7 +8362,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	    (busiest->avg_load <= sds->avg_load ||
 	     local->avg_load >= sds->avg_load)) {
 		env->imbalance = 0;
-		return fix_small_imbalance(env, sds);
+		return;
 	}
 
 	/*
@@ -8495,14 +8400,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 				       busiest->group_misfit_task_load);
 	}
 
-	/*
-	 * if *imbalance is less than the average load per runnable task
-	 * there is no guarantee that any tasks will be moved so we'll have
-	 * a think about bumping its value to force at least one task to be
-	 * moved
-	 */
-	if (env->imbalance < busiest->load_per_task)
-		return fix_small_imbalance(env, sds);
 }
 
 /******* find_busiest_group() helpers end here *********************/
-- 
2.7.4



* [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (2 preceding siblings ...)
  2019-08-01 14:40 ` [PATCH v2 3/8] sched/fair: remove meaningless imbalance calculation Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-05 17:07   ` Valentin Schneider
                     ` (2 more replies)
  2019-08-01 14:40 ` [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load Vincent Guittot
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

The load_balance algorithm contains some heuristics which have become
meaningless since the rework of the metrics and the introduction of
PELT.

Furthermore, it's sometimes difficult to fix wrong scheduling decisions
because everything is based on load whereas some imbalances are not
related to load. The current algorithm ends up creating virtual and
meaningless values like avg_load_per_task, or tweaking the state of a
group to make it look overloaded whereas it's not, in order to try to
migrate tasks.

load_balance should better qualify the imbalance of the group and define
clearly what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of
imbalance. We now have:
	group_has_spare
	group_fully_busy
	group_misfit_task
	group_asym_capacity
	group_imbalanced
	group_overloaded

Based on the type of sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before, but
also some utilization, a number of tasks or a type of task:
	migrate_task
	migrate_util
	migrate_load
	migrate_misfit

This new load_balance algorithm fixes several pending wrong task
placements:
- the 1 task per CPU case with asymmetric systems
- the case of CFS tasks preempted by tasks from other classes
- the case of tasks not evenly spread on groups with spare capacity

The load balance decisions have been gathered in 3 functions:
- update_sd_pick_busiest() selects the busiest sched_group.
- find_busiest_group() checks if there is an imbalance between the local
  and busiest groups.
- calculate_imbalance() decides what has to be moved.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
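
To help the review, below is a standalone userspace sketch of the "local
group has spare capacity" branch of the reworked calculate_imbalance().
The types and numbers are simplified stand-ins, not the kernel's
structures: fill spare capacity with utilization when busiest is
overloaded, otherwise even out the number of tasks or idle CPUs.

#include <stdio.h>

enum migration_type { migrate_task, migrate_util, migrate_load, migrate_misfit };

struct sg {
	unsigned long capacity;		/* group capacity */
	unsigned long util;		/* group utilization */
	unsigned int nr_running;	/* tasks in the group */
	unsigned int idle_cpus;
	unsigned int weight;		/* nr of CPUs in the group */
};

static enum migration_type spare_capacity_imbalance(const struct sg *local,
						    const struct sg *busiest,
						    int busiest_overloaded,
						    int prefer_sibling,
						    unsigned long *imbalance)
{
	if (busiest_overloaded) {
		/* Fill local's spare capacity with utilization from busiest. */
		unsigned long cap = local->capacity > local->util ?
				    local->capacity : local->util;
		*imbalance = cap - local->util;
		return migrate_util;
	}

	if (busiest->weight == 1 || prefer_sibling) {
		/* Spread the running tasks evenly between the two groups. */
		*imbalance = (busiest->nr_running - local->nr_running) >> 1;
		return migrate_task;
	}

	/* No overload anywhere: just even out the number of idle CPUs. */
	*imbalance = local->idle_cpus > busiest->idle_cpus ?
		     (local->idle_cpus - busiest->idle_cpus) >> 1 : 0;
	return migrate_task;
}

int main(void)
{
	struct sg local   = { 1024, 256, 2, 2, 4 };
	struct sg busiest = { 1024, 900, 6, 0, 4 };
	unsigned long imb;

	/* prefer_sibling set: pull (6 - 2) / 2 = 2 tasks from busiest. */
	spare_capacity_imbalance(&local, &busiest, 0, 1, &imb);
	printf("pull %lu tasks\n", imb);
	return 0;
}
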
 kernel/sched/fair.c | 581 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 379 insertions(+), 202 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7f4a7e..a8681c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7136,13 +7136,28 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+/*
+ * group_type describes the group of CPUs at the moment of the load balance.
+ * The enum is ordered by pulling priority, with the group with lowest priority
+ * first so the group_type can be simply compared when selecting the busiest
+ * group. see update_sd_pick_busiest().
+ */
 enum group_type {
-	group_other = 0,
+	group_has_spare = 0,
+	group_fully_busy,
 	group_misfit_task,
+	group_asym_capacity,
 	group_imbalanced,
 	group_overloaded,
 };
 
+enum migration_type {
+	migrate_task = 0,
+	migrate_util,
+	migrate_load,
+	migrate_misfit,
+};
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
@@ -7173,7 +7188,7 @@ struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
-	enum group_type		src_grp_type;
+	enum migration_type	balance_type;
 	struct list_head	tasks;
 };
 
@@ -7405,7 +7420,7 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
-	unsigned long load;
+	unsigned long util, load;
 	int detached = 0;
 
 	lockdep_assert_held(&env->src_rq->lock);
@@ -7438,19 +7453,53 @@ static int detach_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		load = task_h_load(p);
+		switch (env->balance_type) {
+		case migrate_task:
+			/* Migrate task */
+			env->imbalance--;
+			break;
 
-		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
-			goto next;
+		case migrate_util:
+			util = task_util_est(p);
 
-		if ((load / 2) > env->imbalance)
-			goto next;
+			if (util > env->imbalance)
+				goto next;
+
+			env->imbalance -= util;
+			break;
+
+		case migrate_load:
+			load = task_h_load(p);
+
+			if (sched_feat(LB_MIN) &&
+			    load < 16 && !env->sd->nr_balance_failed)
+				goto next;
+
+			if ((load / 2) > env->imbalance)
+				goto next;
+
+			env->imbalance -= load;
+			break;
+
+		case migrate_misfit:
+			load = task_h_load(p);
+
+			/*
+			 * utilization of misfit task might decrease a bit
+			 * since it has been recorded. Be conservative in the
+			 * condition.
+			 */
+			if (load < env->imbalance)
+				goto next;
+
+			env->imbalance = 0;
+			break;
+		}
 
 		detach_task(p, env);
 		list_add(&p->se.group_node, &env->tasks);
 
 		detached++;
-		env->imbalance -= load;
 
 #ifdef CONFIG_PREEMPT
 		/*
@@ -7729,8 +7778,7 @@ struct sg_lb_stats {
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_no_capacity;
-	int group_asym_capacity;
+	unsigned int group_asym_capacity; /* tasks should be moved to the preferred CPU */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -7745,10 +7793,10 @@ struct sg_lb_stats {
 struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
-	unsigned long total_running;
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
+	unsigned int prefer_sibling; /* tasks should go to sibling first */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
@@ -7765,13 +7813,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 	*sds = (struct sd_lb_stats){
 		.busiest = NULL,
 		.local = NULL,
-		.total_running = 0UL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
 		.busiest_stat = {
-			.avg_load = 0UL,
-			.sum_h_nr_running = 0,
-			.group_type = group_other,
+			.idle_cpus = UINT_MAX,
+			.group_type = group_has_spare,
 		},
 	};
 }
@@ -8013,19 +8059,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 }
 
 static inline enum
-group_type group_classify(struct sched_group *group,
+group_type group_classify(struct lb_env *env,
+			  struct sched_group *group,
 			  struct sg_lb_stats *sgs)
 {
-	if (sgs->group_no_capacity)
+	if (group_is_overloaded(env, sgs))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
+	if (sgs->group_asym_capacity)
+		return group_asym_capacity;
+
 	if (sgs->group_misfit_task_load)
 		return group_misfit_task;
 
-	return group_other;
+	if (!group_has_capacity(env, sgs))
+		return group_fully_busy;
+
+	return group_has_spare;
 }
 
 static bool update_nohz_stats(struct rq *rq, bool force)
@@ -8062,10 +8115,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 				      struct sg_lb_stats *sgs,
 				      int *sg_status)
 {
-	int i, nr_running;
+	int i, nr_running, local_group;
 
 	memset(sgs, 0, sizeof(*sgs));
 
+	local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
@@ -8090,9 +8145,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i))
+		if (!nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
+			/* Idle cpu can't have misfit task */
+			continue;
+		}
+
+		if (local_group)
+			continue;
 
+		/* Check for a misfit task on the cpu */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
@@ -8100,14 +8162,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
-	/* Adjust by relative CPU capacity of the group */
+	/* Check if dst cpu is idle and preferred to this group */
+	if (env->sd->flags & SD_ASYM_PACKING &&
+	    env->idle != CPU_NOT_IDLE &&
+	    sgs->sum_h_nr_running &&
+	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+		sgs->group_asym_capacity = 1;
+	}
+
 	sgs->group_capacity = group->sgc->capacity;
-	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
 
 	sgs->group_weight = group->group_weight;
 
-	sgs->group_no_capacity = group_is_overloaded(env, sgs);
-	sgs->group_type = group_classify(group, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
+
+	/* Computing avg_load makes sense only when group is overloaded */
+	if (sgs->group_type == group_overloaded)
+		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
+				sgs->group_capacity;
 }
 
 /**
@@ -8130,6 +8202,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+
+	/* Make sure that there is at least one task to pull */
+	if (!sgs->sum_h_nr_running)
+		return false;
 	/*
 	 * Don't try to pull misfit tasks we can't help.
 	 * We can use max_capacity here as reduction in capacity on some
@@ -8138,7 +8214,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if (sgs->group_type == group_misfit_task &&
 	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
-	     !group_has_capacity(env, &sds->local_stat)))
+	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
 	if (sgs->group_type > busiest->group_type)
@@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
+	/*
+	 * The candidate and the current busiest group are the same type of
+	 * group. Let's check which one is the busiest according to the type.
+	 */
+
+	switch (sgs->group_type) {
+	case group_overloaded:
+		/* Select the overloaded group with highest avg_load. */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_imbalanced:
+		/* Select the 1st imbalanced group as we don't have
+		 * any way to choose one more than another
+		 */
 		return false;
+		break;
 
-	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
-		goto asym_packing;
+	case group_asym_capacity:
+		/* Prefer to move from lowest priority CPU's work */
+		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
+			 return false;
+		break;
+
+	case group_misfit_task:
+		/*
+		 * If we have more than one misfit sg go with the
+		 * biggest misfit.
+		 */
+		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+			return false;
+		break;
+
+	case group_fully_busy:
+		/*
+		 * Select the fully busy group with highest avg_load.
+		 * In theory, there is no need to pull task from such
+		 * kind of group because tasks have all compute
+		 * capacity that they need but we can still improve the
+		 * overall throughput by reducing contention when
+		 * accessing shared HW resources.
+		 * XXX for now avg_load is not computed and always 0 so
+		 * we select the 1st one.
+		 */
+		if (sgs->avg_load <= busiest->avg_load)
+			return false;
+		break;
+
+	case group_has_spare:
+		/*
+		 * Select not overloaded group with lowest number of
+		 * idle cpus. We could also compare the spare capacity
+		 * which is more stable but it can end up that the
+		 * group has less spare capacity but finally more idle
+		 * cpus which means less opportunity to pull tasks.
+		 */
+		if (sgs->idle_cpus >= busiest->idle_cpus)
+			return false;
+		break;
+	}
 
 	/*
 	 * Candidate sg has no more than one task per CPU and
@@ -8159,50 +8291,12 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * capable CPUs may harm throughput. Maximize throughput,
 	 * power/energy consequences are not considered.
 	 */
-	if (sgs->sum_h_nr_running <= sgs->group_weight &&
-	    group_smaller_min_cpu_capacity(sds->local, sg))
-		return false;
-
-	/*
-	 * If we have more than one misfit sg go with the biggest misfit.
-	 */
-	if (sgs->group_type == group_misfit_task &&
-	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+	    (sgs->group_type <= group_fully_busy) &&
+	    (group_smaller_min_cpu_capacity(sds->local, sg)))
 		return false;
 
-asym_packing:
-	/* This is the busiest node in its class. */
-	if (!(env->sd->flags & SD_ASYM_PACKING))
-		return true;
-
-	/* No ASYM_PACKING if target CPU is already busy */
-	if (env->idle == CPU_NOT_IDLE)
-		return true;
-	/*
-	 * ASYM_PACKING needs to move all the work to the highest
-	 * prority CPUs in the group, therefore mark all groups
-	 * of lower priority than ourself as busy.
-	 *
-	 * This is primarily intended to used at the sibling level.  Some
-	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
-	 * case of POWER7, it can move to lower SMT modes only when higher
-	 * threads are idle.  When in lower SMT modes, the threads will
-	 * perform better since they share less core resources.  Hence when we
-	 * have idle threads, we want them to be the higher ones.
-	 */
-	if (sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
-		sgs->group_asym_capacity = 1;
-		if (!sds->busiest)
-			return true;
-
-		/* Prefer to move from lowest priority CPU's work */
-		if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
-				      sg->asym_prefer_cpu))
-			return true;
-	}
-
-	return false;
+	return true;
 }
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -8240,13 +8334,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  * @env: The load balancing environment.
  * @sds: variable to hold the statistics for this sched_domain.
  */
+
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 	int sg_status = 0;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -8273,22 +8367,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (local_group)
 			goto next_group;
 
-		/*
-		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity so that we'll try
-		 * and move all the excess tasks away. We lower the capacity
-		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks. The extra check prevents the case where
-		 * you always pull from the heaviest group when it is already
-		 * under-utilized (possible with a large weight task outweighs
-		 * the tasks on the system).
-		 */
-		if (prefer_sibling && sds->local &&
-		    group_has_capacity(env, local) &&
-		    (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
-			sgs->group_no_capacity = 1;
-			sgs->group_type = group_classify(sg, sgs);
-		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -8297,13 +8375,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 next_group:
 		/* Now, start updating sd_lb_stats */
-		sds->total_running += sgs->sum_h_nr_running;
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
+	/* Tag domain that child domain prefers tasks go to siblings first */
+	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+
 #ifdef CONFIG_NO_HZ_COMMON
 	if ((env->flags & LBF_NOHZ_AGAIN) &&
 	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
@@ -8341,69 +8421,132 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
+	if (busiest->group_type == group_imbalanced) {
+		/*
+		 * In the group_imb case we cannot rely on group-wide averages
+		 * to ensure CPU-load equilibrium, try to move any task to fix
+		 * the imbalance. The next load balance will take care of
+		 * balancing back the system.
+		 */
+		env->balance_type = migrate_task;
+		env->imbalance = 1;
+		return;
+	}
 
-	if (busiest->group_asym_capacity) {
+	if (busiest->group_type == group_asym_capacity) {
+		/*
+		 * In case of asym capacity, we will try to migrate all load
+		 * to the preferred CPU
+		 */
+		env->balance_type = migrate_load;
 		env->imbalance = busiest->group_load;
 		return;
 	}
 
+	if (busiest->group_type == group_misfit_task) {
+		/* Set imbalance to allow misfit task to be balanced. */
+		env->balance_type = migrate_misfit;
+		env->imbalance = busiest->group_misfit_task_load;
+		return;
+	}
+
 	/*
-	 * Avg load of busiest sg can be less and avg load of local sg can
-	 * be greater than avg load across all sgs of sd because avg load
-	 * factors in sg capacity and sgs with smaller group_type are
-	 * skipped when updating the busiest sg:
+	 * Try to use spare capacity of local group without overloading it or
+	 * emptying busiest
 	 */
-	if (busiest->group_type != group_misfit_task &&
-	    (busiest->avg_load <= sds->avg_load ||
-	     local->avg_load >= sds->avg_load)) {
-		env->imbalance = 0;
+	if (local->group_type == group_has_spare) {
+		if (busiest->group_type > group_fully_busy) {
+			/*
+			 * If busiest is overloaded, try to fill spare
+			 * capacity. This might end up creating spare capacity
+			 * in busiest or busiest still being overloaded but
+			 * there is no simple way to directly compute the
+			 * amount of load to migrate in order to balance the
+			 * system.
+			 */
+			env->balance_type = migrate_util;
+			env->imbalance = max(local->group_capacity, local->group_util) -
+				    local->group_util;
+			return;
+		}
+
+		if (busiest->group_weight == 1 || sds->prefer_sibling) {
+			/*
+			 * When prefer sibling, evenly spread running tasks on
+			 * groups.
+			 */
+			env->balance_type = migrate_task;
+			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+			return;
+		}
+
+		/*
+		 * If there is no overload, we just want to even the number of
+		 * idle cpus.
+		 */
+		env->balance_type = migrate_task;
+		env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
 		return;
 	}
 
 	/*
-	 * If there aren't any idle CPUs, avoid creating some.
+	 * Local is fully busy but has to take more load to relieve the
+	 * busiest group
 	 */
-	if (busiest->group_type == group_overloaded &&
-	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
-		if (load_above_capacity > busiest->group_capacity) {
-			load_above_capacity -= busiest->group_capacity;
-			load_above_capacity *= scale_load_down(NICE_0_LOAD);
-			load_above_capacity /= busiest->group_capacity;
-		} else
-			load_above_capacity = ~0UL;
+	if (local->group_type < group_overloaded) {
+		/*
+		 * Local will become overloaded so the avg_load metrics are
+		 * finally needed
+		 */
+
+		local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
+				  local->group_capacity;
+
+		sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
+				sds->total_capacity;
 	}
 
 	/*
-	 * We're trying to get all the CPUs to the average_load, so we don't
-	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded CPU below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group
-	 * capacity. Thus we look for the minimum possible imbalance.
+	 * Both group are or will become overloaded and we're trying to get
+	 * all the CPUs to the average_load, so we don't want to push
+	 * ourselves above the average load, nor do we wish to reduce the
+	 * max loaded CPU below the average load. At the same time, we also
+	 * don't want to reduce the group load below the group capacity.
+	 * Thus we look for the minimum possible imbalance.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
-
-	/* How much load to actually move to equalise the imbalance */
+	env->balance_type = migrate_load;
 	env->imbalance = min(
-		max_pull * busiest->group_capacity,
+		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
-
-	/* Boost imbalance to allow misfit task to be balanced. */
-	if (busiest->group_type == group_misfit_task) {
-		env->imbalance = max_t(long, env->imbalance,
-				       busiest->group_misfit_task_load);
-	}
-
 }
 
 /******* find_busiest_group() helpers end here *********************/
 
+/*
+ * Decision matrix according to the local and busiest group state
+ *
+ * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
+ * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
+ * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
+ * misfit_task      force     N/A        N/A    N/A  force      force
+ * asym_capacity    force     force      N/A    N/A  force      force
+ * imbalanced       force     force      N/A    N/A  force      force
+ * overloaded       force     force      N/A    N/A  force      avg_load
+ *
+ * N/A :      Not Applicable because already filtered while updating
+ *            statistics.
+ * balanced : The system is balanced for these 2 groups.
+ * force :    Calculate the imbalance as load migration is probably needed.
+ * avg_load : Only if imbalance is significant enough.
+ * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
+ *            different in groups.
+ */
+
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
  * if there is an imbalance.
@@ -8438,17 +8581,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
-	/* ASYM feature bypasses nice load balance check */
-	if (busiest->group_asym_capacity)
-		goto force_balance;
-
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || busiest->sum_h_nr_running == 0)
 		goto out_balanced;
 
-	/* XXX broken for overlapping NUMA groups */
-	sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
-						/ sds.total_capacity;
+	/* Misfit tasks should be dealt with regardless of the avg load */
+	if (busiest->group_type == group_misfit_task)
+		goto force_balance;
+
+	/* ASYM feature bypasses nice load balance check */
+	if (busiest->group_type == group_asym_capacity)
+		goto force_balance;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -8459,59 +8602,71 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/*
-	 * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
-	 * capacities from resulting in underutilization due to avg_load.
-	 */
-	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
-	    busiest->group_no_capacity)
-		goto force_balance;
-
-	/* Misfit tasks should be dealt with regardless of the avg load */
-	if (busiest->group_type == group_misfit_task)
-		goto force_balance;
-
-	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (local->avg_load >= busiest->avg_load)
+	if (local->group_type > busiest->group_type)
 		goto out_balanced;
 
 	/*
-	 * Don't pull any tasks if this group is already above the domain
-	 * average load.
+	 * When groups are overloaded, use the avg_load to ensure fairness
+	 * between tasks.
 	 */
-	if (local->avg_load >= sds.avg_load)
-		goto out_balanced;
+	if (local->group_type == group_overloaded) {
+		/*
+		 * If the local group is more loaded than the selected
+		 * busiest group don't try and pull any tasks.
+		 */
+		if (local->avg_load >= busiest->avg_load)
+			goto out_balanced;
+
+		/* XXX broken for overlapping NUMA groups */
+		sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
+				sds.total_capacity;
 
-	if (env->idle == CPU_IDLE) {
 		/*
-		 * This CPU is idle. If the busiest group is not overloaded
-		 * and there is no imbalance between this and busiest group
-		 * wrt idle CPUs, it is balanced. The imbalance becomes
-		 * significant if the diff is greater than 1 otherwise we
-		 * might end up to just move the imbalance on another group
+		 * Don't pull any tasks if this group is already above the
+		 * domain average load.
 		 */
-		if ((busiest->group_type != group_overloaded) &&
-				(local->idle_cpus <= (busiest->idle_cpus + 1)))
+		if (local->avg_load >= sds.avg_load)
 			goto out_balanced;
-	} else {
+
 		/*
-		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
-		 * imbalance_pct to be conservative.
+		 * If the busiest group is more loaded, use imbalance_pct to be
+		 * conservative.
 		 */
 		if (100 * busiest->avg_load <=
 				env->sd->imbalance_pct * local->avg_load)
 			goto out_balanced;
+
 	}
 
+	/* Try to move all excess tasks to child's sibling domain */
+	if (sds.prefer_sibling && local->group_type == group_has_spare &&
+	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+		goto force_balance;
+
+	if (busiest->group_type != group_overloaded &&
+	     (env->idle == CPU_NOT_IDLE ||
+	      local->idle_cpus <= (busiest->idle_cpus + 1)))
+		/*
+		 * If the busiest group is not overloaded
+		 * and there is no imbalance between this and busiest group
+		 * wrt idle CPUs, it is balanced. The imbalance
+		 * becomes significant if the diff is greater than 1 otherwise
+		 * we might end up to just move the imbalance on another
+		 * group.
+		 */
+		goto out_balanced;
+
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
-	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
+
 	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
+
 	env->imbalance = 0;
 	return NULL;
 }
@@ -8523,11 +8678,13 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long busiest_load = 0, busiest_capacity = 1;
+	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
+	unsigned int busiest_nr = 0;
 	int i;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
-		unsigned long capacity, load;
+		unsigned long capacity, load, util;
+		unsigned int nr_running;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -8555,20 +8712,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
-		/*
-		 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
-		 * seek the "biggest" misfit task.
-		 */
-		if (env->src_grp_type == group_misfit_task) {
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
-				busiest = rq;
-			}
-
-			continue;
-		}
-
 		capacity = capacity_of(i);
+		nr_running = rq->cfs.h_nr_running;
 
 		/*
 		 * For ASYM_CPUCAPACITY domains, don't pick a CPU that could
@@ -8578,35 +8723,67 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
 		    capacity_of(env->dst_cpu) < capacity &&
-		    rq->nr_running == 1)
+		    nr_running == 1)
 			continue;
 
-		load = cpu_runnable_load(rq);
+		switch (env->balance_type) {
+		case migrate_task:
+			if (busiest_nr < nr_running) {
+				busiest_nr = nr_running;
+				busiest = rq;
+			}
+			break;
 
-		/*
-		 * When comparing with imbalance, use cpu_runnable_load()
-		 * which is not scaled with the CPU capacity.
-		 */
+		case migrate_util:
+			util = cpu_util(cpu_of(rq));
 
-		if (rq->nr_running == 1 && load > env->imbalance &&
-		    !check_cpu_capacity(rq, env->sd))
-			continue;
+			if (busiest_util < util) {
+				busiest_util = util;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_load:
+			/*
+			 * When comparing with load imbalance, use cpu_runnable_load()
+			 * which is not scaled with the CPU capacity.
+			 */
+			load = cpu_runnable_load(rq);
+
+			if (nr_running == 1 && load > env->imbalance &&
+			    !check_cpu_capacity(rq, env->sd))
+				break;
+
+			/*
+			 * For the load comparisons with the other CPU's, consider
+			 * the cpu_runnable_load() scaled with the CPU capacity, so
+			 * that the load can be moved away from the CPU that is
+			 * potentially running at a lower capacity.
+			 *
+			 * Thus we're looking for max(load_i / capacity_i), crosswise
+			 * multiplication to rid ourselves of the division works out
+			 * to: load_i * capacity_j > load_j * capacity_i;  where j is
+			 * our previous maximum.
+			 */
+			if (load * busiest_capacity > busiest_load * capacity) {
+				busiest_load = load;
+				busiest_capacity = capacity;
+				busiest = rq;
+			}
+			break;
+
+		case migrate_misfit:
+			/*
+			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+			 * seek the "biggest" misfit task.
+			 */
+			if (rq->misfit_task_load > busiest_load) {
+				busiest_load = rq->misfit_task_load;
+				busiest = rq;
+			}
+
+			break;
 
-		/*
-		 * For the load comparisons with the other CPU's, consider
-		 * the cpu_runnable_load() scaled with the CPU capacity, so
-		 * that the load can be moved away from the CPU that is
-		 * potentially running at a lower capacity.
-		 *
-		 * Thus we're looking for max(load_i / capacity_i), crosswise
-		 * multiplication to rid ourselves of the division works out
-		 * to: load_i * capacity_j > load_j * capacity_i;  where j is
-		 * our previous maximum.
-		 */
-		if (load * busiest_capacity > busiest_load * capacity) {
-			busiest_load = load;
-			busiest_capacity = capacity;
-			busiest = rq;
 		}
 	}
 
@@ -8652,7 +8829,7 @@ voluntary_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->src_grp_type == group_misfit_task)
+	if (env->balance_type == migrate_misfit)
 		return 1;
 
 	return 0;
@@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	env.src_rq = busiest;
 
 	ld_moved = 0;
-	if (busiest->cfs.h_nr_running > 1) {
+	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
2.7.4



* [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (3 preceding siblings ...)
  2019-08-01 14:40 ` [PATCH v2 4/8] sched/fair: rework load_balance Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 6/8] sched/fair: use load instead of runnable load Vincent Guittot
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

The CFS load_balance only takes care of CFS tasks whereas CPUs can also
be used by other scheduling classes. Typically, a CFS task preempted by
an RT or deadline task will not get a chance to be pulled to another CPU
because load_balance doesn't take tasks from other classes into account.
Add the sum of nr_running to the statistics and use it to detect such
situations.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
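
A small userspace illustration of what the extra counter detects (toy
types, not kernel code): when rq->nr_running is larger than
cfs.h_nr_running, part of the runqueue is occupied by RT/DL tasks, so a
CFS task sitting there is worth pulling even though the CFS-only count
looks low.

#include <stdio.h>

/* Toy runqueue; only the meaning of the two counters matters here. */
struct toy_rq {
	unsigned int nr_running;	/* tasks of all classes */
	unsigned int cfs_h_nr_running;	/* CFS tasks only */
};

int main(void)
{
	/* One CFS task preempted by one RT task on the same CPU. */
	struct toy_rq rq = { .nr_running = 2, .cfs_h_nr_running = 1 };

	unsigned int sum_nr_running = rq.nr_running;
	unsigned int sum_h_nr_running = rq.cfs_h_nr_running;

	if (sum_nr_running > sum_h_nr_running)
		printf("non-CFS tasks are running: the CFS task is worth pulling\n");
	return 0;
}
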
 kernel/sched/fair.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a8681c3..f05f1ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7774,6 +7774,7 @@ struct sg_lb_stats {
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long group_capacity;
 	unsigned long group_util; /* Total utilization of the group */
+	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int sum_h_nr_running; /* Nr tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
@@ -8007,7 +8008,7 @@ static inline int sg_imbalanced(struct sched_group *group)
 static inline bool
 group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_h_nr_running < sgs->group_weight)
+	if (sgs->sum_nr_running < sgs->group_weight)
 		return true;
 
 	if ((sgs->group_capacity * 100) >
@@ -8028,7 +8029,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_h_nr_running <= sgs->group_weight)
+	if (sgs->sum_nr_running <= sgs->group_weight)
 		return false;
 
 	if ((sgs->group_capacity * 100) <
@@ -8132,6 +8133,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
+		sgs->sum_nr_running += nr_running;
+
 		if (nr_running > 1)
 			*sg_status |= SG_OVERLOAD;
 
@@ -8480,7 +8483,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			 * groups.
 			 */
 			env->balance_type = migrate_task;
-			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+			env->imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1;
 			return;
 		}
 
@@ -8643,7 +8646,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 
 	/* Try to move all excess tasks to child's sibling domain */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded &&
-- 
2.7.4



* [PATCH v2 6/8] sched/fair: use load instead of runnable load
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (4 preceding siblings ...)
  2019-08-01 14:40 ` [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-06 16:07   ` Peter Zijlstra
  2019-08-01 14:40 ` [PATCH v2 7/8] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Runnable load was introduced to take into account the case where blocked
load biases the load balance decision: an underutilized group with a huge
blocked load could be selected over other groups that were overloaded.

The load is now only used when groups are overloaded. In this case,
it's worth being conservative and taking into account the sleeping
tasks that might wake up on the CPU.
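
For illustration only, here is a minimal toy model (plain C, not the
kernel's PELT code, all values invented) of the distinction: runnable load
only sums the queued tasks, while load also keeps the contribution of
recently blocked tasks that may wake up here, which is the conservative
choice once the group is overloaded.

#include <stdio.h>

struct toy_task {
	unsigned long load_avg;	/* PELT-style load contribution */
	int queued;		/* still enqueued on the runqueue? */
};

int main(void)
{
	/* one queued task plus one task that recently blocked (is sleeping) */
	struct toy_task tasks[2] = { { 512, 1 }, { 700, 0 } };
	unsigned long load = 0, runnable_load = 0;

	for (int i = 0; i < 2; i++) {
		load += tasks[i].load_avg;	/* counts blocked contribution */
		if (tasks[i].queued)
			runnable_load += tasks[i].load_avg;
	}

	/* runnable load under-reports what will run here once the task wakes */
	printf("runnable_load=%lu load=%lu\n", runnable_load, load);
	return 0;
}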

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f05f1ad..dfaf0b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5445,6 +5445,11 @@ static unsigned long cpu_runnable_load(struct rq *rq)
 	return cfs_rq_runnable_load_avg(&rq->cfs);
 }
 
+static unsigned long cpu_load(struct rq *rq)
+{
+	return cfs_rq_load_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -5540,7 +5545,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
-	this_eff_load = cpu_runnable_load(cpu_rq(this_cpu));
+	this_eff_load = cpu_load(cpu_rq(this_cpu));
 
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
@@ -5558,7 +5563,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		this_eff_load *= 100;
 	this_eff_load *= capacity_of(prev_cpu);
 
-	prev_eff_load = cpu_runnable_load(cpu_rq(prev_cpu));
+	prev_eff_load = cpu_load(cpu_rq(prev_cpu));
 	prev_eff_load -= task_load;
 	if (sched_feat(WA_BIAS))
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
@@ -5646,7 +5651,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_span(group)) {
-			load = cpu_runnable_load(cpu_rq(i));
+			load = cpu_load(cpu_rq(i));
 			runnable_load += load;
 
 			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
@@ -5787,7 +5792,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 				continue;
 			}
 
-			load = cpu_runnable_load(cpu_rq(i));
+			load = cpu_load(cpu_rq(i));
 			if (load < min_load) {
 				min_load = load;
 				least_loaded_cpu = i;
@@ -8128,7 +8133,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
 			env->flags |= LBF_NOHZ_AGAIN;
 
-		sgs->group_load += cpu_runnable_load(rq);
+		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
@@ -8569,7 +8574,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	init_sd_lb_stats(&sds);
 
 	/*
-	 * Compute the various statistics relavent for load balancing at
+	 * Compute the various statistics relevant for load balancing at
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
@@ -8748,10 +8753,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
 		case migrate_load:
 			/*
-			 * When comparing with load imbalance, use cpu_runnable_load()
+			 * When comparing with load imbalance, use cpu_load()
 			 * which is not scaled with the CPU capacity.
 			 */
-			load = cpu_runnable_load(rq);
+			load = cpu_load(rq);
 
 			if (nr_running == 1 && load > env->imbalance &&
 			    !check_cpu_capacity(rq, env->sd))
@@ -8759,7 +8764,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
 			/*
 			 * For the load comparisons with the other CPU's, consider
-			 * the cpu_runnable_load() scaled with the CPU capacity, so
+			 * the cpu_load() scaled with the CPU capacity, so
 			 * that the load can be moved away from the CPU that is
 			 * potentially running at a lower capacity.
 			 *
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 7/8] sched/fair: evenly spread tasks when not overloaded
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (5 preceding siblings ...)
  2019-08-01 14:40 ` [PATCH v2 6/8] sched/fair: use load instead of runnable load Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-01 14:40 ` [PATCH v2 8/8] sched/fair: use utilization to select misfit task Vincent Guittot
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

When there is only 1 CPU per group, using the idle CPUs to evenly spread
tasks doesn't make sense and nr_running is a better metric.
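
As a side note, a toy illustration (plain C, invented numbers, only loosely
mirroring the checks used in the series) of why the idle-CPU comparison
degenerates with 1-CPU groups: idle_cpus can only be 0 or 1 there, so the
old check almost always reports "balanced" even when one CPU queues several
tasks while its neighbour runs a single one.

#include <stdio.h>

int main(void)
{
	/*
	 * busiest group: one CPU running 3 tasks; local group: one CPU
	 * running 1 task - both groups have a single CPU (weight 1)
	 */
	unsigned int local_idle = 0, busiest_idle = 0;
	unsigned int local_nr = 1, busiest_nr = 3;

	/* old check: no imbalance as long as the idle-CPU diff is <= 1 */
	printf("idle_cpus check says balanced: %d\n",
	       local_idle <= busiest_idle + 1);

	/* a task-count check sees the imbalance */
	printf("nr_running check says balanced: %d\n",
	       !(busiest_nr > local_nr + 1));
	return 0;
}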

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dfaf0b8..53e64a7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8654,18 +8654,34 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;
 
-	if (busiest->group_type != group_overloaded &&
-	     (env->idle == CPU_NOT_IDLE ||
-	      local->idle_cpus <= (busiest->idle_cpus + 1)))
-		/*
-		 * If the busiest group is not overloaded
-		 * and there is no imbalance between this and busiest group
-		 * wrt idle CPUs, it is balanced. The imbalance
-		 * becomes significant if the diff is greater than 1 otherwise
-		 * we might end up to just move the imbalance on another
-		 * group.
-		 */
-		goto out_balanced;
+	if (busiest->group_type != group_overloaded) {
+		if (env->idle == CPU_NOT_IDLE)
+			/*
+			 * If the busiest group is not overloaded (and as a
+			 * result the local one too) but this cpu is already
+			 * busy, let another idle cpu try to pull task.
+			 */
+			goto out_balanced;
+
+		if (busiest->group_weight > 1 &&
+		    local->idle_cpus <= (busiest->idle_cpus + 1))
+			/*
+			 * If the busiest group is not overloaded
+			 * and there is no imbalance between this and busiest
+			 * group wrt idle CPUs, it is balanced. The imbalance
+			 * becomes significant if the diff is greater than 1
+			 * otherwise we might end up to just move the imbalance
+			 * on another group. Of course this applies only if
+			 * there is more than 1 CPU per group.
+			 */
+			goto out_balanced;
+
+		if (busiest->sum_h_nr_running == 1)
+			/*
+			 * busiest doesn't have any tasks waiting to run
+			 */
+			goto out_balanced;
+	}
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 8/8] sched/fair: use utilization to select misfit task
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (6 preceding siblings ...)
  2019-08-01 14:40 ` [PATCH v2 7/8] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
@ 2019-08-01 14:40 ` Vincent Guittot
  2019-08-01 16:27   ` Valentin Schneider
  2019-08-02 12:56   ` [PATCH v3] " Vincent Guittot
  2019-08-29 19:23 ` [PATCH v2 0/8] sched/fair: rework the CFS load balance Phil Auld
       [not found] ` <20190809052124.13016-1-hdanton@sina.com>
  9 siblings, 2 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-01 14:40 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Utilization is used to detect a misfit task, but the load is then used to
select the task on the CPU, which can lead to selecting a small task with
a high weight instead of the task that triggered the misfit migration.
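
A self-contained sketch (plain C, invented values, not kernel code) of the
mismatch: a mostly-sleeping task with a large weight can report a higher
h_load than the busy loop that actually triggered the misfit migration, so
picking by load can pull the wrong task.

#include <stdio.h>

struct toy_task {
	const char *name;
	unsigned long h_load;	/* weighted load, depends on nice/cgroup */
	unsigned long util_est;	/* utilization estimate */
};

int main(void)
{
	struct toy_task tasks[2] = {
		{ "small-but-heavy", 2048, 100 },	/* boosted weight, mostly sleeps */
		{ "busy-loop",       1024, 900 },	/* the actual misfit task */
	};
	struct toy_task *by_load = &tasks[0], *by_util = &tasks[0];

	for (int i = 1; i < 2; i++) {
		if (tasks[i].h_load > by_load->h_load)
			by_load = &tasks[i];
		if (tasks[i].util_est > by_util->util_est)
			by_util = &tasks[i];
	}

	printf("picked by load: %s\n", by_load->name);	/* small-but-heavy */
	printf("picked by util: %s\n", by_util->name);	/* busy-loop */
	return 0;
}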

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 28 ++++++++++++++--------------
 kernel/sched/sched.h |  2 +-
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e64a7..d08cc12 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3817,16 +3817,16 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 		return;
 
 	if (!p) {
-		rq->misfit_task_load = 0;
+		rq->misfit_task_util = 0;
 		return;
 	}
 
 	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
-		rq->misfit_task_load = 0;
+		rq->misfit_task_util = 0;
 		return;
 	}
 
-	rq->misfit_task_load = task_h_load(p);
+	rq->misfit_task_util = task_util_est(p);
 }
 
 #else /* CONFIG_SMP */
@@ -7487,14 +7487,14 @@ static int detach_tasks(struct lb_env *env)
 			break;
 
 		case migrate_misfit:
-			load = task_h_load(p);
+			util = task_util_est(p);
 
 			/*
 			 * utilization of misfit task might decrease a bit
 			 * since it has been recorded. Be conservative in the
 			 * condition.
 			 */
-			if (load < env->imbalance)
+			if (2*util < env->imbalance)
 				goto next;
 
 			env->imbalance = 0;
@@ -7785,7 +7785,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_capacity; /* tasks should be move to preferred cpu */
-	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+	unsigned long group_misfit_task_util; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -7959,7 +7959,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
  */
 static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
 {
-	return rq->misfit_task_load &&
+	return rq->misfit_task_util &&
 		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
 		 check_cpu_capacity(rq, sd));
 }
@@ -8078,7 +8078,7 @@ group_type group_classify(struct lb_env *env,
 	if (sgs->group_asym_capacity)
 		return group_asym_capacity;
 
-	if (sgs->group_misfit_task_load)
+	if (sgs->group_misfit_task_util)
 		return group_misfit_task;
 
 	if (!group_has_capacity(env, sgs))
@@ -8164,8 +8164,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		/* Check for a misfit task on the cpu */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load) {
-			sgs->group_misfit_task_load = rq->misfit_task_load;
+		    sgs->group_misfit_task_util < rq->misfit_task_util) {
+			sgs->group_misfit_task_util = rq->misfit_task_util;
 			*sg_status |= SG_OVERLOAD;
 		}
 	}
@@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		 * If we have more than one misfit sg go with the
 		 * biggest misfit.
 		 */
-		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+		if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
 			return false;
 		break;
 
@@ -8458,7 +8458,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_misfit_task) {
 		/* Set imbalance to allow misfit task to be balanced. */
 		env->balance_type = migrate_misfit;
-		env->imbalance = busiest->group_misfit_task_load;
+		env->imbalance = busiest->group_misfit_task_util;
 		return;
 	}
 
@@ -8801,8 +8801,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
 			 * seek the "biggest" misfit task.
 			 */
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
+			if (rq->misfit_task_util > busiest_util) {
+				busiest_util = rq->misfit_task_util;
 				busiest = rq;
 			}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7583fad..ef6e1b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -916,7 +916,7 @@ struct rq {
 
 	unsigned char		idle_balance;
 
-	unsigned long		misfit_task_load;
+	unsigned long		misfit_task_util;
 
 	/* For active balancing */
 	int			active_balance;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 8/8] sched/fair: use utilization to select misfit task
  2019-08-01 14:40 ` [PATCH v2 8/8] sched/fair: use utilization to select misfit task Vincent Guittot
@ 2019-08-01 16:27   ` Valentin Schneider
  2019-08-02  8:29     ` Vincent Guittot
  2019-08-02 12:56   ` [PATCH v3] " Vincent Guittot
  1 sibling, 1 reply; 31+ messages in thread
From: Valentin Schneider @ 2019-08-01 16:27 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen

On 01/08/2019 15:40, Vincent Guittot wrote:
> @@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		 * If we have more than one misfit sg go with the
>  		 * biggest misfit.
>  		 */
> -		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> +		if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
>  			return false;

I previously said this change would render the maximization useless, but I
had forgotten one thing: with PELT time scaling, task utilization can go
above its CPU's capacity.

So if you have two LITTLE CPUs running a busy loop (misfit task) each, the
one that's been running the longest would have the highest utilization
(assuming they haven't reached 1024 yet). In other words "utilizations
above the capacity_margin can't be compared" doesn't really stand.
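
To put a rough number on that, here is a toy recurrence (plain C, not the
kernel's PELT code; the half-life, capacity and time scaling are simplified
and the values invented) showing a busy loop on a small CPU ramping well
past that CPU's capacity:

#include <stdio.h>
#include <math.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* PELT-like decay, 32ms half-life */
	const double cap = 446.0;		/* made-up LITTLE capacity */
	const double scale = cap / 1024.0;	/* time scaling while running */
	double util = 0.0;

	for (int ms = 1; ms <= 400; ms++) {
		/* accrue and decay over capacity-scaled time */
		util = util * pow(y, scale) + 1024.0 * (1.0 - pow(y, scale));
		if (ms % 100 == 0)
			printf("%3dms of busy loop: util=%4.0f (capacity %.0f)\n",
			       ms, util, cap);
	}
	return 0;
}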

Still, maximizing load would lead us there. Furthermore, if we have to pick
between two rqs with misfit tasks, I still believe we should pick the one
with the highest load, not the highest utilization.

We could keep load and fix the issue of detaching the wrong task with
something like:

-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e64a7b2ae0..bfc2b624ee98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7489,12 +7489,8 @@ static int detach_tasks(struct lb_env *env)
                case migrate_misfit:
                        load = task_h_load(p);
 
-                       /*
-                        * utilization of misfit task might decrease a bit
-                        * since it has been recorded. Be conservative in the
-                        * condition.
-                        */
-                       if (load < env->imbalance)
+                       /* This is not a misfit task */
+                       if (task_fits_capacity(p, capacity_of(env->src_cpu)))
                                goto next;
 
                        env->imbalance = 0;
----->8-----

However what would be *even* better IMO would be:

-----8<-----
@@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
                        return 1;
        }
 
+       /* XXX: make sure current is still a misfit task? */
        if (env->balance_type == migrate_misfit)
                return 1;
 
@@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
        env.src_rq = busiest;
 
        ld_moved = 0;
+
+       /*
+        * Misfit tasks are only misfit if they are currently running, see
+        * update_misfit_status().
+        *
+        * - If they're not running, we'll get an opportunity at wakeup to
+        *   migrate them to a bigger CPU.
+        * - If they're running, we need to active balance them to a bigger CPU.
+        *
+        * Do the smart thing and go straight for active balance.
+        */
+       if (env->balance_type == migrate_misfit)
+               goto active_balance;
+
        if (busiest->nr_running > 1) {
                /*
                 * Attempt to move tasks. If find_busiest_group has found
@@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                        goto out_all_pinned;
                }
        }
-
+active_balance:
        if (!ld_moved) {
                schedstat_inc(sd->lb_failed[idle]);
                /*
----->8-----

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 8/8] sched/fair: use utilization to select misfit task
  2019-08-01 16:27   ` Valentin Schneider
@ 2019-08-02  8:29     ` Vincent Guittot
  2019-08-02 10:49       ` Valentin Schneider
  0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2019-08-02  8:29 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On Thu, 1 Aug 2019 at 18:27, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> > @@ -8261,7 +8261,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >                * If we have more than one misfit sg go with the
> >                * biggest misfit.
> >                */
> > -             if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> > +             if (sgs->group_misfit_task_util < busiest->group_misfit_task_util)
> >                       return false;
>
> I previously said this change would render the maximization useless, but I
> had forgotten one thing: with PELT time scaling, task utilization can go
> above its CPU's capacity.
>
> So if you have two LITTLE CPUs running a busy loop (misfit task) each, the
> one that's been running the longest would have the highest utilization
> (assuming they haven't reached 1024 yet). In other words "utilizations
> above the capacity_margin can't be compared" doesn't really stand.
>
> Still, maximizing load would lead us there. Furthermore, if we have to pick
> between two rqs with misfit tasks, I still believe we should pick the one
> with the highest load, not the highest utilization.
>
> We could keep load and fix the issue of detaching the wrong task with
> something like:
>
> -----8<-----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 53e64a7b2ae0..bfc2b624ee98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7489,12 +7489,8 @@ static int detach_tasks(struct lb_env *env)
>                 case migrate_misfit:
>                         load = task_h_load(p);
>
> -                       /*
> -                        * utilization of misfit task might decrease a bit
> -                        * since it has been recorded. Be conservative in the
> -                        * condition.
> -                        */
> -                       if (load < env->imbalance)
> +                       /* This is not a misfit task */
> +                       if (task_fits_capacity(p, capacity_of(env->src_cpu)))
>                                 goto next;

This could be a solution to make sure to pull only misfit tasks and
keep using load

>
>                         env->imbalance = 0;
> ----->8-----
>
> However what would be *even* better IMO would be:
>
> -----8<-----
> @@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
>                         return 1;
>         }
>
> +       /* XXX: make sure current is still a misfit task? */
>         if (env->balance_type == migrate_misfit)
>                 return 1;
>
> @@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>         env.src_rq = busiest;
>
>         ld_moved = 0;
> +
> +       /*
> +        * Misfit tasks are only misfit if they are currently running, see
> +        * update_misfit_status().
> +        *
> +        * - If they're not running, we'll get an opportunity at wakeup to
> +        *   migrate them to a bigger CPU.
> +        * - If they're running, we need to active balance them to a bigger CPU.
> +        *
> +        * Do the smart thing and go straight for active balance.
> +        */
> +       if (env->balance_type == migrate_misfit)
> +               goto active_balance;
> +

This looks ugly and adds a new bypass, which this patchset tries to remove.
This doesn't work if your misfit task has been preempted by another
one during the load balance and is waiting in the runqueue


>         if (busiest->nr_running > 1) {
>                 /*
>                  * Attempt to move tasks. If find_busiest_group has found
> @@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                         goto out_all_pinned;
>                 }
>         }
> -
> +active_balance:
>         if (!ld_moved) {
>                 schedstat_inc(sd->lb_failed[idle]);
>                 /*
> ----->8-----

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 8/8] sched/fair: use utilization to select misfit task
  2019-08-02  8:29     ` Vincent Guittot
@ 2019-08-02 10:49       ` Valentin Schneider
  0 siblings, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-08-02 10:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef

On 02/08/2019 09:29, Vincent Guittot wrote:
>> However what would be *even* better IMO would be:
>>
>> -----8<-----
>> @@ -8853,6 +8853,7 @@ voluntary_active_balance(struct lb_env *env)
>>                         return 1;
>>         }
>>
>> +       /* XXX: make sure current is still a misfit task? */
>>         if (env->balance_type == migrate_misfit)
>>                 return 1;
>>
>> @@ -8966,6 +8967,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>         env.src_rq = busiest;
>>
>>         ld_moved = 0;
>> +
>> +       /*
>> +        * Misfit tasks are only misfit if they are currently running, see
>> +        * update_misfit_status().
>> +        *
>> +        * - If they're not running, we'll get an opportunity at wakeup to
>> +        *   migrate them to a bigger CPU.
>> +        * - If they're running, we need to active balance them to a bigger CPU.
>> +        *
>> +        * Do the smart thing and go straight for active balance.
>> +        */
>> +       if (env->balance_type == migrate_misfit)
>> +               goto active_balance;
>> +
> 
> This looks ugly and add a new bypass which this patchset tries to remove
> This doesn't work if your misfit task has been preempted by another
> one during the load balance and waiting for the runqueue

I won't comment on aesthetics, but when it comes to preempted misfit tasks
do consider that a task *has* to have a utilization >= 80% of its CPU's
capacity to be flagged as misfit.
If it's a busy loop, it can only be preempted ~20% of the time to still be
flagged as a misfit task, so going straight for active balance will be the
right thing to do in the majority of cases.
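
For reference, the 80% figure comes from the capacity margin used by the
fits-capacity check; a tiny standalone sketch of it (assuming the mainline
capacity_margin of 1280, i.e. a task fits while its utilization stays below
~80% of the CPU's capacity):

#include <stdio.h>

/* mirrors task_fits_capacity(): capacity * 1024 > util * capacity_margin */
static int fits(unsigned long util, unsigned long capacity)
{
	return capacity * 1024 > util * 1280;
}

int main(void)
{
	unsigned long little_cap = 446;	/* example LITTLE CPU capacity */

	/* 80% of 446 is ~357: anything above that gets flagged as misfit */
	printf("util 300 fits: %d\n", fits(300, little_cap));
	printf("util 400 fits: %d\n", fits(400, little_cap));
	return 0;
}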

What we gain from doing that is we save ourselves from (potentially
needlessly) iterating over the rq's tasks. That's less work and less rq
lock contention.

To put a bit of contrast on this, Qais did some profiling of usual mobile
workloads on a regular 4+4 big.LITTLE smartphone and IIRC the rq depth rose
very rarely above 5, although the tail did reach ~100 tasks. So most of the
time it would be fine to go through the detach_tasks() path.

This all deserves some actual benchmarking, I'll give it a shot.

>>         if (busiest->nr_running > 1) {
>>                 /*
>>                  * Attempt to move tasks. If find_busiest_group has found
>> @@ -9074,7 +9089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                         goto out_all_pinned;
>>                 }
>>         }
>> -
>> +active_balance:
>>         if (!ld_moved) {
>>                 schedstat_inc(sd->lb_failed[idle]);
>>                 /*
>> ----->8-----

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3] sched/fair: use utilization to select misfit task
  2019-08-01 14:40 ` [PATCH v2 8/8] sched/fair: use utilization to select misfit task Vincent Guittot
  2019-08-01 16:27   ` Valentin Schneider
@ 2019-08-02 12:56   ` Vincent Guittot
  2019-08-02 14:27     ` Valentin Schneider
  2019-08-05 11:01     ` Valentin Schneider
  1 sibling, 2 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-02 12:56 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, Vincent Guittot

Utilization is used to detect a misfit task, but the load is then used to
select the task on the CPU, which can lead to selecting a small task with
a high weight instead of the task that triggered the misfit migration.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
Keep tracking load instead of utilization but check that the
task doesn't fit the CPU's capacity when selecting the task to detach, as
suggested by Valentin

 kernel/sched/fair.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e64a7..8496118 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7487,15 +7487,9 @@ static int detach_tasks(struct lb_env *env)
 			break;
 
 		case migrate_misfit:
-			load = task_h_load(p);
-
-			/*
-			 * utilization of misfit task might decrease a bit
-			 * since it has been recorded. Be conservative in the
-			 * condition.
-			 */
-			if (load < env->imbalance)
-				goto next;
+			/* This is not a misfit task */
+                       if (task_fits_capacity(p, capacity_of(env->src_cpu)))
+			       goto next;
 
 			env->imbalance = 0;
 			break;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] sched/fair: use utilization to select misfit task
  2019-08-02 12:56   ` [PATCH v3] " Vincent Guittot
@ 2019-08-02 14:27     ` Valentin Schneider
  2019-08-05 11:01     ` Valentin Schneider
  1 sibling, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-08-02 14:27 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, Thomas Gleixner

On 02/08/2019 13:56, Vincent Guittot wrote:
> Utilization is used to detect a misfit task, but the load is then used to
> select the task on the CPU, which can lead to selecting a small task with
> a high weight instead of the task that triggered the misfit migration.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> Keep tracking load instead of utilization but check that the
> task doesn't fit the CPU's capacity when selecting the task to detach, as
> suggested by Valentin
> 

I find that one much better :) The very same fix could be applied
regardless of the rework, so FWIW (providing the changelog gets a bit
clearer - maybe squash the comment in it):
Acked-by: Valentin Schneider <valentin.schneider@arm.com>


One more thing though: we actually face the same problem in active balance,
i.e. the misfit task could sleep/get preempted before the CPU stopper
actually runs, and the latter would migrate $random.

I was thinking of passing the lb_env to stop_one_cpu_nowait() - we'd have
to rebuild it in active_load_balance_cpu_stop() anyway, but we could copy
things like env.migration_type so that we remember that we should be
picking a misfit task and have a similar detach guard.

Sadly, the lb_env struct is on load_balance()'s stack, and that's a nowait
call :/

Peter/Thomas, would there be much hate in adding some sort of flags field
to struct cpu_stop_work? Or if you see a better alternative, I'm all ears.


>  kernel/sched/fair.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 53e64a7..8496118 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7487,15 +7487,9 @@ static int detach_tasks(struct lb_env *env)
>  			break;
>  
>  		case migrate_misfit:
> -			load = task_h_load(p);
> -
> -			/*
> -			 * utilization of misfit task might decrease a bit
> -			 * since it has been recorded. Be conservative in the
> -			 * condition.
> -			 */
> -			if (load < env->imbalance)
> -				goto next;
> +			/* This is not a misfit task */
> +                       if (task_fits_capacity(p, capacity_of(env->src_cpu)))
> +			       goto next;
>  
>  			env->imbalance = 0;
>  			break;
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] sched/fair: use utilization to select misfit task
  2019-08-02 12:56   ` [PATCH v3] " Vincent Guittot
  2019-08-02 14:27     ` Valentin Schneider
@ 2019-08-05 11:01     ` Valentin Schneider
  1 sibling, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-08-05 11:01 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen

On 02/08/2019 13:56, Vincent Guittot wrote:
> Utilization is used to detect a misfit task, but the load is then used to
> select the task on the CPU, which can lead to selecting a small task with
> a high weight instead of the task that triggered the misfit migration.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> Keep tracking load instead of utilization but check that the
> task doesn't fit the CPU's capacity when selecting the task to detach, as
> suggested by Valentin
> 
>  kernel/sched/fair.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 53e64a7..8496118 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7487,15 +7487,9 @@ static int detach_tasks(struct lb_env *env)
>  			break;
>  
>  		case migrate_misfit:
> -			load = task_h_load(p);
> -
> -			/*
> -			 * utilization of misfit task might decrease a bit
> -			 * since it has been recorded. Be conservative in the
> -			 * condition.
> -			 */
> -			if (load < env->imbalance)
> -				goto next;
> +			/* This is not a misfit task */
> +                       if (task_fits_capacity(p, capacity_of(env->src_cpu)))
   ^^^^^^^^^^^^^^^^^^^^^^^
Just noticed those are spaces for some reason (I think my original diff is
to blame)

> +			       goto next;
>  
>  			env->imbalance = 0;
>  			break;
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-01 14:40 ` [PATCH v2 4/8] sched/fair: rework load_balance Vincent Guittot
@ 2019-08-05 17:07   ` Valentin Schneider
  2019-08-26  9:26     ` Vincent Guittot
  2019-08-06 15:56   ` Peter Zijlstra
  2019-08-06 17:17   ` Valentin Schneider
  2 siblings, 1 reply; 31+ messages in thread
From: Valentin Schneider @ 2019-08-05 17:07 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen

Hi Vincent,

Here's another batch of comments, still need to go through some more of it.

On 01/08/2019 15:40, Vincent Guittot wrote:
> The load_balance algorithm contains some heuristics which have becomes

s/becomes/become/

> meaningless since the rework of metrics and the introduction of PELT.
                               ^^^^^^^^^^
Which metrics? I suppose you mean the s*_lb_stats structs?

> 
> Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> because everything is based on load whereas some imbalances are not
> related to the load.

Hmm, well, they don't start as wrong decisions, right? It's just that
certain imbalance scenarios can't be solved by looking only at load.

What about something along those lines?

"""
Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios. For instance, in the presence of idle CPUs,
we should simply try to get at least one task per CPU, whereas the current
load-based algorithm can actually leave idle CPUs alone simply because the
load is somewhat balanced.
"""

> The current algorithm ends up to create virtual and

s/to create/creating/

> meaningless value like the avg_load_per_task or tweaks the state of a
> group to make it overloaded whereas it's not, in order to try to migrate
> tasks.
> 
> load_balance should better qualify the imbalance of the group and define
> cleary what has to be moved to fix this imbalance.

s/define cleary/clearly define/

> 
> The type of sched_group has been extended to better reflect the type of
> imbalance. We now have :
> 	group_has_spare
> 	group_fully_busy
> 	group_misfit_task
> 	group_asym_capacity
> 	group_imbalanced
> 	group_overloaded
> 
> Based on the type of sched_group, load_balance now sets what it wants to
> move in order to fix the imnbalance. It can be some load as before but

s/imnbalance/imbalance/

> also some utilization, a number of task or a type of task:
> 	migrate_task
> 	migrate_util
> 	migrate_load
> 	migrate_misfit
> 
> This new load_balance algorithm fixes several pending wrong tasks
> placement:
> - the 1 task per CPU case with asymetrics system

s/asymetrics/asymmetric/

> - the case of cfs task preempted by other class
> - the case of tasks not evenly spread on groups with spare capacity
> 
> The load balance decisions have been gathered in 3 functions:
> - update_sd_pick_busiest() select the busiest sched_group.
> - find_busiest_group() checks if there is an imabalance between local and

s/imabalance/imbalance/

>   busiest group.
> - calculate_imbalance() decides what have to be moved.

That's nothing new, is it? I think what you mean there is that the
decisions have been consolidated in those areas, rather than being spread
all over the place.

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 581 ++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 379 insertions(+), 202 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7f4a7e..a8681c3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7136,13 +7136,28 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>  
>  enum fbq_type { regular, remote, all };
>  
> +/*
> + * group_type describes the group of CPUs at the moment of the load balance.
> + * The enum is ordered by pulling priority, with the group with lowest priority
> + * first so the groupe_type can be simply compared when selecting the busiest
> + * group. see update_sd_pick_busiest().
> + */
>  enum group_type {
> -	group_other = 0,
> +	group_has_spare = 0,
> +	group_fully_busy,
>  	group_misfit_task,
> +	group_asym_capacity,

That one got me confused - it's about asym packing, not asym capacity, and
the name should reflect that. I'd vote for group_asym_packing to stay in
line with what Quentin did for the sd shortcut pointers in

  011b27bb5d31 ("sched/topology: Add lowest CPU asymmetry sched_domain level pointer")

>  	group_imbalanced,
>  	group_overloaded,
>  };
>  
> +enum migration_type {
> +	migrate_task = 0,
> +	migrate_util,
> +	migrate_load,
> +	migrate_misfit,

nitpicking here: AFAICT the ordering of this doesn't really matter,
could we place migrate_misfit next to migrate_task? As you call out in the
header, we can migrate a number of tasks or a type of task, so these should
be somewhat together.

If we're afraid that we'll add other types of tasks later on and that this
won't result in a neat append-to-the-end, we could reverse the order like
so:

migrate_load
migrate_util
migrate_task
migrate_misfit

[...]
> @@ -7745,10 +7793,10 @@ struct sg_lb_stats {
>  struct sd_lb_stats {
>  	struct sched_group *busiest;	/* Busiest group in this sd */
>  	struct sched_group *local;	/* Local group in this sd */
> -	unsigned long total_running;

Could be worth calling out in the log that this gets snipped out. Or it
could go into its own small cleanup patch, since it's just an unused field.

[...]> @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	if (sgs->group_type < busiest->group_type)
>  		return false;
>  
> -	if (sgs->avg_load <= busiest->avg_load)
> +	/*
> +	 * The candidate and the current busiest group are the same type of
> +	 * group. Let check which one is the busiest according to the type.
> +	 */
> +
> +	switch (sgs->group_type) {
> +	case group_overloaded:
> +		/* Select the overloaded group with highest avg_load. */
> +		if (sgs->avg_load <= busiest->avg_load)
> +			return false;
> +		break;
> +
> +	case group_imbalanced:
> +		/* Select the 1st imbalanced group as we don't have
> +		 * any way to choose one more than another
> +		 */
>  		return false;
> +		break;

You already have an unconditional return above.

>  
> -	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> -		goto asym_packing;
> +	case group_asym_capacity:
> +		/* Prefer to move from lowest priority CPU's work */
> +		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> +			 return false;
                        ^
		Stray whitespace

[...]
> @@ -8438,17 +8581,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	local = &sds.local_stat;
>  	busiest = &sds.busiest_stat;
>  
> -	/* ASYM feature bypasses nice load balance check */
> -	if (busiest->group_asym_capacity)
> -		goto force_balance;
> -
>  	/* There is no busy sibling group to pull tasks from */
>  	if (!sds.busiest || busiest->sum_h_nr_running == 0)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That can go, since you now filter this in update_sd_pick_busiest()

[...]
> @@ -8459,59 +8602,71 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		goto force_balance;
>  
>  	/*
> -	 * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
> -	 * capacities from resulting in underutilization due to avg_load.
> -	 */
> -	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> -	    busiest->group_no_capacity)
> -		goto force_balance;
> -
> -	/* Misfit tasks should be dealt with regardless of the avg load */
> -	if (busiest->group_type == group_misfit_task)
> -		goto force_balance;
> -
> -	/*
>  	 * If the local group is busier than the selected busiest group
>  	 * don't try and pull any tasks.
>  	 */
> -	if (local->avg_load >= busiest->avg_load)
> +	if (local->group_type > busiest->group_type)
>  		goto out_balanced;
>  
>  	/*
> -	 * Don't pull any tasks if this group is already above the domain
> -	 * average load.
> +	 * When groups are overloaded, use the avg_load to ensure fairness
> +	 * between tasks.
>  	 */
> -	if (local->avg_load >= sds.avg_load)
> -		goto out_balanced;
> +	if (local->group_type == group_overloaded) {
> +		/*
> +		 * If the local group is more loaded than the selected
> +		 * busiest group don't try and pull any tasks.
> +		 */
> +		if (local->avg_load >= busiest->avg_load)
> +			goto out_balanced;
> +
> +		/* XXX broken for overlapping NUMA groups */
> +		sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
> +				sds.total_capacity;
>  
> -	if (env->idle == CPU_IDLE) {
>  		/*
> -		 * This CPU is idle. If the busiest group is not overloaded
> -		 * and there is no imbalance between this and busiest group
> -		 * wrt idle CPUs, it is balanced. The imbalance becomes
> -		 * significant if the diff is greater than 1 otherwise we
> -		 * might end up to just move the imbalance on another group
> +		 * Don't pull any tasks if this group is already above the
> +		 * domain average load.
>  		 */
> -		if ((busiest->group_type != group_overloaded) &&
> -				(local->idle_cpus <= (busiest->idle_cpus + 1)))
> +		if (local->avg_load >= sds.avg_load)
>  			goto out_balanced;
> -	} else {
> +
>  		/*
> -		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> -		 * imbalance_pct to be conservative.
> +		 * If the busiest group is more loaded, use imbalance_pct to be
> +		 * conservative.
>  		 */
>  		if (100 * busiest->avg_load <=
>  				env->sd->imbalance_pct * local->avg_load)
>  			goto out_balanced;
> +
>  	}
>  
> +	/* Try to move all excess tasks to child's sibling domain */
> +	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> +	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> +		goto force_balance;
> +
> +	if (busiest->group_type != group_overloaded &&
> +	     (env->idle == CPU_NOT_IDLE ||
> +	      local->idle_cpus <= (busiest->idle_cpus + 1)))
> +		/*
> +		 * If the busiest group is not overloaded
> +		 * and there is no imbalance between this and busiest group
> +		 * wrt idle CPUs, it is balanced. The imbalance
> +		 * becomes significant if the diff is greater than 1 otherwise
> +		 * we might end up to just move the imbalance on another
> +		 * group.
> +		 */
> +		goto out_balanced;
> +
>  force_balance:
>  	/* Looks like there is an imbalance. Compute it */
> -	env->src_grp_type = busiest->group_type;
>  	calculate_imbalance(env, &sds);
> +

Stray newline?

>  	return env->imbalance ? sds.busiest : NULL;
>  
>  out_balanced:
> +

Ditto?

>  	env->imbalance = 0;
>  	return NULL;
>  }
[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-01 14:40 ` [PATCH v2 4/8] sched/fair: rework load_balance Vincent Guittot
  2019-08-05 17:07   ` Valentin Schneider
@ 2019-08-06 15:56   ` Peter Zijlstra
  2019-08-26  9:31     ` Vincent Guittot
  2019-08-06 17:17   ` Valentin Schneider
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-08-06 15:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen

On Thu, Aug 01, 2019 at 04:40:20PM +0200, Vincent Guittot wrote:
> The load_balance algorithm contains some heuristics which have becomes
> meaningless since the rework of metrics and the introduction of PELT.
> 
> Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> because everything is based on load whereas some imbalances are not
> related to the load. The current algorithm ends up to create virtual and
> meaningless value like the avg_load_per_task or tweaks the state of a
> group to make it overloaded whereas it's not, in order to try to migrate
> tasks.
> 
> load_balance should better qualify the imbalance of the group and define
> cleary what has to be moved to fix this imbalance.
> 
> The type of sched_group has been extended to better reflect the type of
> imbalance. We now have :
> 	group_has_spare
> 	group_fully_busy
> 	group_misfit_task
> 	group_asym_capacity
> 	group_imbalanced
> 	group_overloaded
> 
> Based on the type of sched_group, load_balance now sets what it wants to
> move in order to fix the imnbalance. It can be some load as before but
> also some utilization, a number of task or a type of task:
> 	migrate_task
> 	migrate_util
> 	migrate_load
> 	migrate_misfit
> 
> This new load_balance algorithm fixes several pending wrong tasks
> placement:
> - the 1 task per CPU case with asymetrics system
> - the case of cfs task preempted by other class
> - the case of tasks not evenly spread on groups with spare capacity
> 
> The load balance decisions have been gathered in 3 functions:
> - update_sd_pick_busiest() select the busiest sched_group.
> - find_busiest_group() checks if there is an imabalance between local and
>   busiest group.
> - calculate_imbalance() decides what have to be moved.

I like this lots, very good stuff.

It is weird how misfit is still load, but istr later patches cure that.

Here's a whole pile of nits, some overlap with what Valentin already
noted. His suggestions for the changelog also make sense to me.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8205,10 +8205,10 @@ static bool update_sd_pick_busiest(struc
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
-
 	/* Make sure that there is at least one task to pull */
 	if (!sgs->sum_h_nr_running)
 		return false;
+
 	/*
 	 * Don't try to pull misfit tasks we can't help.
 	 * We can use max_capacity here as reduction in capacity on some
@@ -8239,11 +8239,11 @@ static bool update_sd_pick_busiest(struc
 		break;
 
 	case group_imbalanced:
-		/* Select the 1st imbalanced group as we don't have
-		 * any way to choose one more than another
+		/*
+		 * Select the 1st imbalanced group as we don't have any way to
+		 * choose one more than another
 		 */
 		return false;
-		break;
 
 	case group_asym_capacity:
 		/* Prefer to move from lowest priority CPU's work */
@@ -8253,8 +8253,8 @@ static bool update_sd_pick_busiest(struc
 
 	case group_misfit_task:
 		/*
-		 * If we have more than one misfit sg go with the
-		 * biggest misfit.
+		 * If we have more than one misfit sg go with the biggest
+		 * misfit.
 		 */
 		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
 			return false;
@@ -8262,14 +8262,14 @@ static bool update_sd_pick_busiest(struc
 
 	case group_fully_busy:
 		/*
-		 * Select the fully busy group with highest avg_load.
-		 * In theory, there is no need to pull task from such
-		 * kind of group because tasks have all compute
-		 * capacity that they need but we can still improve the
-		 * overall throughput by reducing contention when
-		 * accessing shared HW resources.
-		 * XXX for now avg_load is not computed and always 0 so
-		 * we select the 1st one.
+		 * Select the fully busy group with highest avg_load.  In
+		 * theory, there is no need to pull task from such kind of
+		 * group because tasks have all compute capacity that they need
+		 * but we can still improve the overall throughput by reducing
+		 * contention when accessing shared HW resources.
+		 *
+		 * XXX for now avg_load is not computed and always 0 so we
+		 * select the 1st one.
 		 */
 		if (sgs->avg_load <= busiest->avg_load)
 			return false;
@@ -8277,11 +8277,11 @@ static bool update_sd_pick_busiest(struc
 
 	case group_has_spare:
 		/*
-		 * Select not overloaded group with lowest number of
-		 * idle cpus. We could also compare the spare capacity
-		 * which is more stable but it can end up that the
-		 * group has less spare capacity but finally more idle
-		 * cpus which means less opportunity to pull tasks.
+		 * Select not overloaded group with lowest number of idle cpus.
+		 * We could also compare the spare capacity which is more
+		 * stable but it can end up that the group has less spare
+		 * capacity but finally more idle cpus which means less
+		 * opportunity to pull tasks.
 		 */
 		if (sgs->idle_cpus >= busiest->idle_cpus)
 			return false;
@@ -8289,10 +8289,10 @@ static bool update_sd_pick_busiest(struc
 	}
 
 	/*
-	 * Candidate sg has no more than one task per CPU and
-	 * has higher per-CPU capacity. Migrating tasks to less
-	 * capable CPUs may harm throughput. Maximize throughput,
-	 * power/energy consequences are not considered.
+	 * Candidate sg has no more than one task per CPU and has higher
+	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+	 * throughput. Maximize throughput, power/energy consequences are not
+	 * considered.
 	 */
 	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
 	    (sgs->group_type <= group_fully_busy) &&
@@ -8428,6 +8428,7 @@ static inline void calculate_imbalance(s
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -8442,8 +8443,8 @@ static inline void calculate_imbalance(s
 
 	if (busiest->group_type == group_asym_capacity) {
 		/*
-		 * In case of asym capacity, we will try to migrate all load
-		 * to the preferred CPU
+		 * In case of asym capacity, we will try to migrate all load to
+		 * the preferred CPU
 		 */
 		env->balance_type = migrate_load;
 		env->imbalance = busiest->group_load;
@@ -8483,7 +8484,8 @@ static inline void calculate_imbalance(s
 			 * groups.
 			 */
 			env->balance_type = migrate_task;
-			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+			env->imbalance = (busiest->sum_h_nr_running -
+					  local->sum_h_nr_running) >> 1;
 			return;
 		}
 
@@ -8502,7 +8504,7 @@ static inline void calculate_imbalance(s
 	 */
 	if (local->group_type < group_overloaded) {
 		/*
-		 * Local will become overvloaded so the avg_load metrics are
+		 * Local will become overloaded so the avg_load metrics are
 		 * finally needed
 		 */
 
@@ -8514,12 +8516,12 @@ static inline void calculate_imbalance(s
 	}
 
 	/*
-	 * Both group are or will become overloaded and we're trying to get
-	 * all the CPUs to the average_load, so we don't want to push
-	 * ourselves above the average load, nor do we wish to reduce the
-	 * max loaded CPU below the average load. At the same time, we also
-	 * don't want to reduce the group load below the group capacity.
-	 * Thus we look for the minimum possible imbalance.
+	 * Both group are or will become overloaded and we're trying to get all
+	 * the CPUs to the average_load, so we don't want to push ourselves
+	 * above the average load, nor do we wish to reduce the max loaded CPU
+	 * below the average load. At the same time, we also don't want to
+	 * reduce the group load below the group capacity.  Thus we look for
+	 * the minimum possible imbalance.
 	 */
 	env->balance_type = migrate_load;
 	env->imbalance = min(
@@ -8641,7 +8643,6 @@ static struct sched_group *find_busiest_
 		if (100 * busiest->avg_load <=
 				env->sd->imbalance_pct * local->avg_load)
 			goto out_balanced;
-
 	}
 
 	/* Try to move all excess tasks to child's sibling domain */
@@ -8651,7 +8652,7 @@ static struct sched_group *find_busiest_
 
 	if (busiest->group_type != group_overloaded &&
 	     (env->idle == CPU_NOT_IDLE ||
-	      local->idle_cpus <= (busiest->idle_cpus + 1)))
+	      local->idle_cpus <= (busiest->idle_cpus + 1))) {
 		/*
 		 * If the busiest group is not overloaded
 		 * and there is no imbalance between this and busiest group
@@ -8661,15 +8662,14 @@ static struct sched_group *find_busiest_
 		 * group.
 		 */
 		goto out_balanced;
+	}
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
-
 	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
-
 	env->imbalance = 0;
 	return NULL;
 }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 6/8] sched/fair: use load instead of runnable load
  2019-08-01 14:40 ` [PATCH v2 6/8] sched/fair: use load instead of runnable load Vincent Guittot
@ 2019-08-06 16:07   ` Peter Zijlstra
  2019-08-26 15:45     ` Vincent Guittot
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-08-06 16:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen

On Thu, Aug 01, 2019 at 04:40:22PM +0200, Vincent Guittot wrote:
> Runnable load was introduced to take into account the case where blocked
> load biases the load balance decision: an underutilized group with a huge
> blocked load could be selected over other groups that were overloaded.
> 
> The load is now only used when groups are overloaded. In this case,
> it's worth being conservative and taking into account the sleeping
> tasks that might wake up on the CPU.

This one scares me a little. I have the feeling I'm missing/forgetting
something.

Also; while the regular load-balance (find-busiest) stuff is now all
aware of idle, this change also impacts wake_affine and find_idlest, and
they've not changed.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-01 14:40 ` [PATCH v2 4/8] sched/fair: rework load_balance Vincent Guittot
  2019-08-05 17:07   ` Valentin Schneider
  2019-08-06 15:56   ` Peter Zijlstra
@ 2019-08-06 17:17   ` Valentin Schneider
  2019-08-07 11:16     ` Valentin Schneider
  2019-08-26 10:11     ` Vincent Guittot
  2 siblings, 2 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-08-06 17:17 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen

Second batch, get it while it's hot...

On 01/08/2019 15:40, Vincent Guittot wrote:
[...]
> @@ -7438,19 +7453,53 @@ static int detach_tasks(struct lb_env *env)
>  		if (!can_migrate_task(p, env))
>  			goto next;
>  
> -		load = task_h_load(p);
> +		switch (env->balance_type) {
> +		case migrate_task:
> +			/* Migrate task */
> +			env->imbalance--;
> +			break;
>  
> -		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> -			goto next;
> +		case migrate_util:
> +			util = task_util_est(p);
>  
> -		if ((load / 2) > env->imbalance)
> -			goto next;
> +			if (util > env->imbalance)
> +				goto next;
> +
> +			env->imbalance -= util;
> +			break;
> +
> +		case migrate_load:
> +			load = task_h_load(p);
> +
> +			if (sched_feat(LB_MIN) &&
> +			    load < 16 && !env->sd->nr_balance_failed)
> +				goto next;
> +
> +			if ((load / 2) > env->imbalance)
> +				goto next;
> +
> +			env->imbalance -= load;
> +			break;
> +
> +		case migrate_misfit:
> +			load = task_h_load(p);
> +
> +			/*
> +			 * utilization of misfit task might decrease a bit
> +			 * since it has been recorded. Be conservative in the
> +			 * condition.
> +			 */
> +			if (load < env->imbalance)
> +				goto next;
> +
> +			env->imbalance = 0;
> +			break;
> +		}
>  
>  		detach_task(p, env);
>  		list_add(&p->se.group_node, &env->tasks);
>  
>  		detached++;
> -		env->imbalance -= load;
>  

It's missing something like this:

-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0631ff2a4bd..055563e19090 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7506,7 +7506,7 @@ static int detach_tasks(struct lb_env *env)
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * runnable load.
+		 * load/util/tasks
 		 */
 		if (env->imbalance <= 0)
 			break;
----->8-----

[...]
> @@ -8013,19 +8059,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  }
>  
>  static inline enum
> -group_type group_classify(struct sched_group *group,
> +group_type group_classify(struct lb_env *env,
> +			  struct sched_group *group,
>  			  struct sg_lb_stats *sgs)
>  {
> -	if (sgs->group_no_capacity)
> +	if (group_is_overloaded(env, sgs))
>  		return group_overloaded;
>  
>  	if (sg_imbalanced(group))
>  		return group_imbalanced;
>  
> +	if (sgs->group_asym_capacity)
> +		return group_asym_capacity;
> +
>  	if (sgs->group_misfit_task_load)
>  		return group_misfit_task;
>  
> -	return group_other;
> +	if (!group_has_capacity(env, sgs))
> +		return group_fully_busy;

If I got my booleans right, reaching group_fully_busy means
!group_is_overloaded() && !group_has_capacity(), which gives us:

- If nr_running > group_weight, then we had to have
  sgs->group_capacity * 100 == sgs->group_util * env->sd->imbalance_pct

- If nr_running == group_weight, then we had to have
  sgs->group_capacity * 100 <= sgs->group_util * env->sd->imbalance_pct

- (if nr_running < group_weight we go to group_has_spare)

That first equality seems somewhat odd considering how rarely it will
occur, but then we still want the util check to fall down to
group_has_spare when

  nr_running >= group_weight && 
  sgs->group_capacity * 100 > sgs->group_util * env->sd->imbalance_pct

Maybe we could change group_has_capacity()'s util comparison from '>' to
'>=' to avoid this weird "artefact".
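
To make that boundary concrete, here is a small standalone sketch (plain C)
with the two conditions transcribed from the group_has_capacity() /
group_is_overloaded() hunks earlier in the series; the imbalance_pct,
capacity and utilization values are made up:

#include <stdio.h>

#define IMB_PCT 117	/* an illustrative sd->imbalance_pct */

static int has_capacity(unsigned int nr, unsigned int weight,
			unsigned long cap, unsigned long util)
{
	if (nr < weight)
		return 1;
	return cap * 100 > util * IMB_PCT;
}

static int is_overloaded(unsigned int nr, unsigned int weight,
			 unsigned long cap, unsigned long util)
{
	if (nr <= weight)
		return 0;
	return cap * 100 < util * IMB_PCT;
}

int main(void)
{
	unsigned long cap = 4096, util = 3600;	/* cap * 100 < util * IMB_PCT */
	unsigned int weight = 4;

	/* nr > weight: overloaded wins unless cap * 100 == util * IMB_PCT */
	printf("nr=5: overloaded=%d has_capacity=%d\n",
	       is_overloaded(5, weight, cap, util),
	       has_capacity(5, weight, cap, util));

	/* nr == weight: this is where fully_busy actually lands */
	printf("nr=4: overloaded=%d has_capacity=%d\n",
	       is_overloaded(4, weight, cap, util),
	       has_capacity(4, weight, cap, util));
	return 0;
}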

> +
> +	return group_has_spare;
>  }
>  
>  static bool update_nohz_stats(struct rq *rq, bool force)

[...]
> @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	if (sgs->group_type < busiest->group_type)
>  		return false;
>  
> -	if (sgs->avg_load <= busiest->avg_load)
> +	/*
> +	 * The candidate and the current busiest group are the same type of
> +	 * group. Let check which one is the busiest according to the type.
> +	 */
> +
> +	switch (sgs->group_type) {
> +	case group_overloaded:
> +		/* Select the overloaded group with highest avg_load. */
> +		if (sgs->avg_load <= busiest->avg_load)
> +			return false;
> +		break;
> +
> +	case group_imbalanced:
> +		/* Select the 1st imbalanced group as we don't have
> +		 * any way to choose one more than another
> +		 */
>  		return false;
> +		break;
>  
> -	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> -		goto asym_packing;
> +	case group_asym_capacity:
> +		/* Prefer to move from lowest priority CPU's work */
> +		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> +			 return false;
> +		break;
> +
> +	case group_misfit_task:
> +		/*
> +		 * If we have more than one misfit sg go with the
> +		 * biggest misfit.
> +		 */
> +		if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> +			return false;
> +		break;
> +
> +	case group_fully_busy:
> +		/*
> +		 * Select the fully busy group with highest avg_load.
> +		 * In theory, there is no need to pull task from such
> +		 * kind of group because tasks have all compute
> +		 * capacity that they need but we can still improve the
> +		 * overall throughput by reducing contention when
> +		 * accessing shared HW resources.
> +		 * XXX for now avg_load is not computed and always 0 so
> +		 * we select the 1st one.
> +		 */

What's wrong with unconditionally computing avg_load in update_sg_lb_stats()?
We already unconditionally accumulate group_load anyway.

If it's to make way for patch 6/8 (using load instead of runnable load),
then I think you are doing things in the wrong order. IMO in this patch we
should unconditionally compute avg_load (using runnable load), and then
you could tweak it up in a subsequent patch.
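
To be concrete, the unconditional version I have in mind is just doing the
division once the per-CPU loop is done, reusing the formula from your
calculate_imbalance() hunk further down (where exactly it lands in
update_sg_lb_stats() is of course up to you):

  	/* after the per-CPU accumulation loop in update_sg_lb_stats() */
  	sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
  			sgs->group_capacity;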

> +		if (sgs->avg_load <= busiest->avg_load)
> +			return false;
> +		break;
> +
> +	case group_has_spare:
> +		/*
> +		 * Select not overloaded group with lowest number of
> +		 * idle cpus. We could also compare the spare capacity
> +		 * which is more stable but it can end up that the
> +		 * group has less spare capacity but finally more idle
> +		 * cpus which means less opportunity to pull tasks.
> +		 */
> +		if (sgs->idle_cpus >= busiest->idle_cpus)
> +			return false;
> +		break;
> +	}
>  
>  	/*
>  	 * Candidate sg has no more than one task per CPU and

[...]
> @@ -8341,69 +8421,132 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   */
>  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>  {
> -	unsigned long max_pull, load_above_capacity = ~0UL;
>  	struct sg_lb_stats *local, *busiest;
>  
>  	local = &sds->local_stat;
>  	busiest = &sds->busiest_stat;
> +	if (busiest->group_type == group_imbalanced) {
> +		/*
> +		 * In the group_imb case we cannot rely on group-wide averages
> +		 * to ensure CPU-load equilibrium, try to move any task to fix
> +		 * the imbalance. The next load balance will take care of
> +		 * balancing back the system.
> +		 */
> +		env->balance_type = migrate_task;
> +		env->imbalance = 1;
> +		return;
> +	}
>  
> -	if (busiest->group_asym_capacity) {
> +	if (busiest->group_type == group_asym_capacity) {
> +		/*
> +		 * In case of asym capacity, we will try to migrate all load
> +		 * to the preferred CPU
> +		 */
> +		env->balance_type = migrate_load;
>  		env->imbalance = busiest->group_load;
>  		return;
>  	}
>  
> +	if (busiest->group_type == group_misfit_task) {
> +		/* Set imbalance to allow misfit task to be balanced. */
> +		env->balance_type = migrate_misfit;
> +		env->imbalance = busiest->group_misfit_task_load;

AFAICT we don't ever use this value, other than setting it to 0 in
detach_tasks(), so what we actually set it to doesn't matter (as long as
it's > 0).

I'd re-suggest folding migrate_misfit into migrate_task, which is doable if
we reinstate lb_env.src_grp_type (or rather, not delete it in this patch),
though it makes some other places somewhat uglier. The good thing is that
it actually does end up being implemented as a special kind of task
migration, rather than being loosely related.

Below's a diff, only compile-tested but should help show my point.
-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a8681c32445b..b0631ff2a4bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7155,7 +7155,6 @@ enum migration_type {
 	migrate_task = 0,
 	migrate_util,
 	migrate_load,
-	migrate_misfit,
 };
 
 #define LBF_ALL_PINNED	0x01
@@ -7188,6 +7187,7 @@ struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
+	enum group_type         src_grp_type;
 	enum migration_type	balance_type;
 	struct list_head	tasks;
 };
@@ -7455,6 +7455,13 @@ static int detach_tasks(struct lb_env *env)
 
 		switch (env->balance_type) {
 		case migrate_task:
+			/* Check for special kinds of tasks */
+			if (env->src_grp_type == group_misfit_task) {
+				/* This isn't the misfit task */
+				if (task_fits_capacity(p, capacity_of(env->src_cpu)))
+					goto next;
+			}
+
 			/* Migrate task */
 			env->imbalance--;
 			break;
@@ -7480,20 +7487,6 @@ static int detach_tasks(struct lb_env *env)
 
 			env->imbalance -= load;
 			break;
-
-		case migrate_misfit:
-			load = task_h_load(p);
-
-			/*
-			 * utilization of misfit task might decrease a bit
-			 * since it has been recorded. Be conservative in the
-			 * condition.
-			 */
-			if (load < env->imbalance)
-				goto next;
-
-			env->imbalance = 0;
-			break;
 		}
 
 		detach_task(p, env);
@@ -8449,8 +8442,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 
 	if (busiest->group_type == group_misfit_task) {
 		/* Set imbalance to allow misfit task to be balanced. */
-		env->balance_type = migrate_misfit;
-		env->imbalance = busiest->group_misfit_task_load;
+		env->balance_type = migrate_task;
+		env->imbalance = 1;
 		return;
 	}
 
@@ -8661,8 +8654,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
+	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
-
 	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
@@ -8728,7 +8721,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
 		switch (env->balance_type) {
 		case migrate_task:
-			if (busiest_nr < nr_running) {
+			/*
+			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+			 * seek the "biggest" misfit task.
+			 */
+			if (env->src_grp_type == group_misfit_task &&
+			    rq->misfit_task_load > busiest_load) {
+				busiest_load = rq->misfit_task_load;
+				busiest = rq;
+			} else if (busiest_nr < nr_running) {
 				busiest_nr = nr_running;
 				busiest = rq;
 			}
@@ -8771,19 +8772,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 				busiest = rq;
 			}
 			break;
-
-		case migrate_misfit:
-			/*
-			 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
-			 * seek the "biggest" misfit task.
-			 */
-			if (rq->misfit_task_load > busiest_load) {
-				busiest_load = rq->misfit_task_load;
-				busiest = rq;
-			}
-
-			break;
-
 		}
 	}
 
@@ -8829,7 +8817,7 @@ voluntary_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->balance_type == migrate_misfit)
+	if (env->src_grp_type == group_misfit_task)
 		return 1;
 
 	return 0;
----->8-----


> +		return;
> +	}
> +

Also, I think these three busiest->group_type conditions could be turned
into a switch case.
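
i.e. something along these lines, reusing the bodies from your hunk above
(shape only):

  	switch (busiest->group_type) {
  	case group_imbalanced:
  		/* Can't rely on group-wide averages, move any one task */
  		env->balance_type = migrate_task;
  		env->imbalance = 1;
  		return;

  	case group_asym_capacity:
  		/* Try to move all load to the preferred CPU */
  		env->balance_type = migrate_load;
  		env->imbalance = busiest->group_load;
  		return;

  	case group_misfit_task:
  		/* Allow the misfit task to be balanced */
  		env->balance_type = migrate_misfit;
  		env->imbalance = busiest->group_misfit_task_load;
  		return;

  	default:
  		break;
  	}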

>  	/*
> -	 * Avg load of busiest sg can be less and avg load of local sg can
> -	 * be greater than avg load across all sgs of sd because avg load
> -	 * factors in sg capacity and sgs with smaller group_type are
> -	 * skipped when updating the busiest sg:
> +	 * Try to use spare capacity of local group without overloading it or
> +	 * emptying busiest
>  	 */
> -	if (busiest->group_type != group_misfit_task &&
> -	    (busiest->avg_load <= sds->avg_load ||
> -	     local->avg_load >= sds->avg_load)) {
> -		env->imbalance = 0;
> +	if (local->group_type == group_has_spare) {
> +		if (busiest->group_type > group_fully_busy) {
> +			/*
> +			 * If busiest is overloaded, try to fill spare
> +			 * capacity. This might end up creating spare capacity
> +			 * in busiest or busiest still being overloaded but
> +			 * there is no simple way to directly compute the
> +			 * amount of load to migrate in order to balance the
> +			 * system.
> +			 */
> +			env->balance_type = migrate_util;
> +			env->imbalance = max(local->group_capacity, local->group_util) -
> +				    local->group_util;
> +			return;
> +		}
> +
> +		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +			/*
> +			 * When prefer sibling, evenly spread running tasks on
> +			 * groups.
> +			 */
> +			env->balance_type = migrate_task;
> +			env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> +			return;
> +		}
> +
> +		/*
> +		 * If there is no overload, we just want to even the number of
> +		 * idle cpus.
> +		 */
> +		env->balance_type = migrate_task;
> +		env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>  		return;
>  	}
>  
>  	/*
> -	 * If there aren't any idle CPUs, avoid creating some.
> +	 * Local is fully busy but have to take more load to relieve the
> +	 * busiest group
>  	 */
> -	if (busiest->group_type == group_overloaded &&
> -	    local->group_type   == group_overloaded) {
> -		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> -		if (load_above_capacity > busiest->group_capacity) {
> -			load_above_capacity -= busiest->group_capacity;
> -			load_above_capacity *= scale_load_down(NICE_0_LOAD);
> -			load_above_capacity /= busiest->group_capacity;
> -		} else
> -			load_above_capacity = ~0UL;
> +	if (local->group_type < group_overloaded) {
> +		/*
> +		 * Local will become overvloaded so the avg_load metrics are
                                     ^^^^^^^^^^^
                            s/overvloaded/overloaded/

> +		 * finally needed
> +		 */
> +
> +		local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
> +				  local->group_capacity;
> +
> +		sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
> +				sds->total_capacity;
>  	}
>  
>  	/*
> -	 * We're trying to get all the CPUs to the average_load, so we don't
> -	 * want to push ourselves above the average load, nor do we wish to
> -	 * reduce the max loaded CPU below the average load. At the same time,
> -	 * we also don't want to reduce the group load below the group
> -	 * capacity. Thus we look for the minimum possible imbalance.
> +	 * Both group are or will become overloaded and we're trying to get
> +	 * all the CPUs to the average_load, so we don't want to push
> +	 * ourselves above the average load, nor do we wish to reduce the
> +	 * max loaded CPU below the average load. At the same time, we also
> +	 * don't want to reduce the group load below the group capacity.
> +	 * Thus we look for the minimum possible imbalance.
>  	 */
> -	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
> -
> -	/* How much load to actually move to equalise the imbalance */
> +	env->balance_type = migrate_load;
>  	env->imbalance = min(
> -		max_pull * busiest->group_capacity,
> +		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
>  		(sds->avg_load - local->avg_load) * local->group_capacity
>  	) / SCHED_CAPACITY_SCALE;
> -
> -	/* Boost imbalance to allow misfit task to be balanced. */
> -	if (busiest->group_type == group_misfit_task) {
> -		env->imbalance = max_t(long, env->imbalance,
> -				       busiest->group_misfit_task_load);
> -	}
> -
>  }
>  
>  /******* find_busiest_group() helpers end here *********************/
>  
> +/*
> + * Decision matrix according to the local and busiest group state
> + *
> + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
> + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
> + * misfit_task      force     N/A        N/A    N/A  force      force
> + * asym_capacity    force     force      N/A    N/A  force      force
> + * imbalanced       force     force      N/A    N/A  force      force
> + * overloaded       force     force      N/A    N/A  force      avg_load
> + *
> + * N/A :      Not Applicable because already filtered while updating
> + *            statistics.
> + * balanced : The system is balanced for these 2 groups.
> + * force :    Calculate the imbalance as load migration is probably needed.
> + * avg_load : Only if imbalance is significant enough.
> + * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
> + *            different in groups.
> + */
> +
>  /**
>   * find_busiest_group - Returns the busiest group within the sched_domain
>   * if there is an imbalance.

[...]
> @@ -8459,59 +8602,71 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		goto force_balance;
>  

As for their counterpart in calculate_imbalance(), maybe go for a switch?

Also, it would be nice if the ordering of these matched the one in
calculate_imbalance() - right now it's the exact reverse order.

Finally, would it hurt much to just move the balance_type and imbalance
computation for these group types here? It breaks the nice partitions
you've set up between find_busiest_group() and calculate_imbalance(),
but it leads to fewer redirections for what in the end is just a simple
condition.
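
Roughly this shape (the exact checks sit above this hunk so I can't quote
them, treat the cases as placeholders):

  	switch (busiest->group_type) {
  	case group_misfit_task:
  	case group_asym_capacity:
  	case group_imbalanced:
  		/*
  		 * Could set env->balance_type / env->imbalance right here and
  		 * let calculate_imbalance() only handle the remaining types.
  		 */
  		goto force_balance;

  	default:
  		break;
  	}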

>  	/*
> -	 * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
> -	 * capacities from resulting in underutilization due to avg_load.
> -	 */
> -	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> -	    busiest->group_no_capacity)
> -		goto force_balance;
> -
> -	/* Misfit tasks should be dealt with regardless of the avg load */
> -	if (busiest->group_type == group_misfit_task)
> -		goto force_balance;
> -
> -	/*
>  	 * If the local group is busier than the selected busiest group
>  	 * don't try and pull any tasks.
>  	 */
> -	if (local->avg_load >= busiest->avg_load)
> +	if (local->group_type > busiest->group_type)
>  		goto out_balanced;
>  
>  	/*
> -	 * Don't pull any tasks if this group is already above the domain
> -	 * average load.
> +	 * When groups are overloaded, use the avg_load to ensure fairness
> +	 * between tasks.
>  	 */
> -	if (local->avg_load >= sds.avg_load)
> -		goto out_balanced;
> +	if (local->group_type == group_overloaded) {
> +		/*
> +		 * If the local group is more loaded than the selected
> +		 * busiest group don't try and pull any tasks.
> +		 */
> +		if (local->avg_load >= busiest->avg_load)
> +			goto out_balanced;
> +
> +		/* XXX broken for overlapping NUMA groups */
> +		sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
> +				sds.total_capacity;
>  
> -	if (env->idle == CPU_IDLE) {
>  		/*
> -		 * This CPU is idle. If the busiest group is not overloaded
> -		 * and there is no imbalance between this and busiest group
> -		 * wrt idle CPUs, it is balanced. The imbalance becomes
> -		 * significant if the diff is greater than 1 otherwise we
> -		 * might end up to just move the imbalance on another group
> +		 * Don't pull any tasks if this group is already above the
> +		 * domain average load.
>  		 */
> -		if ((busiest->group_type != group_overloaded) &&
> -				(local->idle_cpus <= (busiest->idle_cpus + 1)))
> +		if (local->avg_load >= sds.avg_load)
>  			goto out_balanced;
> -	} else {
> +
>  		/*
> -		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> -		 * imbalance_pct to be conservative.
> +		 * If the busiest group is more loaded, use imbalance_pct to be
> +		 * conservative.
>  		 */
>  		if (100 * busiest->avg_load <=
>  				env->sd->imbalance_pct * local->avg_load)
>  			goto out_balanced;
> +
>  	}
>  
> +	/* Try to move all excess tasks to child's sibling domain */
> +	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> +	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> +		goto force_balance;
> +
> +	if (busiest->group_type != group_overloaded &&
> +	     (env->idle == CPU_NOT_IDLE ||

Disregarding the idle_cpus count, we never balance load when the busiest
is < group_misfit_task and we're CPU_NOT_IDLE.

I *think* that's okay, since AFAICT that should mean

  (local)   nr_running < group_weight
  (busiest) nr_running <= group_weight
            (or that weird == case)

and if we (local) are not idle then we shouldn't pull anything. Bleh, guess
it made me scratch my head for nothing.

> +	      local->idle_cpus <= (busiest->idle_cpus + 1)))
> +		/*
> +		 * If the busiest group is not overloaded
> +		 * and there is no imbalance between this and busiest group
> +		 * wrt idle CPUs, it is balanced. The imbalance
> +		 * becomes significant if the diff is greater than 1 otherwise
> +		 * we might end up to just move the imbalance on another
> +		 * group.
> +		 */
> +		goto out_balanced;
> +
>  force_balance:
>  	/* Looks like there is an imbalance. Compute it */
> -	env->src_grp_type = busiest->group_type;
>  	calculate_imbalance(env, &sds);
> +
>  	return env->imbalance ? sds.busiest : NULL;
>  
>  out_balanced:
> +
>  	env->imbalance = 0;
>  	return NULL;
>  }

[...]
> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	env.src_rq = busiest;
>  
>  	ld_moved = 0;
> -	if (busiest->cfs.h_nr_running > 1) {
> +	if (busiest->nr_running > 1) {

Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
tasks.

>  		/*
>  		 * Attempt to move tasks. If find_busiest_group has found
>  		 * an imbalance but busiest->nr_running <= 1, the group is
> 

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-06 17:17   ` Valentin Schneider
@ 2019-08-07 11:16     ` Valentin Schneider
  2019-08-26 10:11     ` Vincent Guittot
  1 sibling, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-08-07 11:16 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel, mingo, peterz
  Cc: pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen

On 06/08/2019 18:17, Valentin Schneider wrote:
>> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  	env.src_rq = busiest;
>>  
>>  	ld_moved = 0;
>> -	if (busiest->cfs.h_nr_running > 1) {
>> +	if (busiest->nr_running > 1) {
> 
> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
> tasks.
> 

Wait, so that seems to be a correction of an over-zealous rename in patch
2/8, but I think we actually *do* want it to be a cfs.h_nr_running check
here.

And actually this made me have a think about our active balance checks,
I'm cooking something up in that regard.
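
For the record, the distinction I care about (my summary, not something
from the patch):

  	/*
  	 * rq->nr_running       - every runnable task (CFS, RT, DL, ...)
  	 * rq->cfs.h_nr_running - runnable CFS tasks only (hierarchical count,
  	 *                        throttled groups excluded)
  	 *
  	 * detach_tasks() can only pull CFS tasks, so gating on rq->nr_running
  	 * can make us walk a queue we can't pull anything from.
  	 */
  	if (busiest->cfs.h_nr_running > 1) {
  		/* worth trying detach_tasks() */
  	}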

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-05 17:07   ` Valentin Schneider
@ 2019-08-26  9:26     ` Vincent Guittot
  2019-08-28 10:25       ` Valentin Schneider
  0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2019-08-26  9:26 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On Mon, 5 Aug 2019 at 19:07, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Hi Vincent,
>
> Here's another batch of comments, still need to go through some more of it.
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> > The load_balance algorithm contains some heuristics which have becomes
>
> s/becomes/become/

yep for this one and the other typos

>
> > meaningless since the rework of metrics and the introduction of PELT.
>                                ^^^^^^^^^^
> Which metrics? I suppose you mean the s*_lb_stats structs?
>
> >
> > Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> > because everything is based on load whereas some imbalances are not
> > related to the load.
>
> Hmm, well, they don't start as wrong decisions, right? It's just that
> certain imbalance scenarios can't be solved by looking only at load.
>
> What about something along those lines?
>
> """
> Furthermore, load is an ill-suited metric for solving certain task
> placement imbalance scenarios. For instance, in the presence of idle CPUs,
> we should simply try to get at least one task per CPU, whereas the current
> load-based algorithm can actually leave idle CPUs alone simply because the
> load is somewhat balanced.
> """
>
> > The current algorithm ends up to create virtual and
>
> s/to create/creating/
>
> > meaningless value like the avg_load_per_task or tweaks the state of a
> > group to make it overloaded whereas it's not, in order to try to migrate
> > tasks.
> >
> > load_balance should better qualify the imbalance of the group and define
> > cleary what has to be moved to fix this imbalance.
>
> s/define cleary/clearly define/
>
> >
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> >       group_has_spare
> >       group_fully_busy
> >       group_misfit_task
> >       group_asym_capacity
> >       group_imbalanced
> >       group_overloaded
> >
> > Based on the type of sched_group, load_balance now sets what it wants to
> > move in order to fix the imnbalance. It can be some load as before but
>
> s/imnbalance/imbalance/
>
> > also some utilization, a number of task or a type of task:
> >       migrate_task
> >       migrate_util
> >       migrate_load
> >       migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymetrics system
>
> s/asymetrics/asymmetric/
>
> > - the case of cfs task preempted by other class
> > - the case of tasks not evenly spread on groups with spare capacity
> >
> > The load balance decisions have been gathered in 3 functions:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imabalance between local and
>
> s/imabalance/imbalance/
>
> >   busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> That's nothing new, isn't it? I think what you mean there is that the

There are 2 things:
- part of the algorithm is new and fixes wrong task placement
- everything has been consolidated in the 3 functions above whereas
there were some bypasses and hacks in the current code

> decisions have been consolidated in those areas, rather than being spread

I would not say that the code was spread all over the place because
90% was already correctly placed, but there were a few cases that had
been added outside these functions

> all over the place.
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 581 ++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 379 insertions(+), 202 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d7f4a7e..a8681c3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7136,13 +7136,28 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
> >
> >  enum fbq_type { regular, remote, all };
> >
> > +/*
> > + * group_type describes the group of CPUs at the moment of the load balance.
> > + * The enum is ordered by pulling priority, with the group with lowest priority
> > + * first so the groupe_type can be simply compared when selecting the busiest
> > + * group. see update_sd_pick_busiest().
> > + */
> >  enum group_type {
> > -     group_other = 0,
> > +     group_has_spare = 0,
> > +     group_fully_busy,
> >       group_misfit_task,
> > +     group_asym_capacity,
>
> That one got me confused - it's about asym packing, not asym capacity, and
> the name should reflect that. I'd vote for group_asym_packing to stay in
> line with what Quentin did for the sd shortcut pointers in

yep asym_packing is probably better

>
>   011b27bb5d31 ("sched/topology: Add lowest CPU asymmetry sched_domain level pointer")
>
> >       group_imbalanced,
> >       group_overloaded,
> >  };
> >
> > +enum migration_type {
> > +     migrate_task = 0,
> > +     migrate_util,
> > +     migrate_load,
> > +     migrate_misfit,
>
> nitpicking here: AFAICT the ordering of this doesn't really matter,
> could we place migrate_misfit next to migrate_task? As you call out in the
> header, we can migrate a number of tasks or a type of task, so these should
> be somewhat together.

misfit has been added last because it's specific whereas the others are
somewhat generic, and I want to keep the generic ones first and the
specific one last

>
> If we're afraid that we'll add other types of tasks later on and that this
> won't result in a neat append-to-the-end, we could reverse the order like
> so:
>
> migrate_load
> migrate_util
> migrate_task
> migrate_misfit

I can put them in this order

>
> [...]
> > @@ -7745,10 +7793,10 @@ struct sg_lb_stats {
> >  struct sd_lb_stats {
> >       struct sched_group *busiest;    /* Busiest group in this sd */
> >       struct sched_group *local;      /* Local group in this sd */
> > -     unsigned long total_running;
>
> Could be worth calling out in the log that this gets snipped out. Or it
> could go into its own small cleanup patch, since it's just an unused field.

I can mention it more specifically in the log but that's part of those
meaningless metrics which are no longer used
>
> [...]> @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >       if (sgs->group_type < busiest->group_type)
> >               return false;
> >
> > -     if (sgs->avg_load <= busiest->avg_load)
> > +     /*
> > +      * The candidate and the current busiest group are the same type of
> > +      * group. Let check which one is the busiest according to the type.
> > +      */
> > +
> > +     switch (sgs->group_type) {
> > +     case group_overloaded:
> > +             /* Select the overloaded group with highest avg_load. */
> > +             if (sgs->avg_load <= busiest->avg_load)
> > +                     return false;
> > +             break;
> > +
> > +     case group_imbalanced:
> > +             /* Select the 1st imbalanced group as we don't have
> > +              * any way to choose one more than another
> > +              */
> >               return false;
> > +             break;
>
> You already have an unconditional return above.

good point

>
> >
> > -     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> > -             goto asym_packing;
> > +     case group_asym_capacity:
> > +             /* Prefer to move from lowest priority CPU's work */
> > +             if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> > +                      return false;
>                         ^
>                 Stray whitespace
>
> [...]
> > @@ -8438,17 +8581,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >       local = &sds.local_stat;
> >       busiest = &sds.busiest_stat;
> >
> > -     /* ASYM feature bypasses nice load balance check */
> > -     if (busiest->group_asym_capacity)
> > -             goto force_balance;
> > -
> >       /* There is no busy sibling group to pull tasks from */
> >       if (!sds.busiest || busiest->sum_h_nr_running == 0)
>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> That can go, since you now filter this in update_sd_pick_busiest()

yes

>
> [...]
> > @@ -8459,59 +8602,71 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >               goto force_balance;
> >
> >       /*
> > -      * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
> > -      * capacities from resulting in underutilization due to avg_load.
> > -      */
> > -     if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > -         busiest->group_no_capacity)
> > -             goto force_balance;
> > -
> > -     /* Misfit tasks should be dealt with regardless of the avg load */
> > -     if (busiest->group_type == group_misfit_task)
> > -             goto force_balance;
> > -
> > -     /*
> >        * If the local group is busier than the selected busiest group
> >        * don't try and pull any tasks.
> >        */
> > -     if (local->avg_load >= busiest->avg_load)
> > +     if (local->group_type > busiest->group_type)
> >               goto out_balanced;
> >
> >       /*
> > -      * Don't pull any tasks if this group is already above the domain
> > -      * average load.
> > +      * When groups are overloaded, use the avg_load to ensure fairness
> > +      * between tasks.
> >        */
> > -     if (local->avg_load >= sds.avg_load)
> > -             goto out_balanced;
> > +     if (local->group_type == group_overloaded) {
> > +             /*
> > +              * If the local group is more loaded than the selected
> > +              * busiest group don't try and pull any tasks.
> > +              */
> > +             if (local->avg_load >= busiest->avg_load)
> > +                     goto out_balanced;
> > +
> > +             /* XXX broken for overlapping NUMA groups */
> > +             sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
> > +                             sds.total_capacity;
> >
> > -     if (env->idle == CPU_IDLE) {
> >               /*
> > -              * This CPU is idle. If the busiest group is not overloaded
> > -              * and there is no imbalance between this and busiest group
> > -              * wrt idle CPUs, it is balanced. The imbalance becomes
> > -              * significant if the diff is greater than 1 otherwise we
> > -              * might end up to just move the imbalance on another group
> > +              * Don't pull any tasks if this group is already above the
> > +              * domain average load.
> >                */
> > -             if ((busiest->group_type != group_overloaded) &&
> > -                             (local->idle_cpus <= (busiest->idle_cpus + 1)))
> > +             if (local->avg_load >= sds.avg_load)
> >                       goto out_balanced;
> > -     } else {
> > +
> >               /*
> > -              * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> > -              * imbalance_pct to be conservative.
> > +              * If the busiest group is more loaded, use imbalance_pct to be
> > +              * conservative.
> >                */
> >               if (100 * busiest->avg_load <=
> >                               env->sd->imbalance_pct * local->avg_load)
> >                       goto out_balanced;
> > +
> >       }
> >
> > +     /* Try to move all excess tasks to child's sibling domain */
> > +     if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > +         busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> > +             goto force_balance;
> > +
> > +     if (busiest->group_type != group_overloaded &&
> > +          (env->idle == CPU_NOT_IDLE ||
> > +           local->idle_cpus <= (busiest->idle_cpus + 1)))
> > +             /*
> > +              * If the busiest group is not overloaded
> > +              * and there is no imbalance between this and busiest group
> > +              * wrt idle CPUs, it is balanced. The imbalance
> > +              * becomes significant if the diff is greater than 1 otherwise
> > +              * we might end up to just move the imbalance on another
> > +              * group.
> > +              */
> > +             goto out_balanced;
> > +
> >  force_balance:
> >       /* Looks like there is an imbalance. Compute it */
> > -     env->src_grp_type = busiest->group_type;
> >       calculate_imbalance(env, &sds);
> > +
>
> Stray newline?
>
> >       return env->imbalance ? sds.busiest : NULL;
> >
> >  out_balanced:
> > +
>
> Ditto?
>
> >       env->imbalance = 0;
> >       return NULL;
> >  }
> [...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-06 15:56   ` Peter Zijlstra
@ 2019-08-26  9:31     ` Vincent Guittot
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-26  9:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On Tue, 6 Aug 2019 at 17:56, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 01, 2019 at 04:40:20PM +0200, Vincent Guittot wrote:
> > The load_balance algorithm contains some heuristics which have becomes
> > meaningless since the rework of metrics and the introduction of PELT.
> >
> > Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> > because everything is based on load whereas some imbalances are not
> > related to the load. The current algorithm ends up to create virtual and
> > meaningless value like the avg_load_per_task or tweaks the state of a
> > group to make it overloaded whereas it's not, in order to try to migrate
> > tasks.
> >
> > load_balance should better qualify the imbalance of the group and define
> > cleary what has to be moved to fix this imbalance.
> >
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> >       group_has_spare
> >       group_fully_busy
> >       group_misfit_task
> >       group_asym_capacity
> >       group_imbalanced
> >       group_overloaded
> >
> > Based on the type of sched_group, load_balance now sets what it wants to
> > move in order to fix the imnbalance. It can be some load as before but
> > also some utilization, a number of task or a type of task:
> >       migrate_task
> >       migrate_util
> >       migrate_load
> >       migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymetrics system
> > - the case of cfs task preempted by other class
> > - the case of tasks not evenly spread on groups with spare capacity
> >
> > The load balance decisions have been gathered in 3 functions:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imabalance between local and
> >   busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> I like this lots, very good stuff.
>
> It is weird how misfit is still load, but istr later patches cure that.
>
> Here's a whole pile of nits, some overlap with what Valentin already
> noted. His suggestions for the changelog also make sense to me.

ok.
Thanks

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8205,10 +8205,10 @@ static bool update_sd_pick_busiest(struc
>  {
>         struct sg_lb_stats *busiest = &sds->busiest_stat;
>
> -
>         /* Make sure that there is at least one task to pull */
>         if (!sgs->sum_h_nr_running)
>                 return false;
> +
>         /*
>          * Don't try to pull misfit tasks we can't help.
>          * We can use max_capacity here as reduction in capacity on some
> @@ -8239,11 +8239,11 @@ static bool update_sd_pick_busiest(struc
>                 break;
>
>         case group_imbalanced:
> -               /* Select the 1st imbalanced group as we don't have
> -                * any way to choose one more than another
> +               /*
> +                * Select the 1st imbalanced group as we don't have any way to
> +                * choose one more than another
>                  */
>                 return false;
> -               break;
>
>         case group_asym_capacity:
>                 /* Prefer to move from lowest priority CPU's work */
> @@ -8253,8 +8253,8 @@ static bool update_sd_pick_busiest(struc
>
>         case group_misfit_task:
>                 /*
> -                * If we have more than one misfit sg go with the
> -                * biggest misfit.
> +                * If we have more than one misfit sg go with the biggest
> +                * misfit.
>                  */
>                 if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
>                         return false;
> @@ -8262,14 +8262,14 @@ static bool update_sd_pick_busiest(struc
>
>         case group_fully_busy:
>                 /*
> -                * Select the fully busy group with highest avg_load.
> -                * In theory, there is no need to pull task from such
> -                * kind of group because tasks have all compute
> -                * capacity that they need but we can still improve the
> -                * overall throughput by reducing contention when
> -                * accessing shared HW resources.
> -                * XXX for now avg_load is not computed and always 0 so
> -                * we select the 1st one.
> +                * Select the fully busy group with highest avg_load.  In
> +                * theory, there is no need to pull task from such kind of
> +                * group because tasks have all compute capacity that they need
> +                * but we can still improve the overall throughput by reducing
> +                * contention when accessing shared HW resources.
> +                *
> +                * XXX for now avg_load is not computed and always 0 so we
> +                * select the 1st one.
>                  */
>                 if (sgs->avg_load <= busiest->avg_load)
>                         return false;
> @@ -8277,11 +8277,11 @@ static bool update_sd_pick_busiest(struc
>
>         case group_has_spare:
>                 /*
> -                * Select not overloaded group with lowest number of
> -                * idle cpus. We could also compare the spare capacity
> -                * which is more stable but it can end up that the
> -                * group has less spare capacity but finally more idle
> -                * cpus which means less opportunity to pull tasks.
> +                * Select not overloaded group with lowest number of idle cpus.
> +                * We could also compare the spare capacity which is more
> +                * stable but it can end up that the group has less spare
> +                * capacity but finally more idle cpus which means less
> +                * opportunity to pull tasks.
>                  */
>                 if (sgs->idle_cpus >= busiest->idle_cpus)
>                         return false;
> @@ -8289,10 +8289,10 @@ static bool update_sd_pick_busiest(struc
>         }
>
>         /*
> -        * Candidate sg has no more than one task per CPU and
> -        * has higher per-CPU capacity. Migrating tasks to less
> -        * capable CPUs may harm throughput. Maximize throughput,
> -        * power/energy consequences are not considered.
> +        * Candidate sg has no more than one task per CPU and has higher
> +        * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> +        * throughput. Maximize throughput, power/energy consequences are not
> +        * considered.
>          */
>         if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
>             (sgs->group_type <= group_fully_busy) &&
> @@ -8428,6 +8428,7 @@ static inline void calculate_imbalance(s
>
>         local = &sds->local_stat;
>         busiest = &sds->busiest_stat;
> +
>         if (busiest->group_type == group_imbalanced) {
>                 /*
>                  * In the group_imb case we cannot rely on group-wide averages
> @@ -8442,8 +8443,8 @@ static inline void calculate_imbalance(s
>
>         if (busiest->group_type == group_asym_capacity) {
>                 /*
> -                * In case of asym capacity, we will try to migrate all load
> -                * to the preferred CPU
> +                * In case of asym capacity, we will try to migrate all load to
> +                * the preferred CPU
>                  */
>                 env->balance_type = migrate_load;
>                 env->imbalance = busiest->group_load;
> @@ -8483,7 +8484,8 @@ static inline void calculate_imbalance(s
>                          * groups.
>                          */
>                         env->balance_type = migrate_task;
> -                       env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> +                       env->imbalance = (busiest->sum_h_nr_running -
> +                                         local->sum_h_nr_running) >> 1;
>                         return;
>                 }
>
> @@ -8502,7 +8504,7 @@ static inline void calculate_imbalance(s
>          */
>         if (local->group_type < group_overloaded) {
>                 /*
> -                * Local will become overvloaded so the avg_load metrics are
> +                * Local will become overloaded so the avg_load metrics are
>                  * finally needed
>                  */
>
> @@ -8514,12 +8516,12 @@ static inline void calculate_imbalance(s
>         }
>
>         /*
> -        * Both group are or will become overloaded and we're trying to get
> -        * all the CPUs to the average_load, so we don't want to push
> -        * ourselves above the average load, nor do we wish to reduce the
> -        * max loaded CPU below the average load. At the same time, we also
> -        * don't want to reduce the group load below the group capacity.
> -        * Thus we look for the minimum possible imbalance.
> +        * Both group are or will become overloaded and we're trying to get all
> +        * the CPUs to the average_load, so we don't want to push ourselves
> +        * above the average load, nor do we wish to reduce the max loaded CPU
> +        * below the average load. At the same time, we also don't want to
> +        * reduce the group load below the group capacity.  Thus we look for
> +        * the minimum possible imbalance.
>          */
>         env->balance_type = migrate_load;
>         env->imbalance = min(
> @@ -8641,7 +8643,6 @@ static struct sched_group *find_busiest_
>                 if (100 * busiest->avg_load <=
>                                 env->sd->imbalance_pct * local->avg_load)
>                         goto out_balanced;
> -
>         }
>
>         /* Try to move all excess tasks to child's sibling domain */
> @@ -8651,7 +8652,7 @@ static struct sched_group *find_busiest_
>
>         if (busiest->group_type != group_overloaded &&
>              (env->idle == CPU_NOT_IDLE ||
> -             local->idle_cpus <= (busiest->idle_cpus + 1)))
> +             local->idle_cpus <= (busiest->idle_cpus + 1))) {
>                 /*
>                  * If the busiest group is not overloaded
>                  * and there is no imbalance between this and busiest group
> @@ -8661,15 +8662,14 @@ static struct sched_group *find_busiest_
>                  * group.
>                  */
>                 goto out_balanced;
> +       }
>
>  force_balance:
>         /* Looks like there is an imbalance. Compute it */
>         calculate_imbalance(env, &sds);
> -
>         return env->imbalance ? sds.busiest : NULL;
>
>  out_balanced:
> -
>         env->imbalance = 0;
>         return NULL;
>  }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-06 17:17   ` Valentin Schneider
  2019-08-07 11:16     ` Valentin Schneider
@ 2019-08-26 10:11     ` Vincent Guittot
  2019-08-28 14:19       ` Valentin Schneider
  1 sibling, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2019-08-26 10:11 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On Tue, 6 Aug 2019 at 19:17, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Second batch, get it while it's hot...
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> [...]
> > @@ -7438,19 +7453,53 @@ static int detach_tasks(struct lb_env *env)
> >               if (!can_migrate_task(p, env))
> >                       goto next;
> >
> > -             load = task_h_load(p);
> > +             switch (env->balance_type) {
> > +             case migrate_task:
> > +                     /* Migrate task */
> > +                     env->imbalance--;
> > +                     break;
> >
> > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > -                     goto next;
> > +             case migrate_util:
> > +                     util = task_util_est(p);
> >
> > -             if ((load / 2) > env->imbalance)
> > -                     goto next;
> > +                     if (util > env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance -= util;
> > +                     break;
> > +
> > +             case migrate_load:
> > +                     load = task_h_load(p);
> > +
> > +                     if (sched_feat(LB_MIN) &&
> > +                         load < 16 && !env->sd->nr_balance_failed)
> > +                             goto next;
> > +
> > +                     if ((load / 2) > env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance -= load;
> > +                     break;
> > +
> > +             case migrate_misfit:
> > +                     load = task_h_load(p);
> > +
> > +                     /*
> > +                      * utilization of misfit task might decrease a bit
> > +                      * since it has been recorded. Be conservative in the
> > +                      * condition.
> > +                      */
> > +                     if (load < env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance = 0;
> > +                     break;
> > +             }
> >
> >               detach_task(p, env);
> >               list_add(&p->se.group_node, &env->tasks);
> >
> >               detached++;
> > -             env->imbalance -= load;
> >
>
> It's missing something like this:

yes

>
> -----8<-----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0631ff2a4bd..055563e19090 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7506,7 +7506,7 @@ static int detach_tasks(struct lb_env *env)
>
>                 /*
>                  * We only want to steal up to the prescribed amount of
> -                * runnable load.
> +                * load/util/tasks
>                  */
>                 if (env->imbalance <= 0)
>                         break;
> ----->8-----
>
> [...]
> > @@ -8013,19 +8059,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> >  }
> >
> >  static inline enum
> > -group_type group_classify(struct sched_group *group,
> > +group_type group_classify(struct lb_env *env,
> > +                       struct sched_group *group,
> >                         struct sg_lb_stats *sgs)
> >  {
> > -     if (sgs->group_no_capacity)
> > +     if (group_is_overloaded(env, sgs))
> >               return group_overloaded;
> >
> >       if (sg_imbalanced(group))
> >               return group_imbalanced;
> >
> > +     if (sgs->group_asym_capacity)
> > +             return group_asym_capacity;
> > +
> >       if (sgs->group_misfit_task_load)
> >               return group_misfit_task;
> >
> > -     return group_other;
> > +     if (!group_has_capacity(env, sgs))
> > +             return group_fully_busy;
>
> If I got my booleans right, reaching group_fully_busy means
> !group_is_overloaded() && !group_has_capacity(), which gives us:
>
> - If nr_running > group_weight, then we had to have
>   sgs->group_capacity * 100 == sgs->group_util * env->sd->imbalance_pct
>
> - If nr_running == group_weight, then we had to have
>   sgs->group_capacity * 100 <= sgs->group_util * env->sd->imbalance_pct
>
> - (if nr_running < group_weight we go to group_has_spare)
>
> That first equality seems somewhat odd considering how rarely it will
> occur, but then we still want the util check to fall down to
> group_has_spare when
>
>   nr_running >= group_weight &&
>   sgs->group_capacity * 100 > sgs->group_util * env->sd->imbalance_pct
>
> Maybe we could change group_has_capacity()'s util comparison from '>' to
> '>=' to avoid this weird "artefact".
>
> > +
> > +     return group_has_spare;
> >  }
> >
> >  static bool update_nohz_stats(struct rq *rq, bool force)
>
> [...]
> > @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >       if (sgs->group_type < busiest->group_type)
> >               return false;
> >
> > -     if (sgs->avg_load <= busiest->avg_load)
> > +     /*
> > +      * The candidate and the current busiest group are the same type of
> > +      * group. Let check which one is the busiest according to the type.
> > +      */
> > +
> > +     switch (sgs->group_type) {
> > +     case group_overloaded:
> > +             /* Select the overloaded group with highest avg_load. */
> > +             if (sgs->avg_load <= busiest->avg_load)
> > +                     return false;
> > +             break;
> > +
> > +     case group_imbalanced:
> > +             /* Select the 1st imbalanced group as we don't have
> > +              * any way to choose one more than another
> > +              */
> >               return false;
> > +             break;
> >
> > -     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> > -             goto asym_packing;
> > +     case group_asym_capacity:
> > +             /* Prefer to move from lowest priority CPU's work */
> > +             if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> > +                      return false;
> > +             break;
> > +
> > +     case group_misfit_task:
> > +             /*
> > +              * If we have more than one misfit sg go with the
> > +              * biggest misfit.
> > +              */
> > +             if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> > +                     return false;
> > +             break;
> > +
> > +     case group_fully_busy:
> > +             /*
> > +              * Select the fully busy group with highest avg_load.
> > +              * In theory, there is no need to pull task from such
> > +              * kind of group because tasks have all compute
> > +              * capacity that they need but we can still improve the
> > +              * overall throughput by reducing contention when
> > +              * accessing shared HW resources.
> > +              * XXX for now avg_load is not computed and always 0 so
> > +              * we select the 1st one.
> > +              */
>
> What's wrong with unconditionally computing avg_load in update_sg_lb_stats()?

It avoids a useless division, which can be expensive

> We already unconditionally accumulate group_load anyway.

the accumulation must be done while looping over the group, whereas
computing avg_load can be done only when needed
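
Something like this (names approximate, just to illustrate the split):

  	/* per-CPU loop in update_sg_lb_stats(): cheap additions only */
  	for_each_cpu(i, sched_group_span(group)) {
  		struct rq *rq = cpu_rq(i);

  		sgs->group_load += cpu_runnable_load(rq);
  		sgs->group_util += cpu_util(i);
  		/* ... */
  	}

  	/* deferred: only where avg_load is actually needed, e.g. in
  	 * calculate_imbalance() for the overloaded case */
  	sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
  			sgs->group_capacity;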

>
> If it's to make way for patch 6/8 (using load instead of runnable load),
> then I think you are doing things in the wrong order. IMO in this patch we
> should unconditionally compute avg_load (using runnable load), and then
> you could tweak it up in a subsequent patch.

In fact, it's not linked to patch 6/8.
It's only that I initially wanted to use load only when overloaded,
but then I got this case and thought that comparing avg_load could be
interesting, though without any proof that it's worth it.
As mentioned in the comment, tasks in this group have enough capacity
and there is no need to move tasks in theory. This is there mainly to
trigger the discussion and keep in mind a possible area of improvement
for a next step.
I haven't run tests or done more study on this particular case to make
sure that there would be some benefit to comparing avg_load.

So in the future, we might end up always computing avg_load and
comparing it to select the busiest fully busy group

>
> > +             if (sgs->avg_load <= busiest->avg_load)
> > +                     return false;
> > +             break;
> > +
> > +     case group_has_spare:
> > +             /*
> > +              * Select not overloaded group with lowest number of
> > +              * idle cpus. We could also compare the spare capacity
> > +              * which is more stable but it can end up that the
> > +              * group has less spare capacity but finally more idle
> > +              * cpus which means less opportunity to pull tasks.
> > +              */
> > +             if (sgs->idle_cpus >= busiest->idle_cpus)
> > +                     return false;
> > +             break;
> > +     }
> >
> >       /*
> >        * Candidate sg has no more than one task per CPU and
>
> [...]
> > @@ -8341,69 +8421,132 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >   */
> >  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> >  {
> > -     unsigned long max_pull, load_above_capacity = ~0UL;
> >       struct sg_lb_stats *local, *busiest;
> >
> >       local = &sds->local_stat;
> >       busiest = &sds->busiest_stat;
> > +     if (busiest->group_type == group_imbalanced) {
> > +             /*
> > +              * In the group_imb case we cannot rely on group-wide averages
> > +              * to ensure CPU-load equilibrium, try to move any task to fix
> > +              * the imbalance. The next load balance will take care of
> > +              * balancing back the system.
> > +              */
> > +             env->balance_type = migrate_task;
> > +             env->imbalance = 1;
> > +             return;
> > +     }
> >
> > -     if (busiest->group_asym_capacity) {
> > +     if (busiest->group_type == group_asym_capacity) {
> > +             /*
> > +              * In case of asym capacity, we will try to migrate all load
> > +              * to the preferred CPU
> > +              */
> > +             env->balance_type = migrate_load;
> >               env->imbalance = busiest->group_load;
> >               return;
> >       }
> >
> > +     if (busiest->group_type == group_misfit_task) {
> > +             /* Set imbalance to allow misfit task to be balanced. */
> > +             env->balance_type = migrate_misfit;
> > +             env->imbalance = busiest->group_misfit_task_load;
>
> AFAICT we don't ever use this value, other than setting it to 0 in
> detach_tasks(), so what we actually set it to doesn't matter (as long as
> it's > 0).

Not yet.
It's only in patch 8/8 that we check whether the task fits the CPU's
capacity during detach_tasks
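
i.e. the migrate_misfit case there ends up looking something like this
(rough shape from memory; the real check is in patch 8/8):

  		case migrate_misfit:
  			/* Skip tasks that actually fit on the source CPU */
  			if (task_fits_capacity(p, capacity_of(env->src_cpu)))
  				goto next;

  			env->imbalance = 0;
  			break;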

>
> I'd re-suggest folding migrate_misfit into migrate_task, which is doable if
> we reinstate lb_env.src_grp_type (or rather, not delete it in this patch),
> though it makes some other places somewhat uglier. The good thing is that
> it actually does end up being implemented as a special kind of task
> migration, rather than being loosely related.

I prefer to keep it separate instead of adding a sub-case in migrate_task

>
> Below's a diff, only compile-tested but should help show my point.
> -----8<-----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a8681c32445b..b0631ff2a4bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7155,7 +7155,6 @@ enum migration_type {
>         migrate_task = 0,
>         migrate_util,
>         migrate_load,
> -       migrate_misfit,
>  };
>
>  #define LBF_ALL_PINNED 0x01
> @@ -7188,6 +7187,7 @@ struct lb_env {
>         unsigned int            loop_max;
>
>         enum fbq_type           fbq_type;
> +       enum group_type         src_grp_type;
>         enum migration_type     balance_type;
>         struct list_head        tasks;
>  };
> @@ -7455,6 +7455,13 @@ static int detach_tasks(struct lb_env *env)
>
>                 switch (env->balance_type) {
>                 case migrate_task:
> +                       /* Check for special kinds of tasks */
> +                       if (env->src_grp_type == group_misfit_task) {
> +                               /* This isn't the misfit task */
> +                               if (task_fits_capacity(p, capacity_of(env->src_cpu)))
> +                                       goto next;
> +                       }
> +
>                         /* Migrate task */
>                         env->imbalance--;
>                         break;
> @@ -7480,20 +7487,6 @@ static int detach_tasks(struct lb_env *env)
>
>                         env->imbalance -= load;
>                         break;
> -
> -               case migrate_misfit:
> -                       load = task_h_load(p);
> -
> -                       /*
> -                        * utilization of misfit task might decrease a bit
> -                        * since it has been recorded. Be conservative in the
> -                        * condition.
> -                        */
> -                       if (load < env->imbalance)
> -                               goto next;
> -
> -                       env->imbalance = 0;
> -                       break;
>                 }
>
>                 detach_task(p, env);
> @@ -8449,8 +8442,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>
>         if (busiest->group_type == group_misfit_task) {
>                 /* Set imbalance to allow misfit task to be balanced. */
> -               env->balance_type = migrate_misfit;
> -               env->imbalance = busiest->group_misfit_task_load;
> +               env->balance_type = migrate_task;
> +               env->imbalance = 1;
>                 return;
>         }
>
> @@ -8661,8 +8654,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>
>  force_balance:
>         /* Looks like there is an imbalance. Compute it */
> +       env->src_grp_type = busiest->group_type;
>         calculate_imbalance(env, &sds);
> -
>         return env->imbalance ? sds.busiest : NULL;
>
>  out_balanced:
> @@ -8728,7 +8721,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>
>                 switch (env->balance_type) {
>                 case migrate_task:
> -                       if (busiest_nr < nr_running) {
> +                       /*
> +                        * For ASYM_CPUCAPACITY domains with misfit tasks we simply
> +                        * seek the "biggest" misfit task.
> +                        */
> +                       if (env->src_grp_type == group_misfit_task &&
> +                           rq->misfit_task_load > busiest_load) {
> +                               busiest_load = rq->misfit_task_load;
> +                               busiest = rq;
> +                       } else if (busiest_nr < nr_running) {
>                                 busiest_nr = nr_running;
>                                 busiest = rq;
>                         }
> @@ -8771,19 +8772,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                                 busiest = rq;
>                         }
>                         break;
> -
> -               case migrate_misfit:
> -                       /*
> -                        * For ASYM_CPUCAPACITY domains with misfit tasks we simply
> -                        * seek the "biggest" misfit task.
> -                        */
> -                       if (rq->misfit_task_load > busiest_load) {
> -                               busiest_load = rq->misfit_task_load;
> -                               busiest = rq;
> -                       }
> -
> -                       break;
> -
>                 }
>         }
>
> @@ -8829,7 +8817,7 @@ voluntary_active_balance(struct lb_env *env)
>                         return 1;
>         }
>
> -       if (env->balance_type == migrate_misfit)
> +       if (env->src_grp_type == group_misfit_task)
>                 return 1;
>
>         return 0;
> ----->8-----
>
>
> > +             return;
> > +     }
> > +
>
> Also, I think these three busiest->group_type conditions could be turned
> into a switch case.
>
> >       /*
> > -      * Avg load of busiest sg can be less and avg load of local sg can
> > -      * be greater than avg load across all sgs of sd because avg load
> > -      * factors in sg capacity and sgs with smaller group_type are
> > -      * skipped when updating the busiest sg:
> > +      * Try to use spare capacity of local group without overloading it or
> > +      * emptying busiest
> >        */
> > -     if (busiest->group_type != group_misfit_task &&
> > -         (busiest->avg_load <= sds->avg_load ||
> > -          local->avg_load >= sds->avg_load)) {
> > -             env->imbalance = 0;
> > +     if (local->group_type == group_has_spare) {
> > +             if (busiest->group_type > group_fully_busy) {
> > +                     /*
> > +                      * If busiest is overloaded, try to fill spare
> > +                      * capacity. This might end up creating spare capacity
> > +                      * in busiest or busiest still being overloaded but
> > +                      * there is no simple way to directly compute the
> > +                      * amount of load to migrate in order to balance the
> > +                      * system.
> > +                      */
> > +                     env->balance_type = migrate_util;
> > +                     env->imbalance = max(local->group_capacity, local->group_util) -
> > +                                 local->group_util;
> > +                     return;
> > +             }
> > +
> > +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > +                     /*
> > +                      * When prefer sibling, evenly spread running tasks on
> > +                      * groups.
> > +                      */
> > +                     env->balance_type = migrate_task;
> > +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> > +                     return;
> > +             }
> > +
> > +             /*
> > +              * If there is no overload, we just want to even the number of
> > +              * idle cpus.
> > +              */
> > +             env->balance_type = migrate_task;
> > +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> >               return;
> >       }
> >
> >       /*
> > -      * If there aren't any idle CPUs, avoid creating some.
> > +      * Local is fully busy but have to take more load to relieve the
> > +      * busiest group
> >        */
> > -     if (busiest->group_type == group_overloaded &&
> > -         local->group_type   == group_overloaded) {
> > -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> > -             if (load_above_capacity > busiest->group_capacity) {
> > -                     load_above_capacity -= busiest->group_capacity;
> > -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
> > -                     load_above_capacity /= busiest->group_capacity;
> > -             } else
> > -                     load_above_capacity = ~0UL;
> > +     if (local->group_type < group_overloaded) {
> > +             /*
> > +              * Local will become overvloaded so the avg_load metrics are
>                                      ^^^^^^^^^^^
>                             s/overvloaded/overloaded/
>
> > +              * finally needed
> > +              */
> > +
> > +             local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
> > +                               local->group_capacity;
> > +
> > +             sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
> > +                             sds->total_capacity;
> >       }
> >
> >       /*
> > -      * We're trying to get all the CPUs to the average_load, so we don't
> > -      * want to push ourselves above the average load, nor do we wish to
> > -      * reduce the max loaded CPU below the average load. At the same time,
> > -      * we also don't want to reduce the group load below the group
> > -      * capacity. Thus we look for the minimum possible imbalance.
> > +      * Both group are or will become overloaded and we're trying to get
> > +      * all the CPUs to the average_load, so we don't want to push
> > +      * ourselves above the average load, nor do we wish to reduce the
> > +      * max loaded CPU below the average load. At the same time, we also
> > +      * don't want to reduce the group load below the group capacity.
> > +      * Thus we look for the minimum possible imbalance.
> >        */
> > -     max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
> > -
> > -     /* How much load to actually move to equalise the imbalance */
> > +     env->balance_type = migrate_load;
> >       env->imbalance = min(
> > -             max_pull * busiest->group_capacity,
> > +             (busiest->avg_load - sds->avg_load) * busiest->group_capacity,
> >               (sds->avg_load - local->avg_load) * local->group_capacity
> >       ) / SCHED_CAPACITY_SCALE;
> > -
> > -     /* Boost imbalance to allow misfit task to be balanced. */
> > -     if (busiest->group_type == group_misfit_task) {
> > -             env->imbalance = max_t(long, env->imbalance,
> > -                                    busiest->group_misfit_task_load);
> > -     }
> > -
> >  }
> >
> >  /******* find_busiest_group() helpers end here *********************/
> >
> > +/*
> > + * Decision matrix according to the local and busiest group state
> > + *
> > + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> > + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
> > + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
> > + * misfit_task      force     N/A        N/A    N/A  force      force
> > + * asym_capacity    force     force      N/A    N/A  force      force
> > + * imbalanced       force     force      N/A    N/A  force      force
> > + * overloaded       force     force      N/A    N/A  force      avg_load
> > + *
> > + * N/A :      Not Applicable because already filtered while updating
> > + *            statistics.
> > + * balanced : The system is balanced for these 2 groups.
> > + * force :    Calculate the imbalance as load migration is probably needed.
> > + * avg_load : Only if imbalance is significant enough.
> > + * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
> > + *            different in groups.
> > + */
> > +
> >  /**
> >   * find_busiest_group - Returns the busiest group within the sched_domain
> >   * if there is an imbalance.
>
> [...]
> > @@ -8459,59 +8602,71 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >               goto force_balance;
> >
>
> As for their counterpart in calculate_imbalance(), maybe go for a switch?
>
> Also, it would be nice if the ordering of these matched the one in
> calculate_imbalance() - right now it's the exact reverse order.
>
> Finally, would it hurt much to just move the balance_type and imbalance
> computation for these group types here? It breaks the nice partitions
> you've set up between find_busiest_group() and calculate_imbalance(),
> but it leads to less redirections for what in the end is just a simple
> condition.
>
> >       /*
> > -      * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
> > -      * capacities from resulting in underutilization due to avg_load.
> > -      */
> > -     if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > -         busiest->group_no_capacity)
> > -             goto force_balance;
> > -
> > -     /* Misfit tasks should be dealt with regardless of the avg load */
> > -     if (busiest->group_type == group_misfit_task)
> > -             goto force_balance;
> > -
> > -     /*
> >        * If the local group is busier than the selected busiest group
> >        * don't try and pull any tasks.
> >        */
> > -     if (local->avg_load >= busiest->avg_load)
> > +     if (local->group_type > busiest->group_type)
> >               goto out_balanced;
> >
> >       /*
> > -      * Don't pull any tasks if this group is already above the domain
> > -      * average load.
> > +      * When groups are overloaded, use the avg_load to ensure fairness
> > +      * between tasks.
> >        */
> > -     if (local->avg_load >= sds.avg_load)
> > -             goto out_balanced;
> > +     if (local->group_type == group_overloaded) {
> > +             /*
> > +              * If the local group is more loaded than the selected
> > +              * busiest group don't try and pull any tasks.
> > +              */
> > +             if (local->avg_load >= busiest->avg_load)
> > +                     goto out_balanced;
> > +
> > +             /* XXX broken for overlapping NUMA groups */
> > +             sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
> > +                             sds.total_capacity;
> >
> > -     if (env->idle == CPU_IDLE) {
> >               /*
> > -              * This CPU is idle. If the busiest group is not overloaded
> > -              * and there is no imbalance between this and busiest group
> > -              * wrt idle CPUs, it is balanced. The imbalance becomes
> > -              * significant if the diff is greater than 1 otherwise we
> > -              * might end up to just move the imbalance on another group
> > +              * Don't pull any tasks if this group is already above the
> > +              * domain average load.
> >                */
> > -             if ((busiest->group_type != group_overloaded) &&
> > -                             (local->idle_cpus <= (busiest->idle_cpus + 1)))
> > +             if (local->avg_load >= sds.avg_load)
> >                       goto out_balanced;
> > -     } else {
> > +
> >               /*
> > -              * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> > -              * imbalance_pct to be conservative.
> > +              * If the busiest group is more loaded, use imbalance_pct to be
> > +              * conservative.
> >                */
> >               if (100 * busiest->avg_load <=
> >                               env->sd->imbalance_pct * local->avg_load)
> >                       goto out_balanced;
> > +
> >       }
> >
> > +     /* Try to move all excess tasks to child's sibling domain */
> > +     if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > +         busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> > +             goto force_balance;
> > +
> > +     if (busiest->group_type != group_overloaded &&
> > +          (env->idle == CPU_NOT_IDLE ||
>
> Disregarding the idle_cpus count, we never balance load when the busiest
> is < group_misfit_task and we're CPU_NOT_IDLE.
>
> I *think* that's okay, since AFAICT that should mean
>
>   (local)   nr_running < group_weight
>   (busiest) nr_running <= group_weight
>             (or that weird == case)
>
> and if we (local) are not idle then we shouldn't pull anything. Bleh, guess
> it made me scratch my head for nothing.
>
> > +           local->idle_cpus <= (busiest->idle_cpus + 1)))
> > +             /*
> > +              * If the busiest group is not overloaded
> > +              * and there is no imbalance between this and busiest group
> > +              * wrt idle CPUs, it is balanced. The imbalance
> > +              * becomes significant if the diff is greater than 1 otherwise
> > +              * we might end up to just move the imbalance on another
> > +              * group.
> > +              */
> > +             goto out_balanced;
> > +
> >  force_balance:
> >       /* Looks like there is an imbalance. Compute it */
> > -     env->src_grp_type = busiest->group_type;
> >       calculate_imbalance(env, &sds);
> > +
> >       return env->imbalance ? sds.busiest : NULL;
> >
> >  out_balanced:
> > +
> >       env->imbalance = 0;
> >       return NULL;
> >  }
>
> [...]
> > @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >       env.src_rq = busiest;
> >
> >       ld_moved = 0;
> > -     if (busiest->cfs.h_nr_running > 1) {
> > +     if (busiest->nr_running > 1) {
>
> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
> tasks.

There is the case raised by Srikar where we have, for example, 1 RT task
and 1 CFS task: cfs.h_nr_running == 1, but we don't need an active balance
because the CFS task is not the one currently running.
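
As a rough illustration of that scenario (values assumed):

	/*
	 * busiest rq: 1 RT task (running) + 1 CFS task (enqueued)
	 *   busiest->nr_running       == 2
	 *   busiest->cfs.h_nr_running == 1
	 * The enqueued CFS task can still be pulled by detach_tasks(),
	 * so checking nr_running > 1 keeps us on the normal pull path
	 * instead of needlessly falling back to active balance.
	 */
	if (busiest->nr_running > 1) {
		/* attempt to move tasks via detach_tasks()/attach_tasks() */
	}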

>
> >               /*
> >                * Attempt to move tasks. If find_busiest_group has found
> >                * an imbalance but busiest->nr_running <= 1, the group is
> >


* Re: [PATCH v2 6/8] sched/fair: use load instead of runnable load
  2019-08-06 16:07   ` Peter Zijlstra
@ 2019-08-26 15:45     ` Vincent Guittot
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-26 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On Tue, 6 Aug 2019 at 18:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 01, 2019 at 04:40:22PM +0200, Vincent Guittot wrote:
> > runnable load has been introduced to take into account the case
> > where blocked load biases the load balance decision which was selecting
> > underutilized group with huge blocked load whereas other groups were
> > overloaded.
> >
> > The load is now only used when groups are overloaded. In this case,
> > it's worth being conservative and taking into account the sleeping
> > tasks that might wakeup on the cpu.
>
> This one scares me a little. I have the feeling I'm missing/forgetting
> something.
>
> Also; while the regular load-balance (find-busiest) stuff is now all
> aware of idle, this change also impacts wake_affine and find_idlest, and
> they've not changed.

Yes. I thought about this a bit before applying these changes to all the
cpu_runnable_load uses.
- For wake_affine, it starts by looking for an idle cpu with
wake_affine_idle, which is similar in some ways to the new load balance
approach, even if it might need more changes to align both paths more
closely.
- For find_idlest, it looks at the group with the most spare capacity
first and falls back to load. That being said, I have overlooked that
both load and runnable load are already used there and cpu.load is saved
twice instead of once.

I can't remember the detailed results but this patch is responsible for
part of the perf improvements described in the cover letter.

Also, I haven't touched the NUMA stats, which still use runnable_load.
That should be addressed too.


* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-26  9:26     ` Vincent Guittot
@ 2019-08-28 10:25       ` Valentin Schneider
  0 siblings, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-08-28 10:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On 26/08/2019 10:26, Vincent Guittot wrote:
[...]
>>>   busiest group.
>>> - calculate_imbalance() decides what have to be moved.
>>
>> That's nothing new, isn't it? I think what you mean there is that the
> 
> There is 2 things:
> -part of the algorithm is new and fixes wrong task placement
> -everything has been consolidated in the 3 functions above whereas
> there were some bypasses and hack in the current code
> 

Right, something like that could be added in the changelog then.

[...]
>>> @@ -7745,10 +7793,10 @@ struct sg_lb_stats {
>>>  struct sd_lb_stats {
>>>       struct sched_group *busiest;    /* Busiest group in this sd */
>>>       struct sched_group *local;      /* Local group in this sd */
>>> -     unsigned long total_running;
>>
>> Could be worth calling out in the log that this gets snipped out. Or it
>> could go into its own small cleanup patch, since it's just an unused field.
> 
> I can mention it more specifically in the log but that's part of those
> meaningless metrics which is no more used

I'm a git blame addict so I like having things split up as much as possible
(within reason). Since that cleanup can live in its own patch, it should
be split as such IMO.


* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-26 10:11     ` Vincent Guittot
@ 2019-08-28 14:19       ` Valentin Schneider
  2019-08-29 14:26         ` Vincent Guittot
  0 siblings, 1 reply; 31+ messages in thread
From: Valentin Schneider @ 2019-08-28 14:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On 26/08/2019 11:11, Vincent Guittot wrote:
>>> +     case group_fully_busy:
>>> +             /*
>>> +              * Select the fully busy group with highest avg_load.
>>> +              * In theory, there is no need to pull task from such
>>> +              * kind of group because tasks have all compute
>>> +              * capacity that they need but we can still improve the
>>> +              * overall throughput by reducing contention when
>>> +              * accessing shared HW resources.
>>> +              * XXX for now avg_load is not computed and always 0 so
>>> +              * we select the 1st one.
>>> +              */
>>
>> What's wrong with unconditionally computing avg_load in update_sg_lb_stats()?
> 
> removing useless division which can be expensive
> 

Seeing how much stuff we already do in just computing the stats, do we
really save that much by doing this? I'd expect it to be negligible with
modern architectures and all of the OoO/voodoo, but maybe I need a 
refresher course.

>> We already unconditionally accumulate group_load anyway.
> 
> accumulation must be done while looping on the group whereas computing
> avg_load can be done only when needed
> 
>>
>> If it's to make way for patch 6/8 (using load instead of runnable load),
>> then I think you are doing things in the wrong order. IMO in this patch we
>> should unconditionally compute avg_load (using runnable load), and then
>> you could tweak it up in a subsequent patch.
> 
> In fact, it's not link to patch 6/8.
> It's only that I initially wanted to used load only when overloaded
> but then I got this case and thought that comparing avg_load could be
> interesting but without any proof that it's worth.
> As mentioned in the comment, tasks in this group have enough capacity
> and there is no need to move task in theory. This is there mainly to
> trigger the discuss and keep in mind a possible area of improvement in
> a next step.
> I haven't run tests or done more study on this particular case to make
> sure that there would be some benefit to compare avg_load.
> 
> So in the future, we might end up always computing avg_load and
> comparing it for selecting busiest fully busy group
> 

Okay, that definitely wants testing then.

[...]
>>> +     if (busiest->group_type == group_misfit_task) {
>>> +             /* Set imbalance to allow misfit task to be balanced. */
>>> +             env->balance_type = migrate_misfit;
>>> +             env->imbalance = busiest->group_misfit_task_load;
>>
>> AFAICT we don't ever use this value, other than setting it to 0 in
>> detach_tasks(), so what we actually set it to doesn't matter (as long as
>> it's > 0).
> 
> not yet.
> it's only in patch 8/8 that we check if the tasks fits the cpu's
> capacity during the detach_tasks
> 

But that doesn't use env->imbalance, right? With that v3 patch it's just
the task's util, so AFAICT my comment still stands.

>>
>> I'd re-suggest folding migrate_misfit into migrate_task, which is doable if
>> we re-instore lb_env.src_grp_type (or rather, not delete it in this patch),
>> though it makes some other places somewhat uglier. The good thing is that
>> it actually does end up being implemented as a special kind of task
>> migration, rather than being loosely related.
> 
> I prefer to keep it separate instead of adding a sub case in migrate_task
> 

My argument here is that ideally they shouldn't be separated, since the misfit
migration is a subcase of task migration (or an extension of it - in any
case, they're related). I haven't found a nicer way to express it though,
and I agree that the special casing in detach_tasks()/fbq()/etc is meh.

[...]
>>> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>       env.src_rq = busiest;
>>>
>>>       ld_moved = 0;
>>> -     if (busiest->cfs.h_nr_running > 1) {
>>> +     if (busiest->nr_running > 1) {
>>
>> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
>> tasks.
> 
> There is the case raised by srikar where we have for example 1 RT task
> and 1 CFS task. cfs.h_nr_running==1 but we don't need active balance
> because CFS is not the running task
> 
>>
>>>               /*
>>>                * Attempt to move tasks. If find_busiest_group has found
>>>                * an imbalance but busiest->nr_running <= 1, the group is
>>>


* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-28 14:19       ` Valentin Schneider
@ 2019-08-29 14:26         ` Vincent Guittot
  2019-08-30 14:33           ` Valentin Schneider
  0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2019-08-29 14:26 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On Wed, 28 Aug 2019 at 16:19, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 26/08/2019 11:11, Vincent Guittot wrote:
> >>> +     case group_fully_busy:
> >>> +             /*
> >>> +              * Select the fully busy group with highest avg_load.
> >>> +              * In theory, there is no need to pull task from such
> >>> +              * kind of group because tasks have all compute
> >>> +              * capacity that they need but we can still improve the
> >>> +              * overall throughput by reducing contention when
> >>> +              * accessing shared HW resources.
> >>> +              * XXX for now avg_load is not computed and always 0 so
> >>> +              * we select the 1st one.
> >>> +              */
> >>
> >> What's wrong with unconditionally computing avg_load in update_sg_lb_stats()?
> >
> > removing useless division which can be expensive
> >
>
> Seeing how much stuff we already do in just computing the stats, do we
> really save that much by doing this? I'd expect it to be negligible with
> modern architectures and all of the OoO/voodoo, but maybe I need a
> refresher course.

We are not only running on the latest/top-end architectures.

>
> >> We already unconditionally accumulate group_load anyway.
> >
> > accumulation must be done while looping on the group whereas computing
> > avg_load can be done only when needed
> >
> >>
> >> If it's to make way for patch 6/8 (using load instead of runnable load),
> >> then I think you are doing things in the wrong order. IMO in this patch we
> >> should unconditionally compute avg_load (using runnable load), and then
> >> you could tweak it up in a subsequent patch.
> >
> > In fact, it's not link to patch 6/8.
> > It's only that I initially wanted to used load only when overloaded
> > but then I got this case and thought that comparing avg_load could be
> > interesting but without any proof that it's worth.
> > As mentioned in the comment, tasks in this group have enough capacity
> > and there is no need to move task in theory. This is there mainly to
> > trigger the discuss and keep in mind a possible area of improvement in
> > a next step.
> > I haven't run tests or done more study on this particular case to make
> > sure that there would be some benefit to compare avg_load.
> >
> > So in the future, we might end up always computing avg_load and
> > comparing it for selecting busiest fully busy group
> >
>
> Okay, that definitely wants testing then.
>
> [...]
> >>> +     if (busiest->group_type == group_misfit_task) {
> >>> +             /* Set imbalance to allow misfit task to be balanced. */
> >>> +             env->balance_type = migrate_misfit;
> >>> +             env->imbalance = busiest->group_misfit_task_load;
> >>
> >> AFAICT we don't ever use this value, other than setting it to 0 in
> >> detach_tasks(), so what we actually set it to doesn't matter (as long as
> >> it's > 0).
> >
> > not yet.
> > it's only in patch 8/8 that we check if the tasks fits the cpu's
> > capacity during the detach_tasks
> >
>
> But that doesn't use env->imbalance, right? With that v3 patch it's just
> the task util's, so AFAICT my comment still stands.

No, the misfit case keeps using load and imbalance like the current
implementation in this patch.
The modifications to the way misfit tasks are handled are all in patch 8.

>
> >>
> >> I'd re-suggest folding migrate_misfit into migrate_task, which is doable if
> >> we re-instore lb_env.src_grp_type (or rather, not delete it in this patch),
> >> though it makes some other places somewhat uglier. The good thing is that
> >> it actually does end up being implemented as a special kind of task
> >> migration, rather than being loosely related.
> >
> > I prefer to keep it separate instead of adding a sub case in migrate_task
> >
>
> My argument here is that ideally they shouldn't be separated, since the misfit
> migration is a subcase of task migration (or an extension of it - in any
> case, they're related). I haven't found a nicer way to express it though,
> and I agree that the special casing in detach_tasks()/fbq()/etc is meh.
>
> [...]
> >>> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >>>       env.src_rq = busiest;
> >>>
> >>>       ld_moved = 0;
> >>> -     if (busiest->cfs.h_nr_running > 1) {
> >>> +     if (busiest->nr_running > 1) {
> >>
> >> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
> >> tasks.
> >
> > There is the case raised by srikar where we have for example 1 RT task
> > and 1 CFS task. cfs.h_nr_running==1 but we don't need active balance
> > because CFS is not the running task
> >
> >>
> >>>               /*
> >>>                * Attempt to move tasks. If find_busiest_group has found
> >>>                * an imbalance but busiest->nr_running <= 1, the group is
> >>>


* Re: [PATCH v2 0/8] sched/fair: rework the CFS load balance
  2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
                   ` (7 preceding siblings ...)
  2019-08-01 14:40 ` [PATCH v2 8/8] sched/fair: use utilization to select misfit task Vincent Guittot
@ 2019-08-29 19:23 ` Phil Auld
  2019-08-30  6:46   ` Vincent Guittot
       [not found] ` <20190809052124.13016-1-hdanton@sina.com>
  9 siblings, 1 reply; 31+ messages in thread
From: Phil Auld @ 2019-08-29 19:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, peterz, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen

On Thu, Aug 01, 2019 at 04:40:16PM +0200 Vincent Guittot wrote:
> Several wrong task placement have been raised with the current load
> balance algorithm but their fixes are not always straight forward and
> end up with using biased values to force migrations. A cleanup and rework
> of the load balance will help to handle such UCs and enable to fine grain
> the behavior of the scheduler for other cases.
> 
> Patch 1 has already been sent separatly and only consolidate asym policy
> in one place and help the review of the changes in load_balance.
> 
> Patch 2 renames the sum of h_nr_running in stats.
> 
> Patch 3 removes meaningless imbalance computation to make review of
> patch 4 easier.
> 
> Patch 4 reworks load_balance algorithm and fixes some wrong task placement
> but try to stay conservative.
> 
> Patch 5 add the sum of nr_running to monitor non cfs tasks and take that
> into account when pulling tasks.
> 
> Patch 6 replaces runnable_load by load now that the metrics is only used
> when overloaded.
> 
> Patch 7 improves the spread of tasks at the 1st scheduling level.
> 
> Patch 8 uses utilization instead of load in all steps of misfit task
> path.
> 
> Some benchmarks results based on 8 iterations of each tests:
> - small arm64 dual quad cores system
> 
>            tip/sched/core        w/ this patchset    improvement
> schedpipe      54981 +/-0.36%        55459 +/-0.31%   (+0.97%)
> 
> hackbench
> 1 groups       0.906 +/-2.34%        0.906 +/-2.88%   (+0.06%)
> 
> - large arm64 2 nodes / 224 cores system
> 
>            tip/sched/core        w/ this patchset    improvement
> schedpipe     125665 +/-0.61%       125455 +/-0.62%   (-0.17%)
> 
> hackbench -l (256000/#grp) -g #grp
> 1 groups      15.263 +/-3.53%       13.776 +/-3.30%   (+9.74%)
> 4 groups       5.852 +/-0.57%        5.340 +/-8.03%   (+8.75%)
> 16 groups      3.097 +/-1.08%        3.246 +/-0.97%   (-4.81%)
> 32 groups      2.882 +/-1.04%        2.845 +/-1.02%   (+1.29%)
> 64 groups      2.809 +/-1.30%        2.712 +/-1.17%   (+3.45%)
> 128 groups     3.129 +/-9.74%        2.813 +/-6.22%   (+9.11%)
> 256 groups     3.559 +/-11.07%       3.020 +/-1.75%  (+15.15%)
> 
> dbench
> 1 groups     330.897 +/-0.27%      330.612 +/-0.77%   (-0.09%)
> 4 groups     932.922 +/-0.54%      941.817 +/*1.10%   (+0.95%)
> 16 groups   1932.346 +/-1.37%     1962.944 +/-0.62%   (+1.58%)
> 32 groups   2251.079 +/-7.93%     2418.531 +/-0.69%   (+7.44%)
> 64 groups   2104.114 +/-9.67%     2348.698 +/-11.24% (+11.62%)
> 128 groups  2093.756 +/-7.26%     2278.156 +/-9.74%   (+8.81%)
> 256 groups  1216.736 +/-2.46%     1665.774 +/-4.68%  (+36.91%)
> 
> tip/sched/core sha1:
>   a1dc0446d649 ('sched/core: Silence a warning in sched_init()')
> 
> Changes since v1:
> - fixed some bugs
> - Used switch case
> - Renamed env->src_grp_type to env->balance_type
> - split patches in smaller ones
> - added comments
> 
> Vincent Guittot (8):
>   sched/fair: clean up asym packing
>   sched/fair: rename sum_nr_running to sum_h_nr_running
>   sched/fair: remove meaningless imbalance calculation
>   sched/fair: rework load_balance
>   sched/fair: use rq->nr_running when balancing load
>   sched/fair: use load instead of runnable load
>   sched/fair: evenly spread tasks when not overloaded
>   sched/fair: use utilization to select misfit task
> 
>  kernel/sched/fair.c  | 769 ++++++++++++++++++++++++++++-----------------------
>  kernel/sched/sched.h |   2 +-
>  2 files changed, 419 insertions(+), 352 deletions(-)
> 
> -- 
> 2.7.4
> 

I keep expecting a v3 so I have not dug into all the patches in detail. However, I've 
been working with them from Vincent's tree while they were under development so I thought 
I'd add some results.

The workload is a test our perf team came up with to illustrate the issues we were seeing
with imbalance in the presence of group scheduling. 

On a 4-numa X 20 cpu system (smt on) we run a 76 thread lu.C benchmark from the NAS Parallel 
suite. And at the same time run 2 stress cpu burn processes.  The GROUP test puts the 
benchmark and the stress processes each in its own cgroup.  The NORMAL case puts them all 
in the first cgroup.  The results show first the average number of threads of each type 
running on each of the numa nodes based on sampling taken during the run.  This is followed
by the lu.C benchmark results. There are 3 runs of GROUP and 2 runs of NORMAL shown.

Before (linux-5.3-rc1+  @  a1dc0446d649)

lu.C.x_76_GROUP_1.stress.ps.numa.hist   Average    0.00  1.00  1.00
lu.C.x_76_GROUP_2.stress.ps.numa.hist   Average    0.00  1.00  1.00
lu.C.x_76_GROUP_3.stress.ps.numa.hist   Average    0.00  1.00  1.00
lu.C.x_76_NORMAL_1.stress.ps.numa.hist  Average    1.15  0.23  0.00  0.62
lu.C.x_76_NORMAL_2.stress.ps.numa.hist  Average    1.67  0.00  0.00  0.33

lu.C.x_76_GROUP_1.ps.numa.hist   Average    30.45  6.95  4.52  34.08
lu.C.x_76_GROUP_2.ps.numa.hist   Average    32.33  8.94  9.21  25.52
lu.C.x_76_GROUP_3.ps.numa.hist   Average    30.45  8.91  12.09  24.55
lu.C.x_76_NORMAL_1.ps.numa.hist  Average    18.54  19.23  19.69  18.54
lu.C.x_76_NORMAL_2.ps.numa.hist  Average    17.25  19.83  20.00  18.92

============76_GROUP========Mop/s===================================
min	q1	median	q3	max
2119.92	2418.1	2716.28	3147.82	3579.36
============76_GROUP========time====================================
min	q1	median	q3	max
569.65	660.155	750.66	856.245	961.83
============76_NORMAL========Mop/s===================================
min	q1	median	q3	max
30424.5	31486.4	31486.4	31486.4	32548.4
============76_NORMAL========time====================================
min	q1	median	q3	max
62.65	64.835	64.835	64.835	67.02


After (linux-5.3-rc1+  @  a1dc0446d649 + this v2 series pulled from 
Vincent's git on ~8/15)

lu.C.x_76_GROUP_1.stress.ps.numa.hist   Average    0.36  1.00  0.64
lu.C.x_76_GROUP_2.stress.ps.numa.hist   Average    1.00  1.00
lu.C.x_76_GROUP_3.stress.ps.numa.hist   Average    1.00  1.00
lu.C.x_76_NORMAL_1.stress.ps.numa.hist  Average    0.23  0.15  0.31  1.31
lu.C.x_76_NORMAL_2.stress.ps.numa.hist  Average    1.00  0.00  0.00  1.00

lu.C.x_76_GROUP_1.ps.numa.hist   Average    18.91  18.36  18.91  19.82
lu.C.x_76_GROUP_2.ps.numa.hist   Average    18.36  18.00  19.91  19.73
lu.C.x_76_GROUP_3.ps.numa.hist   Average    18.17  18.42  19.25  20.17
lu.C.x_76_NORMAL_1.ps.numa.hist  Average    19.08  20.00  18.62  18.31
lu.C.x_76_NORMAL_2.ps.numa.hist  Average    18.09  19.91  19.18  18.82

============76_GROUP========Mop/s===================================
min	q1	median	q3	max
32304.1	33176	34047.9	34166.8	34285.7
============76_GROUP========time====================================
min	q1	median	q3	max
59.47	59.68	59.89	61.505	63.12
============76_NORMAL========Mop/s===================================
min	q1	median	q3	max
29825.5	32454	32454	32454	35082.5
============76_NORMAL========time====================================
min	q1	median	q3	max
58.12	63.24	63.24	63.24	68.36


I had initially tracked this down to two issues. The first was picking the wrong
group in find_busiest_group due to using the average load. The second was in 
fix_small_imbalance(). The "load" of the lu.C tasks was so low it often failed 
to move anything even when it did find a group that was overloaded (nr_running 
> width). I have two small patches which fix this but since Vincent was embarking
on a re-work which also addressed this I dropped them.

We've also run a series of performance tests we use to check for regressions and 
did not find any bad results on our workloads and systems.

So...

Tested-by: Phil Auld <pauld@redhat.com>


Cheers,
Phil
-- 


* Re: [PATCH v2 0/8] sched/fair: rework the CFS load balance
  2019-08-29 19:23 ` [PATCH v2 0/8] sched/fair: rework the CFS load balance Phil Auld
@ 2019-08-30  6:46   ` Vincent Guittot
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-08-30  6:46 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

Hi Phil,

On Thu, 29 Aug 2019 at 21:23, Phil Auld <pauld@redhat.com> wrote:
>
> On Thu, Aug 01, 2019 at 04:40:16PM +0200 Vincent Guittot wrote:
> > Several wrong task placement have been raised with the current load

> >
> > --
> > 2.7.4
> >
>
> I keep expecting a v3 so I have not dug into all the patches in detail. However, I've

v3 is under preparation

> been working with them from Vincent's tree while they were under development so I thought
> I'd add some results.

Yes, thanks for your help.

>
> The workload is a test our perf team came up with to illustrate the issues we were seeing
> with imbalance in the presence of group scheduling.
>
> On a 4-numa X 20 cpu system (smt on) we run a 76 thread lu.C benchmark from the NAS Parallel
> suite. And at the same time run 2 stress cpu burn processes.  The GROUP test puts the
> benchmark and the stress processes each in its own cgroup.  The NORMAL case puts them all
> in the first cgroup.  The results show first the average number of threads of each type
> running on each of the numa nodes based on sampling taken during the run.  This is followed
> by the lu.C benchmark results. There are 3 runs of GROUP and 2 runs of NORMAL shown.
>
> Before (linux-5.3-rc1+  @  a1dc0446d649)
>
> lu.C.x_76_GROUP_1.stress.ps.numa.hist   Average    0.00  1.00  1.00
> lu.C.x_76_GROUP_2.stress.ps.numa.hist   Average    0.00  1.00  1.00
> lu.C.x_76_GROUP_3.stress.ps.numa.hist   Average    0.00  1.00  1.00
> lu.C.x_76_NORMAL_1.stress.ps.numa.hist  Average    1.15  0.23  0.00  0.62
> lu.C.x_76_NORMAL_2.stress.ps.numa.hist  Average    1.67  0.00  0.00  0.33
>
> lu.C.x_76_GROUP_1.ps.numa.hist   Average    30.45  6.95  4.52  34.08
> lu.C.x_76_GROUP_2.ps.numa.hist   Average    32.33  8.94  9.21  25.52
> lu.C.x_76_GROUP_3.ps.numa.hist   Average    30.45  8.91  12.09  24.55
> lu.C.x_76_NORMAL_1.ps.numa.hist  Average    18.54  19.23  19.69  18.54
> lu.C.x_76_NORMAL_2.ps.numa.hist  Average    17.25  19.83  20.00  18.92
>
> ============76_GROUP========Mop/s===================================
> min     q1      median  q3      max
> 2119.92 2418.1  2716.28 3147.82 3579.36
> ============76_GROUP========time====================================
> min     q1      median  q3      max
> 569.65  660.155 750.66  856.245 961.83
> ============76_NORMAL========Mop/s===================================
> min     q1      median  q3      max
> 30424.5 31486.4 31486.4 31486.4 32548.4
> ============76_NORMAL========time====================================
> min     q1      median  q3      max
> 62.65   64.835  64.835  64.835  67.02
>
>
> After (linux-5.3-rc1+  @  a1dc0446d649 + this v2 series pulled from
> Vincent's git on ~8/15)
>
> lu.C.x_76_GROUP_1.stress.ps.numa.hist   Average    0.36  1.00  0.64
> lu.C.x_76_GROUP_2.stress.ps.numa.hist   Average    1.00  1.00
> lu.C.x_76_GROUP_3.stress.ps.numa.hist   Average    1.00  1.00
> lu.C.x_76_NORMAL_1.stress.ps.numa.hist  Average    0.23  0.15  0.31  1.31
> lu.C.x_76_NORMAL_2.stress.ps.numa.hist  Average    1.00  0.00  0.00  1.00
>
> lu.C.x_76_GROUP_1.ps.numa.hist   Average    18.91  18.36  18.91  19.82
> lu.C.x_76_GROUP_2.ps.numa.hist   Average    18.36  18.00  19.91  19.73
> lu.C.x_76_GROUP_3.ps.numa.hist   Average    18.17  18.42  19.25  20.17
> lu.C.x_76_NORMAL_1.ps.numa.hist  Average    19.08  20.00  18.62  18.31
> lu.C.x_76_NORMAL_2.ps.numa.hist  Average    18.09  19.91  19.18  18.82
>
> ============76_GROUP========Mop/s===================================
> min     q1      median  q3      max
> 32304.1 33176   34047.9 34166.8 34285.7
> ============76_GROUP========time====================================
> min     q1      median  q3      max
> 59.47   59.68   59.89   61.505  63.12
> ============76_NORMAL========Mop/s===================================
> min     q1      median  q3      max
> 29825.5 32454   32454   32454   35082.5
> ============76_NORMAL========time====================================
> min     q1      median  q3      max
> 58.12   63.24   63.24   63.24   68.36
>
>
> I had initially tracked this down to two issues. The first was picking the wrong
> group in find_busiest_group due to using the average load. The second was in
> fix_small_imbalance(). The "load" of the lu.C tasks was so low it often failed
> to move anything even when it did find a group that was overloaded (nr_running
> > width). I have two small patches which fix this but since Vincent was embarking
> on a re-work which also addressed this I dropped them.
>
> We've also run a series of performance tests we use to check for regressions and
> did not find any bad results on our workloads and systems.
>
> So...
>
> Tested-by: Phil Auld <pauld@redhat.com>

Thanks for testing

Vincent

>
>
> Cheers,
> Phil
> --


* Re: [PATCH v2 4/8] sched/fair: rework load_balance
  2019-08-29 14:26         ` Vincent Guittot
@ 2019-08-30 14:33           ` Valentin Schneider
  0 siblings, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-08-30 14:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen

On 29/08/2019 15:26, Vincent Guittot wrote:
[...]
>> Seeing how much stuff we already do in just computing the stats, do we
>> really save that much by doing this? I'd expect it to be negligible with
>> modern architectures and all of the OoO/voodoo, but maybe I need a
>> refresher course.
> 
> We are not only running on top/latest architecture
> 

I know, and I'm not going to argue for a mere division either. I think I
made my point.

[...]
>>>>> +     if (busiest->group_type == group_misfit_task) {
>>>>> +             /* Set imbalance to allow misfit task to be balanced. */
>>>>> +             env->balance_type = migrate_misfit;
>>>>> +             env->imbalance = busiest->group_misfit_task_load;
>>>>
>>>> AFAICT we don't ever use this value, other than setting it to 0 in
>>>> detach_tasks(), so what we actually set it to doesn't matter (as long as
>>>> it's > 0).
>>>
>>> not yet.
>>> it's only in patch 8/8 that we check if the tasks fits the cpu's
>>> capacity during the detach_tasks
>>>
>>
>> But that doesn't use env->imbalance, right? With that v3 patch it's just
>> the task util's, so AFAICT my comment still stands.
> 
> no, misfit case keeps using load and imbalance like the current
> implementation in this patch.
> The modifications on the way to handle misfit task are all in patch 8
> 
Right, my reply was a bit too terse. What I meant is that with patch 8 the
value of env->imbalance is irrelevant when dealing with misfit tasks: we
only check the task's utilization in detach_tasks(), and we don't do any
comparison of the task's signals with env->imbalance.

Whether we set the imbalance to the load value and set it to 0 in
detach_tasks() or set it to 1 and decrement it in detach_tasks() gives the
same result. That's why I was saying it conceptually fits with the
migrate_task logic, since we can set the imbalance to 1 (we only want to
migrate one task).
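
Put differently, both forms stop detach_tasks() after exactly one task has
been detached (a simplified sketch of the comparison, not literal kernel
code):

	/* current migrate_misfit handling */
	env->imbalance = busiest->group_misfit_task_load; /* calculate_imbalance() */
	env->imbalance = 0;                               /* detach_tasks(), after the detach */

	/* folded into migrate_task */
	env->imbalance = 1;                               /* calculate_imbalance() */
	env->imbalance--;                                 /* detach_tasks(), after the detach */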



* Re: [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load
       [not found] ` <20190809052124.13016-1-hdanton@sina.com>
@ 2019-09-02 13:07   ` Vincent Guittot
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2019-09-02 13:07 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen

Hi Hillf,

Sorry for the late reply.
I noticed while preparing v3 that I hadn't answered your question.

On Fri, 9 Aug 2019 at 07:21, Hillf Danton <hdanton@sina.com> wrote:
>
>
> On Thu,  1 Aug 2019 16:40:21 +0200 Vincent Guittot wrote:
> >
> > cfs load_balance only takes care of CFS tasks whereas CPUs can be used by
> > other scheduling class. Typically, a CFS task preempted by a RT or deadline
> > task will not get a chance to be pulled on another CPU because the
> > load_balance doesn't take into account tasks from classes.
>
> We can add something accordingly in RT to push cfs tasks to another cpu
> in this direction if the pulling above makes some sense missed long.

The RT class doesn't and can't touch CFS tasks, but the ilb will be kicked
to check whether another CPU can pull the CFS task.

> I doubt we can as we can not do too much about RT tasks on any cpu.
> Nor is busiest cpu selected for load balancing based on a fifo cpuhog.

This patch takes tasks of all classes into account when checking the state
of a group and when trying to balance the number of tasks, but of course
we can only detach and move the CFS tasks in the end.

So if we have 1 RT task and 1 CFS task competing for the same CPU while
another CPU is idle, the CFS task will be pulled during the load_balance
of the idle CPU, which was not the case before.

>
> > Add sum of nr_running in the statistics and use it to detect such
> > situation.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 11 +++++++----
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a8681c3..f05f1ad 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7774,6 +7774,7 @@ struct sg_lb_stats {
> >       unsigned long group_load; /* Total load over the CPUs of the group */
> >       unsigned long group_capacity;
> >       unsigned long group_util; /* Total utilization of the group */
> > +     unsigned int sum_nr_running; /* Nr tasks running in the group */
> >       unsigned int sum_h_nr_running; /* Nr tasks running in the group */
>
> A different comment is appreciated.

ok

>
> >       unsigned int idle_cpus;
> >       unsigned int group_weight;
> > @@ -8007,7 +8008,7 @@ static inline int sg_imbalanced(struct sched_group *group)
> >  static inline bool
> >  group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
> >  {
> > -     if (sgs->sum_h_nr_running < sgs->group_weight)
> > +     if (sgs->sum_nr_running < sgs->group_weight)
> >               return true;
> >
> >       if ((sgs->group_capacity * 100) >
> > @@ -8028,7 +8029,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
> >  static inline bool
> >  group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
> >  {
> > -     if (sgs->sum_h_nr_running <= sgs->group_weight)
> > +     if (sgs->sum_nr_running <= sgs->group_weight)
> >               return false;
> >
> >       if ((sgs->group_capacity * 100) <
> > @@ -8132,6 +8133,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >               sgs->sum_h_nr_running += rq->cfs.h_nr_running;
> >
> >               nr_running = rq->nr_running;
> > +             sgs->sum_nr_running += nr_running;
> > +
> >               if (nr_running > 1)
> >                       *sg_status |= SG_OVERLOAD;
> >
> > @@ -8480,7 +8483,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                        * groups.
> >                        */
> >                       env->balance_type = migrate_task;
> > -                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> > +                     env->imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1;
>
> Can we detach RR tasks?
>
> >                       return;
> >               }
> >
> > @@ -8643,7 +8646,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >
> >       /* Try to move all excess tasks to child's sibling domain */
> >       if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > -         busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> > +         busiest->sum_nr_running > local->sum_nr_running + 1)
> >               goto force_balance;
> >
> >       if (busiest->group_type != group_overloaded &&
> > --
> > 2.7.4
>


end of thread, other threads:[~2019-09-02 13:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 1/8] sched/fair: clean up asym packing Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 2/8] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 3/8] sched/fair: remove meaningless imbalance calculation Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 4/8] sched/fair: rework load_balance Vincent Guittot
2019-08-05 17:07   ` Valentin Schneider
2019-08-26  9:26     ` Vincent Guittot
2019-08-28 10:25       ` Valentin Schneider
2019-08-06 15:56   ` Peter Zijlstra
2019-08-26  9:31     ` Vincent Guittot
2019-08-06 17:17   ` Valentin Schneider
2019-08-07 11:16     ` Valentin Schneider
2019-08-26 10:11     ` Vincent Guittot
2019-08-28 14:19       ` Valentin Schneider
2019-08-29 14:26         ` Vincent Guittot
2019-08-30 14:33           ` Valentin Schneider
2019-08-01 14:40 ` [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 6/8] sched/fair: use load instead of runnable load Vincent Guittot
2019-08-06 16:07   ` Peter Zijlstra
2019-08-26 15:45     ` Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 7/8] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 8/8] sched/fair: use utilization to select misfit task Vincent Guittot
2019-08-01 16:27   ` Valentin Schneider
2019-08-02  8:29     ` Vincent Guittot
2019-08-02 10:49       ` Valentin Schneider
2019-08-02 12:56   ` [PATCH v3] " Vincent Guittot
2019-08-02 14:27     ` Valentin Schneider
2019-08-05 11:01     ` Valentin Schneider
2019-08-29 19:23 ` [PATCH v2 0/8] sched/fair: rework the CFS load balance Phil Auld
2019-08-30  6:46   ` Vincent Guittot
     [not found] ` <20190809052124.13016-1-hdanton@sina.com>
2019-09-02 13:07   ` [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load Vincent Guittot
