linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] remove runnable_load_avg and improve group_classify
@ 2020-02-11 17:46 Vincent Guittot
  2020-02-11 17:46 ` [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path Vincent Guittot
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-11 17:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, Vincent Guittot

NUMA load balancing is the last remaining piece of code that uses the 
runnable_load_avg of PELT to balance tasks between nodes. The normal
load_balance has replaced it by a better description of the current state
of the group of cpus.  The same policy can be applied to the numa
balancing.

Once unused, runnable_load_avg can be replaced by a simpler runnable_avg
signal that tracks the waiting time of tasks on rq. Currently, the state
of a group of CPUs is defined thanks to the number of running task and the
level of utilization of rq. But the utilization can be temporarly low
after the migration of a task whereas the rq is still overloaded with
tasks. In such case where tasks were competing for the rq, the
runnable_avg will stay high after the migration.

Some hackbench results:

- small arm64 dual quad cores system
hackbench -l (2560/#grp) -g #grp

grp    tip/sched/core         +patchset              improvement
1       1,327(+/-10,06 %)     1,247(+/-5,45 %)       5,97 %
4       1,250(+/- 2,55 %)     1,207(+/-2,12 %)       3,42 %
8       1,189(+/- 1,47 %)     1,179(+/-1,93 %)       0,90 %
16      1,221(+/- 3,25 %)     1,219(+/-2,44 %)       0,16 %						

- large arm64 2 nodes / 224 cores system
hackbench -l (256000/#grp) -g #grp

grp    tip/sched/core         +patchset              improvement
1      14,197(+/- 2,73 %)     13,917(+/- 2,19 %)     1,98 %
4       6,817(+/- 1,27 %)      6,523(+/-11,96 %)     4,31 %
16      2,930(+/- 1,07 %)      2,911(+/- 1,08 %)     0,66 %
32      2,735(+/- 1,71 %)      2,725(+/- 1,53 %)     0,37 %
64      2,702(+/- 0,32 %)      2,717(+/- 1,07 %)    -0,53 %
128     3,533(+/-14,66 %)     3,123(+/-12,47 %)     11,59 %
256     3,918(+/-19,93 %)     3,390(+/- 5,93 %)     13,47 %

The significant improvement for 128 and 256 should be taken with care
because of some instabilities over iterations without the patchset.

The table below shows figures of the classification of sched group during
load balance (idle, newly or busy lb) with the disribution according to
the number of running tasks for:
    hackbench -l 640 -g 4 on octo cores

                 tip/sched/core  +patchset
state
has spare            3973        1934	
        nr_running					
            0        1965        1858
            1         518          56
            2         369          18
            3         270           2
            4+        851           0
						
fully busy            546        1018	
        nr_running					
            0           0           0
            1         546        1018
            2           0           0
            3           0           0
            4+          0           0
						
overloaded           2109        3056	
        nr_running					
            0           0           0
            1           0           0
            2         241         483
            3         170         348
            4+       1698        2225

total                6628        6008	

Without the patchset, there is a significant number of time that a CPU has
spare capacity with more than 1 running task. Although this is a valid
case, this is not a state that should often happen when 160 tasks are
competing on 8 cores like for this test. The patchset fixes the situation
by taking into account the runnable_avg, which stays high after the
migration of a task on another CPU.

Vincent Guittot (4):
  sched/fair: reorder enqueue/dequeue_task_fair path
  sched/numa: replace runnable_load_avg by load_avg
  sched/fair: replace runnable load average by runnable average
  sched/fair: Take into runnable_avg to classify group

 include/linux/sched.h |  17 ++-
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |  17 +--
 kernel/sched/fair.c   | 335 ++++++++++++++++++++++--------------------
 kernel/sched/pelt.c   |  45 +++---
 kernel/sched/sched.h  |  29 +++-
 6 files changed, 241 insertions(+), 204 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path
  2020-02-11 17:46 [PATCH 0/4] remove runnable_load_avg and improve group_classify Vincent Guittot
@ 2020-02-11 17:46 ` Vincent Guittot
  2020-02-12 13:20   ` Mel Gorman
  2020-02-11 17:46 ` [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg Vincent Guittot
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2020-02-11 17:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, Vincent Guittot

The walk through the cgroup hierarchy during the enqueue/dequeue of a task
is split in 2 distinct parts for throttled cfs_rq without any added value
but making code less readable.

Change the code ordering such that everything related to a cfs_rq
(throttled or not) will be done in the same loop.

In addition, the same steps ordering is used when updating a cfs_rq:
- update_load_avg
- update_cfs_group
- update *h_nr_running

No functional and performance changes are expected and have been noticed
during tests.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a0ce83e835a..a1ea02b5362e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5259,32 +5259,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running increment below.
-		 */
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto enqueue_throttle;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 	}
 
+enqueue_throttle:
 	if (!se) {
 		add_nr_running(rq, 1);
 		/*
@@ -5345,17 +5344,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running decrement below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto dequeue_throttle;
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5373,16 +5368,19 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 	}
 
+dequeue_throttle:
 	if (!se)
 		sub_nr_running(rq, 1);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-11 17:46 [PATCH 0/4] remove runnable_load_avg and improve group_classify Vincent Guittot
  2020-02-11 17:46 ` [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path Vincent Guittot
@ 2020-02-11 17:46 ` Vincent Guittot
  2020-02-12 13:37   ` Mel Gorman
  2020-02-11 17:46 ` [RFC 3/4] sched/fair: replace runnable load average by runnable average Vincent Guittot
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2020-02-11 17:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, Vincent Guittot

Similarly to what has been done for the normal load balance, we can
replace runnable_load_avg by load_avg in numa load balancing and track
other statistics like the utilization and the number of running tasks to
get to better view of the current state of a node.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 102 ++++++++++++++++++++++++++++++++------------
 1 file changed, 75 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1ea02b5362e..6e4c2b012c48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
-
-static unsigned long cpu_runnable_load(struct rq *rq)
-{
-	return cfs_rq_runnable_load_avg(&rq->cfs);
-}
+/*
+ * 'numa_type' describes the node at the moment of load balancing.
+ */
+enum numa_type {
+	/* The node has spare capacity that can be used to run more tasks.  */
+	node_has_spare = 0,
+	/*
+	 * The node is fully used and the tasks don't compete for more CPU
+	 * cycles. Nevertheless, some tasks might wait before running.
+	 */
+	node_fully_busy,
+	/*
+	 * The node is overloaded and can't provide expected CPU cycles to all
+	 * tasks.
+	 */
+	node_overloaded
+};
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
 	unsigned long load;
-
+	unsigned long util;
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
+	unsigned int nr_running;
+	unsigned int weight;
+	enum numa_type node_type;
 };
 
-/*
- * XXX borrowed from update_sg_lb_stats
- */
-static void update_numa_stats(struct numa_stats *ns, int nid)
-{
-	int cpu;
-
-	memset(ns, 0, sizeof(*ns));
-	for_each_cpu(cpu, cpumask_of_node(nid)) {
-		struct rq *rq = cpu_rq(cpu);
-
-		ns->load += cpu_runnable_load(rq);
-		ns->compute_capacity += capacity_of(cpu);
-	}
-
-}
-
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -1521,6 +1518,47 @@ struct task_numa_env {
 	int best_cpu;
 };
 
+static unsigned long cpu_load(struct rq *rq);
+static unsigned long cpu_util(int cpu);
+
+static inline enum
+numa_type numa_classify(unsigned int imbalance_pct,
+			 struct numa_stats *ns)
+{
+	if ((ns->nr_running > ns->weight) &&
+	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
+		return node_overloaded;
+
+	if ((ns->nr_running < ns->weight) ||
+	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
+		return node_has_spare;
+
+	return node_fully_busy;
+}
+
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct task_numa_env *env,
+			      struct numa_stats *ns, int nid)
+{
+	int cpu;
+
+	memset(ns, 0, sizeof(*ns));
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		ns->load += cpu_load(rq);
+		ns->util += cpu_util(cpu);
+		ns->nr_running += rq->cfs.h_nr_running;
+		ns->compute_capacity += capacity_of(cpu);
+	}
+
+	ns->weight = cpumask_weight(cpumask_of_node(nid));
+
+	ns->node_type = numa_classify(env->imbalance_pct, ns);
+}
+
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
@@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
+
+	/* If dst node has spare capacity, there is no real load imbalance */
+	if (env->dst_stats.node_type == node_has_spare)
+		return false;
+
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
 	taskweight = task_weight(p, env.src_nid, dist);
 	groupweight = group_weight(p, env.src_nid, dist);
-	update_numa_stats(&env.src_stats, env.src_nid);
+	update_numa_stats(&env, &env.src_stats, env.src_nid);
 	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
 	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
-	update_numa_stats(&env.dst_stats, env.dst_nid);
+	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
 	task_numa_find_cpu(&env, taskimp, groupimp);
@@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 			env.dist = dist;
 			env.dst_nid = nid;
-			update_numa_stats(&env.dst_stats, env.dst_nid);
+			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
@@ -5446,6 +5489,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
+static unsigned long cpu_runnable_load(struct rq *rq)
+{
+	return cfs_rq_runnable_load_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC 3/4] sched/fair: replace runnable load average by runnable average
  2020-02-11 17:46 [PATCH 0/4] remove runnable_load_avg and improve group_classify Vincent Guittot
  2020-02-11 17:46 ` [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path Vincent Guittot
  2020-02-11 17:46 ` [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg Vincent Guittot
@ 2020-02-11 17:46 ` Vincent Guittot
  2020-02-12 14:30   ` Mel Gorman
  2020-02-13 17:36   ` Peter Zijlstra
  2020-02-11 17:46 ` [RFC 4/4] sched/fair: Take into runnable_avg to classify group Vincent Guittot
  2020-02-11 21:04 ` [PATCH 0/4] remove runnable_load_avg and improve group_classify Mel Gorman
  4 siblings, 2 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-11 17:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, Vincent Guittot

Now that runnable_load_avg is not more used, we can replace it by a new
signal that will highlight the runnable pressure on a cfs_rq. This signal
track the waiting time of tasks on rq and can help to better define the
state of rqs.

At now, only util_avg is used to define the state of a rq:
  A rq with more that around 80% of utilization and more than 1 tasks is
  considered as overloaded.

But the util_avg signal of a rq can become temporaly low after that a task
migrated onto another rq which can bias the classification of the rq.

When tasks compete for the same rq, their runnable average signal will be
higher than util_avg as it will include the waiting time and we can use
this signal to better classify cfs_rqs.

The new runnable_avg will track the runnable time of a task which simply
adds the waiting time to the running time. The runnbale _avg of cfs_rq
will be the /Sum of se's runnable_avg and the runnable_avg of group entity
will follow the one of the rq similarly to util_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  17 +++-
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |  17 ++--
 kernel/sched/fair.c   | 185 ++++++++++++++++--------------------------
 kernel/sched/pelt.c   |  45 +++++-----
 kernel/sched/sched.h  |  29 +++++--
 6 files changed, 138 insertions(+), 157 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 467d26046416..a49db03d56bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -357,7 +357,7 @@ struct util_est {
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
- * (see __update_load_avg() in kernel/sched/fair.c).
+ * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
@@ -367,6 +367,14 @@ struct util_est {
  * For cfs_rq, it is the aggregated load_avg of all runnable and
  * blocked sched_entities.
  *
+ * [runnable_avg definition]
+ *
+ *   runnable_avg = runnable%
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated runnable_avg of all runnable and
+ * blocked sched_entities.
+ *
  * [util_avg definition]
  *
  *   util_avg = running% * SCHED_CAPACITY_SCALE
@@ -401,11 +409,11 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
-	u64				runnable_load_sum;
+	u64				runnable_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
-	unsigned long			runnable_load_avg;
+	unsigned long			runnable_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -449,7 +457,6 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
-	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
@@ -470,6 +477,8 @@ struct sched_entity {
 	struct cfs_rq			*cfs_rq;
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq			*my_q;
+	/* cached value of my_q->h_nr_running */
+	unsigned long			runnable_weight;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 45f79bcc3146..5d1bec3d8ce7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -761,7 +761,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	if (task_has_idle_policy(p)) {
 		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
-		p->se.runnable_weight = load->weight;
 		return;
 	}
 
@@ -774,7 +773,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	} else {
 		load->weight = scale_load(sched_prio_to_weight[prio]);
 		load->inv_weight = sched_prio_to_wmult[prio];
-		p->se.runnable_weight = load->weight;
 	}
 }
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 879d3ccf3806..8331bc04aea2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -402,11 +402,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	}
 
 	P(se->load.weight);
-	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
-	P(se->avg.runnable_load_avg);
+	P(se->avg.runnable_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,11 +523,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
-	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->avg.runnable_load_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_avg",
+			cfs_rq->avg.runnable_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -537,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-			cfs_rq->removed.runnable_sum);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_avg",
+			cfs_rq->removed.runnable_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
@@ -947,13 +945,12 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
-	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
-	P(se.avg.runnable_load_sum);
+	P(se.avg.runnable_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
-	P(se.avg.runnable_load_avg);
+	P(se.avg.runnable_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e4c2b012c48..7d7cb207be30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,10 +740,10 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * Group entities are initialized with zero load to reflect the fact that
 	 * nothing has been attached to the task group yet.
 	 */
-	if (entity_is_task(se))
-		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-
-	se->runnable_weight = se->load.weight;
+	if (entity_is_task(se)) {
+		sa->runnable_avg = SCHED_CAPACITY_SCALE;
+		sa->load_avg = scale_load_down(se->load.weight);
+	}
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -2877,25 +2877,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 } while (0)
 
 #ifdef CONFIG_SMP
-static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight += se->runnable_weight;
-
-	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
-	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
-}
-
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight -= se->runnable_weight;
-
-	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
-	sub_positive(&cfs_rq->avg.runnable_load_sum,
-		     se_runnable(se) * se->avg.runnable_load_sum);
-}
-
 static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -2911,28 +2892,22 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 #else
 static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight, unsigned long runnable)
+			    unsigned long weight)
 {
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
 		account_entity_dequeue(cfs_rq, se);
-		dequeue_runnable_load_avg(cfs_rq, se);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
@@ -2940,15 +2915,12 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-		se->avg.runnable_load_avg =
-			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
 	} while (0);
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
 		account_entity_enqueue(cfs_rq, se);
-		enqueue_runnable_load_avg(cfs_rq, se);
 	}
 }
 
@@ -2959,7 +2931,7 @@ void reweight_task(struct task_struct *p, int prio)
 	struct load_weight *load = &se->load;
 	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
 
-	reweight_entity(cfs_rq, se, weight, weight);
+	reweight_entity(cfs_rq, se, weight);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -3071,50 +3043,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 	 */
 	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
-
-/*
- * This calculates the effective runnable weight for a group entity based on
- * the group entity weight calculated above.
- *
- * Because of the above approximation (2), our group entity weight is
- * an load_avg based ratio (3). This means that it includes blocked load and
- * does not represent the runnable weight.
- *
- * Approximate the group entity's runnable weight per ratio from the group
- * runqueue:
- *
- *					     grq->avg.runnable_load_avg
- *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
- *						 grq->avg.load_avg
- *
- * However, analogous to above, since the avg numbers are slow, this leads to
- * transients in the from-idle case. Instead we use:
- *
- *   ge->runnable_weight = ge->load.weight *
- *
- *		max(grq->avg.runnable_load_avg, grq->runnable_weight)
- *		-----------------------------------------------------	(8)
- *		      max(grq->avg.load_avg, grq->load.weight)
- *
- * Where these max() serve both to use the 'instant' values to fix the slow
- * from-idle and avoid the /0 on to-idle, similar to (6).
- */
-static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
-{
-	long runnable, load_avg;
-
-	load_avg = max(cfs_rq->avg.load_avg,
-		       scale_load_down(cfs_rq->load.weight));
-
-	runnable = max(cfs_rq->avg.runnable_load_avg,
-		       scale_load_down(cfs_rq->runnable_weight));
-
-	runnable *= shares;
-	if (load_avg)
-		runnable /= load_avg;
-
-	return clamp_t(long, runnable, MIN_SHARES, shares);
-}
 #endif /* CONFIG_SMP */
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -3126,7 +3054,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long shares, runnable;
+	long shares;
 
 	if (!gcfs_rq)
 		return;
@@ -3135,16 +3063,15 @@ static void update_cfs_group(struct sched_entity *se)
 		return;
 
 #ifndef CONFIG_SMP
-	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
+	shares = READ_ONCE(gcfs_rq->tg->shares);
 
 	if (likely(se->load.weight == shares))
 		return;
 #else
 	shares   = calc_group_shares(gcfs_rq);
-	runnable = calc_group_runnable(gcfs_rq, shares);
 #endif
 
-	reweight_entity(cfs_rq_of(se), se, shares, runnable);
+	reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -3269,11 +3196,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial and simply copies the running
- * sum over (but still wrong, because the group entity and group rq do not have
- * their PELT windows aligned).
+ * Per the above update_tg_cfs_util() and update_tg_cfs_runnable() are trivial
+ * and simply copies the running/runnable sum over (but still wrong, because
+ * the group entity and group rq do not have their PELT windows aligned).
  *
- * However, update_tg_cfs_runnable() is more complex. So we have:
+ * However, update_tg_cfs_load() is more complex. So we have:
  *
  *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
  *
@@ -3355,10 +3282,36 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+{
+	long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
+	/* Set new sched_entity's runnable */
+	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
+	se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq runnable */
+	add_positive(&cfs_rq->avg.runnable_avg, delta);
+	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
+}
+
+static inline void
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
-	unsigned long runnable_load_avg, load_avg;
-	u64 runnable_load_sum, load_sum = 0;
+	unsigned long load_avg;
+	u64 load_sum = 0;
 	s64 delta_sum;
 
 	if (!runnable_sum)
@@ -3406,20 +3359,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
 	se->avg.load_avg = load_avg;
 	add_positive(&cfs_rq->avg.load_avg, delta_avg);
 	add_positive(&cfs_rq->avg.load_sum, delta_sum);
-
-	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
-	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
-
-	if (se->on_rq) {
-		delta_sum = runnable_load_sum -
-				se_weight(se) * se->avg.runnable_load_sum;
-		delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
-		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
-	}
-
-	se->avg.runnable_load_sum = runnable_sum;
-	se->avg.runnable_load_avg = runnable_load_avg;
 }
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3448,6 +3387,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
 	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
 	trace_pelt_se_tp(se);
@@ -3517,7 +3457,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
+	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3528,7 +3468,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
-		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
+		swap(cfs_rq->removed.runnable_avg, removed_runnable);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3540,7 +3480,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * divider);
 
-		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
+		r = removed_runnable;
+		sub_positive(&sa->runnable_avg, r);
+		sub_positive(&sa->runnable_sum, r * divider);
+
+		/*
+		 * removed_runnable is the unweighted version of removed_load so we
+		 * can use it to estimate removed_load_sum.
+		 */
+		add_tg_cfs_propagate(cfs_rq,
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
 
 		decayed = 1;
 	}
@@ -3586,17 +3535,19 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 */
 	se->avg.util_sum = se->avg.util_avg * divider;
 
+	se->avg.runnable_sum = se->avg.runnable_avg * divider;
+
 	se->avg.load_sum = divider;
 	if (se_weight(se)) {
 		se->avg.load_sum =
 			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
 	}
 
-	se->avg.runnable_load_sum = se->avg.load_sum;
-
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	cfs_rq->avg.runnable_avg += se->avg.runnable_avg;
+	cfs_rq->avg.runnable_sum += se->avg.runnable_sum;
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
@@ -3618,6 +3569,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	dequeue_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
+	sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
@@ -3724,13 +3677,13 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
-	cfs_rq->removed.runnable_sum	+= se->avg.load_sum; /* == runnable_sum */
+	cfs_rq->removed.runnable_avg	+= se->avg.runnable_avg;
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
+static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->avg.runnable_load_avg;
+	return cfs_rq->avg.runnable_avg;
 }
 
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
@@ -4065,8 +4018,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
+	se_update_runnable(se);
 	update_cfs_group(se);
-	enqueue_runnable_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -4149,7 +4102,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
-	dequeue_runnable_load_avg(cfs_rq, se);
+	se_update_runnable(se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -5320,6 +5273,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running++;
@@ -5417,6 +5371,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running--;
@@ -5489,9 +5444,9 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
-static unsigned long cpu_runnable_load(struct rq *rq)
+static unsigned long cpu_runnable(struct rq *rq)
 {
-	return cfs_rq_runnable_load_avg(&rq->cfs);
+	return cfs_rq_runnable_avg(&rq->cfs);
 }
 
 static unsigned long capacity_of(int cpu)
@@ -7597,7 +7552,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->avg.runnable_load_sum)
+	if (cfs_rq->avg.runnable_sum)
 		return false;
 
 	return true;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index bd006b79b360..d6ec21450ffa 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -121,8 +121,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
+		sa->runnable_sum =
+			decay_load(sa->runnable_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -149,7 +149,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	if (load)
 		sa->load_sum += load * contrib;
 	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
+		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -246,7 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
+	sa->runnable_avg =	div_u64(runnable * sa->runnable_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -254,33 +254,34 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  * sched_entity:
  *
  *   task:
- *     se_runnable() == se_weight()
+ *     se_weight()   = se->load.weight
+ *     se_runnable() = !!on_rq
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
- *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *     se_runnable() = grq->h_nr_running
  *
- *   load_sum := runnable_sum
- *   load_avg = se_weight(se) * runnable_avg
+ *   runnable_sum = se_runnable() * runnable = grq->runnable_sum
+ *   runnable_avg = runnable_sum
  *
- *   runnable_load_sum := runnable_sum
- *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *   load_sum := runnable
+ *   load_avg = se_weight(se) * load_sum
  *
  * XXX collapse load_sum and runnable_load_sum
  *
  * cfq_rq:
  *
+ *   runnable_sum = \Sum se->avg.runnable_sum
+ *   runnable_avg = \Sum se->avg.runnable_avg
+ *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
- *
- *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
 	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		___update_load_avg(&se->avg, se_weight(se), 1);
 		trace_pelt_se_tp(se);
 		return 1;
 	}
@@ -290,10 +291,10 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
 				cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		___update_load_avg(&se->avg, se_weight(se), 1);
 		cfs_se_util_change(&se->avg);
 		trace_pelt_se_tp(se);
 		return 1;
@@ -306,7 +307,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
+				cfs_rq->h_nr_running,
 				cfs_rq->curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1, 1);
@@ -322,7 +323,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
  *
  *   load_avg and runnable_load_avg are not supported and meaningless.
  *
@@ -348,7 +349,9 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
@@ -373,7 +376,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0844e81964e5..9811563df9b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -489,7 +489,6 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
-	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
@@ -528,7 +527,7 @@ struct cfs_rq {
 		int		nr;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
-		unsigned long	runnable_sum;
+		unsigned long	runnable_avg;
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -688,8 +687,30 @@ struct dl_rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
+
+static inline void se_update_runnable(struct sched_entity *se)
+{
+	if (!entity_is_task(se))
+		se->runnable_weight = se->my_q->h_nr_running;
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return !!se->on_rq;
+	else
+		return se->runnable_weight;
+}
+
 #else
 #define entity_is_task(se)	1
+
+static inline void se_update_runnable(struct sched_entity *se) {}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	return !!se->on_rq;
+}
 #endif
 
 #ifdef CONFIG_SMP
@@ -701,10 +722,6 @@ static inline long se_weight(struct sched_entity *se)
 	return scale_load_down(se->load.weight);
 }
 
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
 
 static inline bool sched_asym_prefer(int a, int b)
 {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC 4/4] sched/fair: Take into runnable_avg to classify group
  2020-02-11 17:46 [PATCH 0/4] remove runnable_load_avg and improve group_classify Vincent Guittot
                   ` (2 preceding siblings ...)
  2020-02-11 17:46 ` [RFC 3/4] sched/fair: replace runnable load average by runnable average Vincent Guittot
@ 2020-02-11 17:46 ` Vincent Guittot
  2020-02-13 18:32   ` Valentin Schneider
  2020-02-11 21:04 ` [PATCH 0/4] remove runnable_load_avg and improve group_classify Mel Gorman
  4 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2020-02-11 17:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, Vincent Guittot

Take into account the new runnable_avg signal to classify a group and to
mitigate the volatility of util_avg in face of intensive migration.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d7cb207be30..5f8f12c902d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7691,6 +7691,7 @@ struct sg_lb_stats {
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long group_capacity;
 	unsigned long group_util; /* Total utilization of the group */
+	unsigned long group_runnable; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr of tasks running in the group */
 	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
 	unsigned int idle_cpus;
@@ -7911,6 +7912,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 	if (sgs->sum_nr_running < sgs->group_weight)
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return false;
+
 	if ((sgs->group_capacity * 100) >
 			(sgs->group_util * imbalance_pct))
 		return true;
@@ -7936,6 +7941,10 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 			(sgs->group_util * imbalance_pct))
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return true;
+
 	return false;
 }
 
@@ -8030,6 +8039,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
+		sgs->group_runnable += cpu_runnable(rq);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] remove runnable_load_avg and improve group_classify
  2020-02-11 17:46 [PATCH 0/4] remove runnable_load_avg and improve group_classify Vincent Guittot
                   ` (3 preceding siblings ...)
  2020-02-11 17:46 ` [RFC 4/4] sched/fair: Take into runnable_avg to classify group Vincent Guittot
@ 2020-02-11 21:04 ` Mel Gorman
  2020-02-12  8:16   ` Vincent Guittot
  4 siblings, 1 reply; 32+ messages in thread
From: Mel Gorman @ 2020-02-11 21:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider

On Tue, Feb 11, 2020 at 06:46:47PM +0100, Vincent Guittot wrote:
> NUMA load balancing is the last remaining piece of code that uses the 
> runnable_load_avg of PELT to balance tasks between nodes. The normal
> load_balance has replaced it by a better description of the current state
> of the group of cpus.  The same policy can be applied to the numa
> balancing.
> 
> Once unused, runnable_load_avg can be replaced by a simpler runnable_avg
> signal that tracks the waiting time of tasks on rq. Currently, the state
> of a group of CPUs is defined thanks to the number of running task and the
> level of utilization of rq. But the utilization can be temporarly low
> after the migration of a task whereas the rq is still overloaded with
> tasks. In such case where tasks were competing for the rq, the
> runnable_avg will stay high after the migration.
> 
> Some hackbench results:
> 
> - small arm64 dual quad cores system
> hackbench -l (2560/#grp) -g #grp
> 
> grp    tip/sched/core         +patchset              improvement
> 1       1,327(+/-10,06 %)     1,247(+/-5,45 %)       5,97 %
> 4       1,250(+/- 2,55 %)     1,207(+/-2,12 %)       3,42 %
> 8       1,189(+/- 1,47 %)     1,179(+/-1,93 %)       0,90 %
> 16      1,221(+/- 3,25 %)     1,219(+/-2,44 %)       0,16 %						
> 
> - large arm64 2 nodes / 224 cores system
> hackbench -l (256000/#grp) -g #grp
> 
> grp    tip/sched/core         +patchset              improvement
> 1      14,197(+/- 2,73 %)     13,917(+/- 2,19 %)     1,98 %
> 4       6,817(+/- 1,27 %)      6,523(+/-11,96 %)     4,31 %
> 16      2,930(+/- 1,07 %)      2,911(+/- 1,08 %)     0,66 %
> 32      2,735(+/- 1,71 %)      2,725(+/- 1,53 %)     0,37 %
> 64      2,702(+/- 0,32 %)      2,717(+/- 1,07 %)    -0,53 %
> 128     3,533(+/-14,66 %)     3,123(+/-12,47 %)     11,59 %
> 256     3,918(+/-19,93 %)     3,390(+/- 5,93 %)     13,47 %
> 

I haven't reviewed this yet because by co-incidence I'm finalising a
series that tries to reconcile the load balancer with the NUMA balancer
and it has been very tricky to get right.  One aspect though is that
hackbench is generally not long-running enough to detect any performance
regressions in NUMA balancing. At least I've never observed it to be a
good evaluation for NUMA balancing.

> Without the patchset, there is a significant number of time that a CPU has
> spare capacity with more than 1 running task. Although this is a valid
> case, this is not a state that should often happen when 160 tasks are
> competing on 8 cores like for this test. The patchset fixes the situation
> by taking into account the runnable_avg, which stays high after the
> migration of a task on another CPU.
> 

FWIW, during the rewrite, I ended up moving away from runnable_load to
get the load balancer and NUMA balancer to use the same metrics.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] remove runnable_load_avg and improve group_classify
  2020-02-11 21:04 ` [PATCH 0/4] remove runnable_load_avg and improve group_classify Mel Gorman
@ 2020-02-12  8:16   ` Vincent Guittot
  0 siblings, 0 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-12  8:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Tue, 11 Feb 2020 at 22:04, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:47PM +0100, Vincent Guittot wrote:
> > NUMA load balancing is the last remaining piece of code that uses the
> > runnable_load_avg of PELT to balance tasks between nodes. The normal
> > load_balance has replaced it by a better description of the current state
> > of the group of cpus.  The same policy can be applied to the numa
> > balancing.
> >
> > Once unused, runnable_load_avg can be replaced by a simpler runnable_avg
> > signal that tracks the waiting time of tasks on rq. Currently, the state
> > of a group of CPUs is defined thanks to the number of running task and the
> > level of utilization of rq. But the utilization can be temporarly low
> > after the migration of a task whereas the rq is still overloaded with
> > tasks. In such case where tasks were competing for the rq, the
> > runnable_avg will stay high after the migration.
> >
> > Some hackbench results:
> >
> > - small arm64 dual quad cores system
> > hackbench -l (2560/#grp) -g #grp
> >
> > grp    tip/sched/core         +patchset              improvement
> > 1       1,327(+/-10,06 %)     1,247(+/-5,45 %)       5,97 %
> > 4       1,250(+/- 2,55 %)     1,207(+/-2,12 %)       3,42 %
> > 8       1,189(+/- 1,47 %)     1,179(+/-1,93 %)       0,90 %
> > 16      1,221(+/- 3,25 %)     1,219(+/-2,44 %)       0,16 %
> >
> > - large arm64 2 nodes / 224 cores system
> > hackbench -l (256000/#grp) -g #grp
> >
> > grp    tip/sched/core         +patchset              improvement
> > 1      14,197(+/- 2,73 %)     13,917(+/- 2,19 %)     1,98 %
> > 4       6,817(+/- 1,27 %)      6,523(+/-11,96 %)     4,31 %
> > 16      2,930(+/- 1,07 %)      2,911(+/- 1,08 %)     0,66 %
> > 32      2,735(+/- 1,71 %)      2,725(+/- 1,53 %)     0,37 %
> > 64      2,702(+/- 0,32 %)      2,717(+/- 1,07 %)    -0,53 %
> > 128     3,533(+/-14,66 %)     3,123(+/-12,47 %)     11,59 %
> > 256     3,918(+/-19,93 %)     3,390(+/- 5,93 %)     13,47 %
> >
>
> I haven't reviewed this yet because by co-incidence I'm finalising a
> series that tries to reconcile the load balancer with the NUMA balancer

That's interesting !
This series has been pending for a while and I have finally been able
to send it for review.*

> and it has been very tricky to get right.  One aspect though is that

I have been quite conservative in the policy as my main goal was not
to change all numa policy but mainly to remove the last user of
runnable_load_avg and i don't expect much behavior changes

> hackbench is generally not long-running enough to detect any performance
> regressions in NUMA balancing. At least I've never observed it to be a
> good evaluation for NUMA balancing.
>
> > Without the patchset, there is a significant number of time that a CPU has
> > spare capacity with more than 1 running task. Although this is a valid
> > case, this is not a state that should often happen when 160 tasks are
> > competing on 8 cores like for this test. The patchset fixes the situation
> > by taking into account the runnable_avg, which stays high after the
> > migration of a task on another CPU.
> >
>
> FWIW, during the rewrite, I ended up moving away from runnable_load to
> get the load balancer and NUMA balancer to use the same metrics.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path
  2020-02-11 17:46 ` [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path Vincent Guittot
@ 2020-02-12 13:20   ` Mel Gorman
  2020-02-12 14:47     ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Mel Gorman @ 2020-02-12 13:20 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider

On Tue, Feb 11, 2020 at 06:46:48PM +0100, Vincent Guittot wrote:
> The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> is split in 2 distinct parts for throttled cfs_rq without any added value
> but making code less readable.
> 
> Change the code ordering such that everything related to a cfs_rq
> (throttled or not) will be done in the same loop.
> 
> In addition, the same steps ordering is used when updating a cfs_rq:
> - update_load_avg
> - update_cfs_group
> - update *h_nr_running
> 
> No functional and performance changes are expected and have been noticed
> during tests.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
>  1 file changed, 20 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a0ce83e835a..a1ea02b5362e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5259,32 +5259,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		cfs_rq = cfs_rq_of(se);
>  		enqueue_entity(cfs_rq, se, flags);
>  
> -		/*
> -		 * end evaluation on encountering a throttled cfs_rq
> -		 *
> -		 * note: in the case of encountering a throttled cfs_rq we will
> -		 * post the final h_nr_running increment below.
> -		 */
> -		if (cfs_rq_throttled(cfs_rq))
> -			break;
>  		cfs_rq->h_nr_running++;
>  		cfs_rq->idle_h_nr_running += idle_h_nr_running;
>  
> +		/* end evaluation on encountering a throttled cfs_rq */
> +		if (cfs_rq_throttled(cfs_rq))
> +			goto enqueue_throttle;
> +
>  		flags = ENQUEUE_WAKEUP;
>  	}
>  
>  	for_each_sched_entity(se) {
>  		cfs_rq = cfs_rq_of(se);
> -		cfs_rq->h_nr_running++;
> -		cfs_rq->idle_h_nr_running += idle_h_nr_running;
>  
> +		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			break;
> +			goto enqueue_throttle;
>  
>  		update_load_avg(cfs_rq, se, UPDATE_TG);
>  		update_cfs_group(se);
> +
> +		cfs_rq->h_nr_running++;
> +		cfs_rq->idle_h_nr_running += idle_h_nr_running;
>  	}
>  
> +enqueue_throttle:
>  	if (!se) {
>  		add_nr_running(rq, 1);
>  		/*

I'm having trouble reconciling the patch with the description and the
comments explaining the intent behind the code are unhelpful.

There are two loops before and after your patch -- the first dealing with
sched entities that are not on a runqueue and the second for the remaining
entities that are. The intent appears to be to update the load averages
once the entity is active on a runqueue.

I'm not getting why the changelog says everything related to cfs is
now done in one loop because there are still two. But even if you do
get throttled, it's not clear why you jump to the !se check given that
for_each_sched_entity did not complete. What it *does* appear to do is
have all the h_nr_running related to entities being enqueued updated in
one loop and all remaining entities stats updated in the other.

Following the accounting is tricky. Before the patch, if throttling
happened then h_nr_running was updated without updating the corresponding
nr_running counter in rq. They are out of sync until unthrottle_cfs_rq
is called at the very least. After your patch, the same is true and while
the accounting appears to be equivalent, it's not clear it's correct and
I do not find the code any easier to understand after the patch or how
it's connected to runnable_load_avg which this series is about :(

I think the patch is functionally ok but I am having trouble figuring
out the motive. Maybe it'll be obvious after I read the rest of the series.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-11 17:46 ` [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg Vincent Guittot
@ 2020-02-12 13:37   ` Mel Gorman
  2020-02-12 15:03     ` Vincent Guittot
  2020-02-12 19:49     ` Mel Gorman
  0 siblings, 2 replies; 32+ messages in thread
From: Mel Gorman @ 2020-02-12 13:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider

On Tue, Feb 11, 2020 at 06:46:49PM +0100, Vincent Guittot wrote:
> Similarly to what has been done for the normal load balance, we can
> replace runnable_load_avg by load_avg in numa load balancing and track
> other statistics like the utilization and the number of running tasks to
> get to better view of the current state of a node.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 102 ++++++++++++++++++++++++++++++++------------
>  1 file changed, 75 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a1ea02b5362e..6e4c2b012c48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
>  	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
>  }
>  
> -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> -
> -static unsigned long cpu_runnable_load(struct rq *rq)
> -{
> -	return cfs_rq_runnable_load_avg(&rq->cfs);
> -}
> +/*
> + * 'numa_type' describes the node at the moment of load balancing.
> + */
> +enum numa_type {
> +	/* The node has spare capacity that can be used to run more tasks.  */
> +	node_has_spare = 0,
> +	/*
> +	 * The node is fully used and the tasks don't compete for more CPU
> +	 * cycles. Nevertheless, some tasks might wait before running.
> +	 */
> +	node_fully_busy,
> +	/*
> +	 * The node is overloaded and can't provide expected CPU cycles to all
> +	 * tasks.
> +	 */
> +	node_overloaded
> +};
>  

Ok, to reconcile this with the load balancer, it would need to account
for imbalanced but that's ok in the context of this patch.

>  /* Cached statistics for all CPUs within a node */
>  struct numa_stats {
>  	unsigned long load;
> -
> +	unsigned long util;
>  	/* Total compute capacity of CPUs on a node */
>  	unsigned long compute_capacity;
> +	unsigned int nr_running;
> +	unsigned int weight;
> +	enum numa_type node_type;
>  };
>  
> -/*
> - * XXX borrowed from update_sg_lb_stats
> - */
> -static void update_numa_stats(struct numa_stats *ns, int nid)
> -{
> -	int cpu;
> -
> -	memset(ns, 0, sizeof(*ns));
> -	for_each_cpu(cpu, cpumask_of_node(nid)) {
> -		struct rq *rq = cpu_rq(cpu);
> -
> -		ns->load += cpu_runnable_load(rq);
> -		ns->compute_capacity += capacity_of(cpu);
> -	}
> -
> -}
> -
>  struct task_numa_env {
>  	struct task_struct *p;
>  
> @@ -1521,6 +1518,47 @@ struct task_numa_env {
>  	int best_cpu;
>  };
>  
> +static unsigned long cpu_load(struct rq *rq);
> +static unsigned long cpu_util(int cpu);
> +
> +static inline enum
> +numa_type numa_classify(unsigned int imbalance_pct,
> +			 struct numa_stats *ns)
> +{
> +	if ((ns->nr_running > ns->weight) &&
> +	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> +		return node_overloaded;
> +

Ok, so this is essentially group_is_overloaded.


> +	if ((ns->nr_running < ns->weight) ||
> +	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> +		return node_has_spare;
> +

And this is group_has_capacity. What I did was have a common helper
for both NUMA and normal load balancing and translated the fields from
sg_lb_stats and numa_stats into a common helper. This is to prevent them
getting out of sync. The conversion was incomplete in my case but in
principle, both NUMA and CPU load balancing should use common helpers or
they'll get out of sync.


> +	return node_fully_busy;
> +}
> +
> +/*
> + * XXX borrowed from update_sg_lb_stats
> + */
> +static void update_numa_stats(struct task_numa_env *env,
> +			      struct numa_stats *ns, int nid)
> +{
> +	int cpu;
> +
> +	memset(ns, 0, sizeof(*ns));
> +	for_each_cpu(cpu, cpumask_of_node(nid)) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		ns->load += cpu_load(rq);
> +		ns->util += cpu_util(cpu);
> +		ns->nr_running += rq->cfs.h_nr_running;
> +		ns->compute_capacity += capacity_of(cpu);
> +	}
> +
> +	ns->weight = cpumask_weight(cpumask_of_node(nid));
> +
> +	ns->node_type = numa_classify(env->imbalance_pct, ns);
> +}
> +

Ok, this is more or less what I had except I wedged other stuff in there
too specific to NUMA balancing to avoid multiple walks of the cpumask.

>  static void task_numa_assign(struct task_numa_env *env,
>  			     struct task_struct *p, long imp)
>  {
> @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
>  	long orig_src_load, orig_dst_load;
>  	long src_capacity, dst_capacity;
>  
> +
> +	/* If dst node has spare capacity, there is no real load imbalance */
> +	if (env->dst_stats.node_type == node_has_spare)
> +		return false;
> +

Not exactly what the load balancer thinks though, the load balancer
may decide to balance the tasks between NUMA groups even when there is
capacity. That said, what you did here is compatible with the current code.

I'll want to test this further but in general

Acked-by: Mel Gorman <mgorman@techsingularity.net>

However, I really would have preferred if numa_classify shared common
code with group_[is_overloaded|has_capacity] instead of having
equivalent but different implementations.

>  	/*
>  	 * The load is corrected for the CPU capacity available on each node.
>  	 *
> @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
>  	taskweight = task_weight(p, env.src_nid, dist);
>  	groupweight = group_weight(p, env.src_nid, dist);
> -	update_numa_stats(&env.src_stats, env.src_nid);
> +	update_numa_stats(&env, &env.src_stats, env.src_nid);
>  	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
>  	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
> -	update_numa_stats(&env.dst_stats, env.dst_nid);
> +	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
>  
>  	/* Try to find a spot on the preferred nid. */
>  	task_numa_find_cpu(&env, taskimp, groupimp);
> @@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
>  
>  			env.dist = dist;
>  			env.dst_nid = nid;
> -			update_numa_stats(&env.dst_stats, env.dst_nid);
> +			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
>  			task_numa_find_cpu(&env, taskimp, groupimp);
>  		}
>  	}
> @@ -5446,6 +5489,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
>  	return load;
>  }
>  
> +static unsigned long cpu_runnable_load(struct rq *rq)
> +{
> +	return cfs_rq_runnable_load_avg(&rq->cfs);
> +}
> +
>  static unsigned long capacity_of(int cpu)
>  {
>  	return cpu_rq(cpu)->cpu_capacity;
> -- 
> 2.17.1
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 3/4] sched/fair: replace runnable load average by runnable average
  2020-02-11 17:46 ` [RFC 3/4] sched/fair: replace runnable load average by runnable average Vincent Guittot
@ 2020-02-12 14:30   ` Mel Gorman
  2020-02-14  7:42     ` Vincent Guittot
  2020-02-13 17:36   ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Mel Gorman @ 2020-02-12 14:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider

On Tue, Feb 11, 2020 at 06:46:50PM +0100, Vincent Guittot wrote:
> Now that runnable_load_avg is not more used, we can replace it by a new
> signal that will highlight the runnable pressure on a cfs_rq. This signal
> track the waiting time of tasks on rq and can help to better define the
> state of rqs.
> 
> At now, only util_avg is used to define the state of a rq:
>   A rq with more that around 80% of utilization and more than 1 tasks is
>   considered as overloaded.
> 
> But the util_avg signal of a rq can become temporaly low after that a task
> migrated onto another rq which can bias the classification of the rq.
> 
> When tasks compete for the same rq, their runnable average signal will be
> higher than util_avg as it will include the waiting time and we can use
> this signal to better classify cfs_rqs.
> 
> The new runnable_avg will track the runnable time of a task which simply
> adds the waiting time to the running time. The runnbale _avg of cfs_rq
> will be the /Sum of se's runnable_avg and the runnable_avg of group entity
> will follow the one of the rq similarly to util_avg.
> 

s/runnbale/runnable/

Otherwise, all I can do is give a heads-up that I will not be able to
review this patch and the next patch properly in the short-term. While the
new metric appears to have a sensible definition, I've not spent enough
time comparing/contrasting the pro's and con's of PELT implementation
details or their consequences. I am not confident I can accurately
predict whether this is better or if there are corner cases that make
poor placement decisions based on fast changes of runnable_avg. At least
not within a reasonable amount of time.

This caught my attention though

> @@ -4065,8 +4018,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 *   - Add its new weight to cfs_rq->load.weight
>  	 */
>  	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
> +	se_update_runnable(se);
>  	update_cfs_group(se);
> -	enqueue_runnable_load_avg(cfs_rq, se);
>  	account_entity_enqueue(cfs_rq, se);
>  
>  	if (flags & ENQUEUE_WAKEUP)

I don't think the ordering matters any more because of what was removed
from update_cfs_group. Unfortunately, I'm not 100% confident so am
bringing it to your attention in case it does.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path
  2020-02-12 13:20   ` Mel Gorman
@ 2020-02-12 14:47     ` Vincent Guittot
  2020-02-12 16:11       ` Mel Gorman
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2020-02-12 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Wed, 12 Feb 2020 at 14:20, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:48PM +0100, Vincent Guittot wrote:
> > The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> > is split in 2 distinct parts for throttled cfs_rq without any added value
> > but making code less readable.
> >
> > Change the code ordering such that everything related to a cfs_rq
> > (throttled or not) will be done in the same loop.
> >
> > In addition, the same steps ordering is used when updating a cfs_rq:
> > - update_load_avg
> > - update_cfs_group
> > - update *h_nr_running
> >
> > No functional and performance changes are expected and have been noticed
> > during tests.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
> >  1 file changed, 20 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1a0ce83e835a..a1ea02b5362e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5259,32 +5259,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >               cfs_rq = cfs_rq_of(se);
> >               enqueue_entity(cfs_rq, se, flags);
> >
> > -             /*
> > -              * end evaluation on encountering a throttled cfs_rq
> > -              *
> > -              * note: in the case of encountering a throttled cfs_rq we will
> > -              * post the final h_nr_running increment below.
> > -              */
> > -             if (cfs_rq_throttled(cfs_rq))
> > -                     break;
> >               cfs_rq->h_nr_running++;
> >               cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >
> > +             /* end evaluation on encountering a throttled cfs_rq */
> > +             if (cfs_rq_throttled(cfs_rq))
> > +                     goto enqueue_throttle;
> > +
> >               flags = ENQUEUE_WAKEUP;
> >       }
> >
> >       for_each_sched_entity(se) {
> >               cfs_rq = cfs_rq_of(se);
> > -             cfs_rq->h_nr_running++;
> > -             cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >
> > +             /* end evaluation on encountering a throttled cfs_rq */
> >               if (cfs_rq_throttled(cfs_rq))
> > -                     break;
> > +                     goto enqueue_throttle;
> > AFAICT, there are in tip/sched/core
> >               update_load_avg(cfs_rq, se, UPDATE_TG);
> >               update_cfs_group(se);
> > +
> > +             cfs_rq->h_nr_running++;
> > +             cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >       }
> >
> > +enqueue_throttle:
> >       if (!se) {
> >               add_nr_running(rq, 1);
> >               /*
>
> I'm having trouble reconciling the patch with the description and the
> comments explaining the intent behind the code are unhelpful.
>
> There are two loops before and after your patch -- the first dealing with
> sched entities that are not on a runqueue and the second for the remaining
> entities that are. The intent appears to be to update the load averages
> once the entity is active on a runqueue.
>
> I'm not getting why the changelog says everything related to cfs is
> now done in one loop because there are still two. But even if you do
> get throttled, it's not clear why you jump to the !se check given that
> for_each_sched_entity did not complete. What it *does* appear to do is
> have all the h_nr_running related to entities being enqueued updated in
> one loop and all remaining entities stats updated in the other.

Let's take the example of 2 levels in addition to root so we have :
root->cfs1->cfs2
Now we enqueue a task T1 on cfs2 but cfs1 is throttled, we will have
the sequence:

In 1st for_each_sched_entity loop:
  loop 1
    enqueue_entity (T1->se, cfs2) which calls update load_avg(cfs2)
    cfs2->h_nr_running++;
  loop 2
    enqueue_entity (cfs2->gse, cfs1) which calls update load_avg(cfs1)
    break because cfs1 is throttled

In 2nd for_each_sched_entity loop:
  loop 1
    cfs1->h_nr_running++
    break because throttled

Using the 2nd loop for incrementing h_nr_running of the throttled cfs
is useless and we could do that directly in 1st loop and skip the 2nd
loop

With this patch we have :

In 1st for_each_sched_entity loop:
  loop 1
    enqueue_entity (T1->se, cfs2) which update load_avg(cfs2)
    cfs2->h_nr_running++;
  loop 2
    enqueue_entity (cfs2->gse, cfs1) which update load_avg(cfs1)
    cfs1->h_nr_running++
    skip the 2nd for_each_sched_entity entirely

Then the patch also reorders the call to update_load_avg() and the
increment of h_nr_running

Before the patch we had different order between the to
for_each_sched_entity which is not a problem because there is
currently no relation between both. But the following patches make
PELT using h_nr_running so we must have the same ordering to prevent
updating pelt with the wrong h_nr_running value

>
> Following the accounting is tricky. Before the patch, if throttling
> happened then h_nr_running was updated without updating the corresponding
> nr_running counter in rq. They are out of sync until unthrottle_cfs_rq
> is called at the very least. After your patch, the same is true and while
> the accounting appears to be equivalent, it's not clear it's correct and
> I do not find the code any easier to understand after the patch or how
> it's connected to runnable_load_avg which this series is about :(
>
> I think the patch is functionally ok but I am having trouble figuring
> out the motive. Maybe it'll be obvious after I read the rest of the series.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-12 13:37   ` Mel Gorman
@ 2020-02-12 15:03     ` Vincent Guittot
  2020-02-12 16:04       ` Mel Gorman
  2020-02-12 19:49     ` Mel Gorman
  1 sibling, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2020-02-12 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Wed, 12 Feb 2020 at 14:37, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:49PM +0100, Vincent Guittot wrote:
> > Similarly to what has been done for the normal load balance, we can
> > replace runnable_load_avg by load_avg in numa load balancing and track
> > other statistics like the utilization and the number of running tasks to
> > get to better view of the current state of a node.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 102 ++++++++++++++++++++++++++++++++------------
> >  1 file changed, 75 insertions(+), 27 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a1ea02b5362e..6e4c2b012c48 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> >              group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
> >  }
> >
> > -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> > -
> > -static unsigned long cpu_runnable_load(struct rq *rq)
> > -{
> > -     return cfs_rq_runnable_load_avg(&rq->cfs);
> > -}
> > +/*
> > + * 'numa_type' describes the node at the moment of load balancing.
> > + */
> > +enum numa_type {
> > +     /* The node has spare capacity that can be used to run more tasks.  */
> > +     node_has_spare = 0,
> > +     /*
> > +      * The node is fully used and the tasks don't compete for more CPU
> > +      * cycles. Nevertheless, some tasks might wait before running.
> > +      */
> > +     node_fully_busy,
> > +     /*
> > +      * The node is overloaded and can't provide expected CPU cycles to all
> > +      * tasks.
> > +      */
> > +     node_overloaded
> > +};
> >
>
> Ok, to reconcile this with the load balancer, it would need to account
> for imbalanced but that's ok in the context of this patch.
>
> >  /* Cached statistics for all CPUs within a node */
> >  struct numa_stats {
> >       unsigned long load;
> > -
> > +     unsigned long util;
> >       /* Total compute capacity of CPUs on a node */
> >       unsigned long compute_capacity;
> > +     unsigned int nr_running;
> > +     unsigned int weight;
> > +     enum numa_type node_type;
> >  };
> >
> > -/*
> > - * XXX borrowed from update_sg_lb_stats
> > - */
> > -static void update_numa_stats(struct numa_stats *ns, int nid)
> > -{
> > -     int cpu;
> > -
> > -     memset(ns, 0, sizeof(*ns));
> > -     for_each_cpu(cpu, cpumask_of_node(nid)) {
> > -             struct rq *rq = cpu_rq(cpu);
> > -
> > -             ns->load += cpu_runnable_load(rq);
> > -             ns->compute_capacity += capacity_of(cpu);
> > -     }
> > -
> > -}
> > -
> >  struct task_numa_env {
> >       struct task_struct *p;
> >
> > @@ -1521,6 +1518,47 @@ struct task_numa_env {
> >       int best_cpu;
> >  };
> >
> > +static unsigned long cpu_load(struct rq *rq);
> > +static unsigned long cpu_util(int cpu);
> > +
> > +static inline enum
> > +numa_type numa_classify(unsigned int imbalance_pct,
> > +                      struct numa_stats *ns)
> > +{
> > +     if ((ns->nr_running > ns->weight) &&
> > +         ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> > +             return node_overloaded;
> > +
>
> Ok, so this is essentially group_is_overloaded.
>
>
> > +     if ((ns->nr_running < ns->weight) ||
> > +         ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> > +             return node_has_spare;
> > +
>
> And this is group_has_capacity. What I did was have a common helper
> for both NUMA and normal load balancing and translated the fields from
> sg_lb_stats and numa_stats into a common helper. This is to prevent them
> getting out of sync. The conversion was incomplete in my case but in
> principle, both NUMA and CPU load balancing should use common helpers or
> they'll get out of sync.

I fact, I wanted to keep this patch simple and readable for the 1st
version in order to not afraid people from reviewing it. That's the
main reason I didn't merge it with load_balance but i agree that some
common helper function might be possible.

Also the struct sg_lb_stats has a lot more fields compared to struct numa_stats

Then, I wonder if we could end up with different rules for numa like
taking into account some NUMA specifics metrics to classify the node

>
>
> > +     return node_fully_busy;
> > +}
> > +
> > +/*
> > + * XXX borrowed from update_sg_lb_stats
> > + */
> > +static void update_numa_stats(struct task_numa_env *env,
> > +                           struct numa_stats *ns, int nid)
> > +{
> > +     int cpu;
> > +
> > +     memset(ns, 0, sizeof(*ns));
> > +     for_each_cpu(cpu, cpumask_of_node(nid)) {
> > +             struct rq *rq = cpu_rq(cpu);
> > +
> > +             ns->load += cpu_load(rq);
> > +             ns->util += cpu_util(cpu);
> > +             ns->nr_running += rq->cfs.h_nr_running;
> > +             ns->compute_capacity += capacity_of(cpu);
> > +     }
> > +
> > +     ns->weight = cpumask_weight(cpumask_of_node(nid));
> > +
> > +     ns->node_type = numa_classify(env->imbalance_pct, ns);
> > +}
> > +
>
> Ok, this is more or less what I had except I wedged other stuff in there
> too specific to NUMA balancing to avoid multiple walks of the cpumask.
>
> >  static void task_numa_assign(struct task_numa_env *env,
> >                            struct task_struct *p, long imp)
> >  {
> > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> >       long orig_src_load, orig_dst_load;
> >       long src_capacity, dst_capacity;
> >
> > +
> > +     /* If dst node has spare capacity, there is no real load imbalance */
> > +     if (env->dst_stats.node_type == node_has_spare)
> > +             return false;
> > +
>
> Not exactly what the load balancer thinks though, the load balancer
> may decide to balance the tasks between NUMA groups even when there is
> capacity. That said, what you did here is compatible with the current code.
>
> I'll want to test this further but in general
>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
>
> However, I really would have preferred if numa_classify shared common
> code with group_[is_overloaded|has_capacity] instead of having
> equivalent but different implementations.
>
> >       /*
> >        * The load is corrected for the CPU capacity available on each node.
> >        *
> > @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
> >       dist = env.dist = node_distance(env.src_nid, env.dst_nid);
> >       taskweight = task_weight(p, env.src_nid, dist);
> >       groupweight = group_weight(p, env.src_nid, dist);
> > -     update_numa_stats(&env.src_stats, env.src_nid);
> > +     update_numa_stats(&env, &env.src_stats, env.src_nid);
> >       taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
> >       groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
> > -     update_numa_stats(&env.dst_stats, env.dst_nid);
> > +     update_numa_stats(&env, &env.dst_stats, env.dst_nid);
> >
> >       /* Try to find a spot on the preferred nid. */
> >       task_numa_find_cpu(&env, taskimp, groupimp);
> > @@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
> >
> >                       env.dist = dist;
> >                       env.dst_nid = nid;
> > -                     update_numa_stats(&env.dst_stats, env.dst_nid);
> > +                     update_numa_stats(&env, &env.dst_stats, env.dst_nid);
> >                       task_numa_find_cpu(&env, taskimp, groupimp);
> >               }
> >       }
> > @@ -5446,6 +5489,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
> >       return load;
> >  }
> >
> > +static unsigned long cpu_runnable_load(struct rq *rq)
> > +{
> > +     return cfs_rq_runnable_load_avg(&rq->cfs);
> > +}
> > +
> >  static unsigned long capacity_of(int cpu)
> >  {
> >       return cpu_rq(cpu)->cpu_capacity;
> > --
> > 2.17.1
> >
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-12 15:03     ` Vincent Guittot
@ 2020-02-12 16:04       ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2020-02-12 16:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Wed, Feb 12, 2020 at 04:03:28PM +0100, Vincent Guittot wrote:
> > Ok, so this is essentially group_is_overloaded.
> >
> >
> > > +     if ((ns->nr_running < ns->weight) ||
> > > +         ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> > > +             return node_has_spare;
> > > +
> >
> > And this is group_has_capacity. What I did was have a common helper
> > for both NUMA and normal load balancing and translated the fields from
> > sg_lb_stats and numa_stats into a common helper. This is to prevent them
> > getting out of sync. The conversion was incomplete in my case but in
> > principle, both NUMA and CPU load balancing should use common helpers or
> > they'll get out of sync.
> 
> I fact, I wanted to keep this patch simple and readable for the 1st
> version in order to not afraid people from reviewing it. That's the
> main reason I didn't merge it with load_balance but i agree that some
> common helper function might be possible.
> 

Makes sense.

> Also the struct sg_lb_stats has a lot more fields compared to struct numa_stats
> 

Yes, I considered reusing the same structure and decided against it. I
simply created a common helper. It's trivial enough to do on top after
the fact in the name of clarity. Fundamentally it's cosmetic.

> Then, I wonder if we could end up with different rules for numa like
> taking into account some NUMA specifics metrics to classify the node
> 

Well, we could but right now they should be the same. As it is, the NUMA
balancer and load balancer overrule each other. I think the scope for
changing that without causing regressions is limited.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path
  2020-02-12 14:47     ` Vincent Guittot
@ 2020-02-12 16:11       ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2020-02-12 16:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Wed, Feb 12, 2020 at 03:47:30PM +0100, Vincent Guittot wrote:
> > I'm having trouble reconciling the patch with the description and the
> > comments explaining the intent behind the code are unhelpful.
> >
> > There are two loops before and after your patch -- the first dealing with
> > sched entities that are not on a runqueue and the second for the remaining
> > entities that are. The intent appears to be to update the load averages
> > once the entity is active on a runqueue.
> >
> > I'm not getting why the changelog says everything related to cfs is
> > now done in one loop because there are still two. But even if you do
> > get throttled, it's not clear why you jump to the !se check given that
> > for_each_sched_entity did not complete. What it *does* appear to do is
> > have all the h_nr_running related to entities being enqueued updated in
> > one loop and all remaining entities stats updated in the other.
> 
> Let's take the example of 2 levels in addition to root so we have :
> root->cfs1->cfs2
> Now we enqueue a task T1 on cfs2 but cfs1 is throttled, we will have
> the sequence:
> 
> In 1st for_each_sched_entity loop:
>   loop 1
>     enqueue_entity (T1->se, cfs2) which calls update load_avg(cfs2)
>     cfs2->h_nr_running++;
>   loop 2
>     enqueue_entity (cfs2->gse, cfs1) which calls update load_avg(cfs1)
>     break because cfs1 is throttled
> 
> In 2nd for_each_sched_entity loop:
>   loop 1
>     cfs1->h_nr_running++
>     break because throttled
> 
> Using the 2nd loop for incrementing h_nr_running of the throttled cfs
> is useless and we could do that directly in 1st loop and skip the 2nd
> loop
> 
> With this patch we have :
> 
> In 1st for_each_sched_entity loop:
>   loop 1
>     enqueue_entity (T1->se, cfs2) which update load_avg(cfs2)
>     cfs2->h_nr_running++;
>   loop 2
>     enqueue_entity (cfs2->gse, cfs1) which update load_avg(cfs1)
>     cfs1->h_nr_running++
>     skip the 2nd for_each_sched_entity entirely
> 
> Then the patch also reorders the call to update_load_avg() and the
> increment of h_nr_running
> 
> Before the patch we had different order between the to
> for_each_sched_entity which is not a problem because there is
> currently no relation between both. But the following patches make
> PELT using h_nr_running so we must have the same ordering to prevent
> updating pelt with the wrong h_nr_running value
> 

Ok, understood. Thanks for clearing that up!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-12 13:37   ` Mel Gorman
  2020-02-12 15:03     ` Vincent Guittot
@ 2020-02-12 19:49     ` Mel Gorman
  2020-02-12 21:29       ` Mel Gorman
  2020-02-13  8:05       ` Vincent Guittot
  1 sibling, 2 replies; 32+ messages in thread
From: Mel Gorman @ 2020-02-12 19:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider

On Wed, Feb 12, 2020 at 01:37:15PM +0000, Mel Gorman wrote:
> >  static void task_numa_assign(struct task_numa_env *env,
> >  			     struct task_struct *p, long imp)
> >  {
> > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> >  	long orig_src_load, orig_dst_load;
> >  	long src_capacity, dst_capacity;
> >  
> > +
> > +	/* If dst node has spare capacity, there is no real load imbalance */
> > +	if (env->dst_stats.node_type == node_has_spare)
> > +		return false;
> > +
> 
> Not exactly what the load balancer thinks though, the load balancer
> may decide to balance the tasks between NUMA groups even when there is
> capacity. That said, what you did here is compatible with the current code.
> 
> I'll want to test this further but in general
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 

And I was wrong. I worried that the conflict with the load balancer would
be a problem when I wrote this comment.  That was based on finding that
I had to account for the load balancer when using capacity to decide
whether an idle CPU can be used. When I didn't, performance went to
hell but thought that maybe you had somehow avoided the same problem.
Unfortunately, testing indicates that the same, or similar, problem is
here.

This is specjbb running two JVMs on a 2-socket Haswell machine with 48
cores in total. This is just your first two patches, I have found that
the group classify patches do not recover it.

Hmean     tput-1     39748.93 (   0.00%)    39568.77 (  -0.45%)
Hmean     tput-2     88648.59 (   0.00%)    85302.00 (  -3.78%)
Hmean     tput-3    136285.01 (   0.00%)   131280.85 (  -3.67%)
Hmean     tput-4    181312.69 (   0.00%)   179903.17 (  -0.78%)
Hmean     tput-5    228725.85 (   0.00%)   223499.65 (  -2.28%)
Hmean     tput-6    273246.83 (   0.00%)   268984.52 (  -1.56%)
Hmean     tput-7    317708.89 (   0.00%)   314593.10 (  -0.98%)
Hmean     tput-8    362378.08 (   0.00%)   337205.87 *  -6.95%*
Hmean     tput-9    403792.00 (   0.00%)   349014.39 * -13.57%*
Hmean     tput-10   446000.88 (   0.00%)   379242.88 * -14.97%*
Hmean     tput-11   486188.58 (   0.00%)   341951.27 * -29.67%*
Hmean     tput-12   522288.84 (   0.00%)   388877.07 * -25.54%*
Hmean     tput-13   532394.06 (   0.00%)   437005.34 * -17.92%*
Hmean     tput-14   539440.66 (   0.00%)   467479.65 * -13.34%*
Hmean     tput-15   541843.50 (   0.00%)   455477.03 ( -15.94%)
Hmean     tput-16   546510.71 (   0.00%)   483599.58 ( -11.51%)
Hmean     tput-17   544501.21 (   0.00%)   506189.64 (  -7.04%)
Hmean     tput-18   544802.98 (   0.00%)   535869.05 (  -1.64%)
Hmean     tput-19   545265.27 (   0.00%)   533050.57 (  -2.24%)
Hmean     tput-20   543284.33 (   0.00%)   528100.03 (  -2.79%)
Hmean     tput-21   543375.11 (   0.00%)   528692.97 (  -2.70%)
Hmean     tput-22   542536.60 (   0.00%)   527160.27 (  -2.83%)
Hmean     tput-23   536402.28 (   0.00%)   521023.49 (  -2.87%)
Hmean     tput-24   532307.76 (   0.00%)   519208.85 *  -2.46%*
Stddev    tput-1      1426.23 (   0.00%)      464.57 (  67.43%)
Stddev    tput-2      4437.10 (   0.00%)     1489.17 (  66.44%)
Stddev    tput-3      3021.47 (   0.00%)      750.95 (  75.15%)
Stddev    tput-4      4098.39 (   0.00%)     1678.67 (  59.04%)
Stddev    tput-5      3524.22 (   0.00%)     1025.30 (  70.91%)
Stddev    tput-6      3237.13 (   0.00%)     2198.39 (  32.09%)
Stddev    tput-7      2534.27 (   0.00%)     3261.18 ( -28.68%)
Stddev    tput-8      3847.37 (   0.00%)     4355.78 ( -13.21%)
Stddev    tput-9      5278.55 (   0.00%)     4145.06 (  21.47%)
Stddev    tput-10     5311.08 (   0.00%)    10274.26 ( -93.45%)
Stddev    tput-11     7537.76 (   0.00%)    16882.17 (-123.97%)
Stddev    tput-12     5023.29 (   0.00%)    12735.70 (-153.53%)
Stddev    tput-13     3852.32 (   0.00%)    15385.23 (-299.38%)
Stddev    tput-14    11859.59 (   0.00%)    24273.56 (-104.67%)
Stddev    tput-15    16298.10 (   0.00%)    45409.69 (-178.62%)
Stddev    tput-16     9041.77 (   0.00%)    58839.77 (-550.75%)
Stddev    tput-17     9322.50 (   0.00%)    77205.45 (-728.16%)
Stddev    tput-18    16040.01 (   0.00%)    15021.07 (   6.35%)
Stddev    tput-19     8814.09 (   0.00%)    27509.99 (-212.11%)
Stddev    tput-20     7812.82 (   0.00%)    31578.68 (-304.19%)
Stddev    tput-21     6584.58 (   0.00%)     3639.48 (  44.73%)
Stddev    tput-22     8294.36 (   0.00%)     3033.49 (  63.43%)
Stddev    tput-23     6887.93 (   0.00%)     5450.38 (  20.87%)
Stddev    tput-24     6081.83 (   0.00%)      390.32 (  93.58%)

Note that how it reaches the point where the node is almost utilised
( near tput-12) that performance drops.  Graphing mpstat on a per-node
basis shows there is imbalance in the CPU utilisation between nodes.

NUMA locality is not bad, scanning and migrations are a bit higher but
overall this is a problem.

On a 4-socket machine, I see slightly different results -- there is still
a big drop although not a statistically significant one. However, the
per-node CPU utilisation is skewed as nodes approach being fully utilised.

I see similarly bad results on a 2-socket Broadwell machine.

I don't have other results yet for other workloads but I'm not optimistic
that I'll see anything different. I still think that this probably didn't
show up with hackbench because it's too short-lived for NUMA balancing
to be a factor.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-12 19:49     ` Mel Gorman
@ 2020-02-12 21:29       ` Mel Gorman
  2020-02-13  8:05       ` Vincent Guittot
  1 sibling, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2020-02-12 21:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider

On Wed, Feb 12, 2020 at 07:49:03PM +0000, Mel Gorman wrote:
> On Wed, Feb 12, 2020 at 01:37:15PM +0000, Mel Gorman wrote:
> > >  static void task_numa_assign(struct task_numa_env *env,
> > >  			     struct task_struct *p, long imp)
> > >  {
> > > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> > >  	long orig_src_load, orig_dst_load;
> > >  	long src_capacity, dst_capacity;
> > >  
> > > +
> > > +	/* If dst node has spare capacity, there is no real load imbalance */
> > > +	if (env->dst_stats.node_type == node_has_spare)
> > > +		return false;
> > > +
> > 
> > Not exactly what the load balancer thinks though, the load balancer
> > may decide to balance the tasks between NUMA groups even when there is
> > capacity. That said, what you did here is compatible with the current code.
> > 
> > I'll want to test this further but in general
> > 
> > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> > 
> 
> And I was wrong. I worried that the conflict with the load balancer would
> be a problem when I wrote this comment.  That was based on finding that
> I had to account for the load balancer when using capacity to decide
> whether an idle CPU can be used. When I didn't, performance went to
> hell but thought that maybe you had somehow avoided the same problem.
> Unfortunately, testing indicates that the same, or similar, problem is
> here.
> 

Maybe this on top of your patch? I've slotted it in and will see what
falls out. You may notice a stray hunk related to cpu_runnable_load.
It's because an earlier patch introduces it unnecessarily and only needs
it later in your series. It's a minor cleanup of the overall series but
easiest to illustrate for now.

--8<--
sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity

The standard load balancer generally tries to keep the number of running
tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows
tasks to move if there is spare capacity but this causes a conflict and
utilisation between NUMA nodes gets badly skewed. This patch uses similar
logic between the NUMA balancer and load balancer when deciding if a task
migrating to its preferred node can use an idle CPU.

Signed-off-by: Mel Gorman <mgorman@suse.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9cabfb4fe505..ee80f31ec710 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1520,6 +1520,7 @@ struct task_numa_env {
 
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_util(int cpu);
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1594,11 +1595,6 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
-
-	/* If dst node has spare capacity, there is no real load imbalance */
-	if (env->dst_stats.node_type == node_has_spare)
-		return false;
-
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1757,19 +1753,36 @@ static void task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
-	long src_load, dst_load, load;
 	bool maymove = false;
 	int cpu;
 
-	load = task_h_load(env->p);
-	dst_load = env->dst_stats.load + load;
-	src_load = env->src_stats.load - load;
-
 	/*
-	 * If the improvement from just moving env->p direction is better
-	 * than swapping tasks around, check if a move is possible.
+	 * If dst node has spare capacity, then check if there is an
+	 * imbalance that would be overruled by the load balancer.
 	 */
-	maymove = !load_too_imbalanced(src_load, dst_load, env);
+	if (env->dst_stats.node_type == node_has_spare) {
+		unsigned int imbalance;
+		int src_running, dst_running;
+
+		/* Would movement cause an imbalance? */
+		src_running = env->src_stats.nr_running - 1;
+		dst_running = env->src_stats.nr_running + 1;
+		imbalance = max(0, dst_running - src_running);
+		imbalance = adjust_numa_imbalance(imbalance, src_running);
+
+		/* Use idle CPU if there is no imbalance */
+		maymove = true;
+	} else {
+		long src_load, dst_load, load;
+		/*
+		 * If the improvement from just moving env->p direction is better
+		 * than swapping tasks around, check if a move is possible.
+		 */
+		load = task_h_load(env->p);
+		dst_load = env->dst_stats.load + load;
+		src_load = env->src_stats.load - load;
+		maymove = !load_too_imbalanced(src_load, dst_load, env);
+	}
 
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
@@ -5491,11 +5504,6 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
-static unsigned long cpu_runnable_load(struct rq *rq)
-{
-	return cfs_rq_runnable_load_avg(&rq->cfs);
-}
-
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -8705,6 +8713,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
+{
+	unsigned int imbalance_min;
+
+	/*
+	 * Allow a small imbalance based on a simple pair of communicating
+	 * tasks that remain local when the source domain is almost idle.
+	 */
+	imbalance_min = 2;
+	if (src_nr_running <= imbalance_min)
+		return 0;
+
+	return imbalance;
+}
+
 /**
  * calculate_imbalance - Calculate the amount of imbalance present within the
  *			 groups of a given sched_domain during load balance.
@@ -8801,24 +8824,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA) {
-			unsigned int imbalance_min;
-
-			/*
-			 * Compute an allowed imbalance based on a simple
-			 * pair of communicating tasks that should remain
-			 * local and ignore them.
-			 *
-			 * NOTE: Generally this would have been based on
-			 * the domain size and this was evaluated. However,
-			 * the benefit is similar across a range of workloads
-			 * and machines but scaling by the domain size adds
-			 * the risk that lower domains have to be rebalanced.
-			 */
-			imbalance_min = 2;
-			if (busiest->sum_nr_running <= imbalance_min)
-				env->imbalance = 0;
-		}
+		if (env->sd->flags & SD_NUMA)
+			env->imbalance = adjust_numa_imbalance(env->imbalance,
+						busiest->sum_nr_running);
 
 		return;
 	}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-12 19:49     ` Mel Gorman
  2020-02-12 21:29       ` Mel Gorman
@ 2020-02-13  8:05       ` Vincent Guittot
  2020-02-13  9:24         ` Mel Gorman
       [not found]         ` <20200213131658.9600-1-hdanton@sina.com>
  1 sibling, 2 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-13  8:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Wed, 12 Feb 2020 at 20:49, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Feb 12, 2020 at 01:37:15PM +0000, Mel Gorman wrote:
> > >  static void task_numa_assign(struct task_numa_env *env,
> > >                          struct task_struct *p, long imp)
> > >  {
> > > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> > >     long orig_src_load, orig_dst_load;
> > >     long src_capacity, dst_capacity;
> > >
> > > +
> > > +   /* If dst node has spare capacity, there is no real load imbalance */
> > > +   if (env->dst_stats.node_type == node_has_spare)
> > > +           return false;
> > > +
> >
> > Not exactly what the load balancer thinks though, the load balancer
> > may decide to balance the tasks between NUMA groups even when there is
> > capacity. That said, what you did here is compatible with the current code.
> >
> > I'll want to test this further but in general
> >
> > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> >
>
> And I was wrong. I worried that the conflict with the load balancer would
> be a problem when I wrote this comment.  That was based on finding that
> I had to account for the load balancer when using capacity to decide
> whether an idle CPU can be used. When I didn't, performance went to
> hell but thought that maybe you had somehow avoided the same problem.
> Unfortunately, testing indicates that the same, or similar, problem is
> here.
>
> This is specjbb running two JVMs on a 2-socket Haswell machine with 48
> cores in total. This is just your first two patches, I have found that

thanks for the test results. Unfortunately i haven't specjbb to
analyse the metrics and study what is going wrong

> the group classify patches do not recover it.
>
> Hmean     tput-1     39748.93 (   0.00%)    39568.77 (  -0.45%)
> Hmean     tput-2     88648.59 (   0.00%)    85302.00 (  -3.78%)
> Hmean     tput-3    136285.01 (   0.00%)   131280.85 (  -3.67%)
> Hmean     tput-4    181312.69 (   0.00%)   179903.17 (  -0.78%)
> Hmean     tput-5    228725.85 (   0.00%)   223499.65 (  -2.28%)
> Hmean     tput-6    273246.83 (   0.00%)   268984.52 (  -1.56%)
> Hmean     tput-7    317708.89 (   0.00%)   314593.10 (  -0.98%)
> Hmean     tput-8    362378.08 (   0.00%)   337205.87 *  -6.95%*
> Hmean     tput-9    403792.00 (   0.00%)   349014.39 * -13.57%*
> Hmean     tput-10   446000.88 (   0.00%)   379242.88 * -14.97%*
> Hmean     tput-11   486188.58 (   0.00%)   341951.27 * -29.67%*
> Hmean     tput-12   522288.84 (   0.00%)   388877.07 * -25.54%*
> Hmean     tput-13   532394.06 (   0.00%)   437005.34 * -17.92%*
> Hmean     tput-14   539440.66 (   0.00%)   467479.65 * -13.34%*
> Hmean     tput-15   541843.50 (   0.00%)   455477.03 ( -15.94%)
> Hmean     tput-16   546510.71 (   0.00%)   483599.58 ( -11.51%)
> Hmean     tput-17   544501.21 (   0.00%)   506189.64 (  -7.04%)
> Hmean     tput-18   544802.98 (   0.00%)   535869.05 (  -1.64%)
> Hmean     tput-19   545265.27 (   0.00%)   533050.57 (  -2.24%)
> Hmean     tput-20   543284.33 (   0.00%)   528100.03 (  -2.79%)
> Hmean     tput-21   543375.11 (   0.00%)   528692.97 (  -2.70%)
> Hmean     tput-22   542536.60 (   0.00%)   527160.27 (  -2.83%)
> Hmean     tput-23   536402.28 (   0.00%)   521023.49 (  -2.87%)
> Hmean     tput-24   532307.76 (   0.00%)   519208.85 *  -2.46%*
> Stddev    tput-1      1426.23 (   0.00%)      464.57 (  67.43%)
> Stddev    tput-2      4437.10 (   0.00%)     1489.17 (  66.44%)
> Stddev    tput-3      3021.47 (   0.00%)      750.95 (  75.15%)
> Stddev    tput-4      4098.39 (   0.00%)     1678.67 (  59.04%)
> Stddev    tput-5      3524.22 (   0.00%)     1025.30 (  70.91%)
> Stddev    tput-6      3237.13 (   0.00%)     2198.39 (  32.09%)
> Stddev    tput-7      2534.27 (   0.00%)     3261.18 ( -28.68%)
> Stddev    tput-8      3847.37 (   0.00%)     4355.78 ( -13.21%)
> Stddev    tput-9      5278.55 (   0.00%)     4145.06 (  21.47%)
> Stddev    tput-10     5311.08 (   0.00%)    10274.26 ( -93.45%)
> Stddev    tput-11     7537.76 (   0.00%)    16882.17 (-123.97%)
> Stddev    tput-12     5023.29 (   0.00%)    12735.70 (-153.53%)
> Stddev    tput-13     3852.32 (   0.00%)    15385.23 (-299.38%)
> Stddev    tput-14    11859.59 (   0.00%)    24273.56 (-104.67%)
> Stddev    tput-15    16298.10 (   0.00%)    45409.69 (-178.62%)
> Stddev    tput-16     9041.77 (   0.00%)    58839.77 (-550.75%)
> Stddev    tput-17     9322.50 (   0.00%)    77205.45 (-728.16%)
> Stddev    tput-18    16040.01 (   0.00%)    15021.07 (   6.35%)
> Stddev    tput-19     8814.09 (   0.00%)    27509.99 (-212.11%)
> Stddev    tput-20     7812.82 (   0.00%)    31578.68 (-304.19%)
> Stddev    tput-21     6584.58 (   0.00%)     3639.48 (  44.73%)
> Stddev    tput-22     8294.36 (   0.00%)     3033.49 (  63.43%)
> Stddev    tput-23     6887.93 (   0.00%)     5450.38 (  20.87%)
> Stddev    tput-24     6081.83 (   0.00%)      390.32 (  93.58%)
>
> Note that how it reaches the point where the node is almost utilised
> ( near tput-12) that performance drops.  Graphing mpstat on a per-node
> basis shows there is imbalance in the CPU utilisation between nodes.

ok. So the has_spare_capacity condition that i added in
load_too_imbalanced() is probably to aggressive although the goal was
to let task moving on their preferred node

>
> NUMA locality is not bad, scanning and migrations are a bit higher but
> overall this is a problem.
>
> On a 4-socket machine, I see slightly different results -- there is still
> a big drop although not a statistically significant one. However, the
> per-node CPU utilisation is skewed as nodes approach being fully utilised.
>
> I see similarly bad results on a 2-socket Broadwell machine.
>
> I don't have other results yet for other workloads but I'm not optimistic
> that I'll see anything different. I still think that this probably didn't
> show up with hackbench because it's too short-lived for NUMA balancing
> to be a factor.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13  8:05       ` Vincent Guittot
@ 2020-02-13  9:24         ` Mel Gorman
       [not found]         ` <20200213131658.9600-1-hdanton@sina.com>
  1 sibling, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2020-02-13  9:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Thu, Feb 13, 2020 at 09:05:35AM +0100, Vincent Guittot wrote:
> > And I was wrong. I worried that the conflict with the load balancer would
> > be a problem when I wrote this comment.  That was based on finding that
> > I had to account for the load balancer when using capacity to decide
> > whether an idle CPU can be used. When I didn't, performance went to
> > hell but thought that maybe you had somehow avoided the same problem.
> > Unfortunately, testing indicates that the same, or similar, problem is
> > here.
> >
> > This is specjbb running two JVMs on a 2-socket Haswell machine with 48
> > cores in total. This is just your first two patches, I have found that
> 
> thanks for the test results. Unfortunately i haven't specjbb to
> analyse the metrics and study what is going wrong
> 

That's fine. There will be alternative tests that show a similar
problem. I generally like specjbb because it's more predictable than
others while still being complex enough from a scheduler perspective to be
interesting.

> > Note that how it reaches the point where the node is almost utilised
> > ( near tput-12) that performance drops.  Graphing mpstat on a per-node
> > basis shows there is imbalance in the CPU utilisation between nodes.
> 
> ok. So the has_spare_capacity condition that i added in
> load_too_imbalanced() is probably to aggressive although the goal was
> to let task moving on their preferred node
> 

Yes. It's a "fun" challenge to balance compute capacity, memory bandwidth
and memory locality. The patch on top I posted yesterday had a error
but fortunately I caught it last night after one test showed surprising
results. The corrected version is below. With it, the degree to which
both CPU load balance and NUMA balancing allows an imbalance between two
nodes to exist is controlled from one common function. It appears to work
but I want more than one test to complete before I bet money on it.

The one test is close to performance-neutral even though 4-socket shows
more system CPU usage than I'd like. This is still an improvement given
that the two balancers now make decisions based on the same metrics
without regressing (yet). I'll continue looking at how both of our series
can play nice together but so far, this is what I think is needed after
patch 2 of your series;

--8<--
sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity

The standard load balancer generally tries to keep the number of running
tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows
tasks to move if there is spare capacity but this causes a conflict and
utilisation between NUMA nodes gets badly skewed. This patch uses similar
logic between the NUMA balancer and load balancer when deciding if a task
migrating to its preferred node can use an idle CPU.

Signed-off-by: Mel Gorman <mgorman@suse.com>
---
 kernel/sched/fair.c | 76 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52e74b53d6e7..5a34752b8dbe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1520,6 +1520,7 @@ struct task_numa_env {
 
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_util(int cpu);
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1594,11 +1595,6 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
-
-	/* If dst node has spare capacity, there is no real load imbalance */
-	if (env->dst_stats.node_type == node_has_spare)
-		return false;
-
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1757,19 +1753,37 @@ static void task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
-	long src_load, dst_load, load;
 	bool maymove = false;
 	int cpu;
 
-	load = task_h_load(env->p);
-	dst_load = env->dst_stats.load + load;
-	src_load = env->src_stats.load - load;
-
 	/*
-	 * If the improvement from just moving env->p direction is better
-	 * than swapping tasks around, check if a move is possible.
+	 * If dst node has spare capacity, then check if there is an
+	 * imbalance that would be overruled by the load balancer.
 	 */
-	maymove = !load_too_imbalanced(src_load, dst_load, env);
+	if (env->dst_stats.node_type == node_has_spare) {
+		unsigned int imbalance;
+		int src_running, dst_running;
+
+		/* Would movement cause an imbalance? */
+		src_running = env->src_stats.nr_running - 1;
+		dst_running = env->src_stats.nr_running + 1;
+		imbalance = max(0, dst_running - src_running);
+		imbalance = adjust_numa_imbalance(imbalance, src_running);
+
+		/* Use idle CPU if there is no imbalance */
+		if (!imbalance)
+			maymove = true;
+	} else {
+		long src_load, dst_load, load;
+		/*
+		 * If the improvement from just moving env->p direction is better
+		 * than swapping tasks around, check if a move is possible.
+		 */
+		load = task_h_load(env->p);
+		dst_load = env->dst_stats.load + load;
+		src_load = env->src_stats.load - load;
+		maymove = !load_too_imbalanced(src_load, dst_load, env);
+	}
 
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
@@ -8700,6 +8714,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
+{
+	unsigned int imbalance_min;
+
+	/*
+	 * Allow a small imbalance based on a simple pair of communicating
+	 * tasks that remain local when the source domain is almost idle.
+	 */
+	imbalance_min = 2;
+	if (src_nr_running <= imbalance_min)
+		return 0;
+
+	return imbalance;
+}
+
 /**
  * calculate_imbalance - Calculate the amount of imbalance present within the
  *			 groups of a given sched_domain during load balance.
@@ -8796,24 +8825,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA) {
-			unsigned int imbalance_min;
-
-			/*
-			 * Compute an allowed imbalance based on a simple
-			 * pair of communicating tasks that should remain
-			 * local and ignore them.
-			 *
-			 * NOTE: Generally this would have been based on
-			 * the domain size and this was evaluated. However,
-			 * the benefit is similar across a range of workloads
-			 * and machines but scaling by the domain size adds
-			 * the risk that lower domains have to be rebalanced.
-			 */
-			imbalance_min = 2;
-			if (busiest->sum_nr_running <= imbalance_min)
-				env->imbalance = 0;
-		}
+		if (env->sd->flags & SD_NUMA)
+			env->imbalance = adjust_numa_imbalance(env->imbalance,
+						busiest->sum_nr_running);
 
 		return;
 	}
-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
       [not found]         ` <20200213131658.9600-1-hdanton@sina.com>
@ 2020-02-13 13:46           ` Mel Gorman
  2020-02-13 15:00             ` Phil Auld
  0 siblings, 1 reply; 32+ messages in thread
From: Mel Gorman @ 2020-02-13 13:46 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Phil Auld, Parth Shah, Valentin Schneider

On Thu, Feb 13, 2020 at 09:16:58PM +0800, Hillf Danton wrote:
> > -	load = task_h_load(env->p);
> > -	dst_load = env->dst_stats.load + load;
> > -	src_load = env->src_stats.load - load;
> > -
> >  	/*
> > -	 * If the improvement from just moving env->p direction is better
> > -	 * than swapping tasks around, check if a move is possible.
> > +	 * If dst node has spare capacity, then check if there is an
> > +	 * imbalance that would be overruled by the load balancer.
> >  	 */
> > -	maymove = !load_too_imbalanced(src_load, dst_load, env);
> > +	if (env->dst_stats.node_type == node_has_spare) {
> > +		unsigned int imbalance;
> > +		int src_running, dst_running;
> > +
> > +		/* Would movement cause an imbalance? */
> > +		src_running = env->src_stats.nr_running - 1;
> > +		dst_running = env->src_stats.nr_running + 1;
> > +		imbalance = max(0, dst_running - src_running);
> 
> Have trouble working out why 2 is magician again to make your test data nicer :P
> 

This is calculating what the nr_running would be after the move and
checking if an imbalance exists after the move forcing the load balancer
to intervene.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13 13:46           ` Mel Gorman
@ 2020-02-13 15:00             ` Phil Auld
  2020-02-13 15:14               ` Mel Gorman
  0 siblings, 1 reply; 32+ messages in thread
From: Phil Auld @ 2020-02-13 15:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hillf Danton, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	linux-kernel, Parth Shah, Valentin Schneider

On Thu, Feb 13, 2020 at 01:46:55PM +0000 Mel Gorman wrote:
> On Thu, Feb 13, 2020 at 09:16:58PM +0800, Hillf Danton wrote:
> > > -	load = task_h_load(env->p);
> > > -	dst_load = env->dst_stats.load + load;
> > > -	src_load = env->src_stats.load - load;
> > > -
> > >  	/*
> > > -	 * If the improvement from just moving env->p direction is better
> > > -	 * than swapping tasks around, check if a move is possible.
> > > +	 * If dst node has spare capacity, then check if there is an
> > > +	 * imbalance that would be overruled by the load balancer.
> > >  	 */
> > > -	maymove = !load_too_imbalanced(src_load, dst_load, env);
> > > +	if (env->dst_stats.node_type == node_has_spare) {
> > > +		unsigned int imbalance;
> > > +		int src_running, dst_running;
> > > +
> > > +		/* Would movement cause an imbalance? */
> > > +		src_running = env->src_stats.nr_running - 1;
> > > +		dst_running = env->src_stats.nr_running + 1;
> > > +		imbalance = max(0, dst_running - src_running);
> > 
> > Have trouble working out why 2 is magician again to make your test data nicer :P
> > 
> 
> This is calculating what the nr_running would be after the move and
> checking if an imbalance exists after the move forcing the load balancer
> to intervene.

Isn't that always going to work out to 2? 


Cheers,
Phil

> 
> -- 
> Mel Gorman
> SUSE Labs
> 

-- 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13 15:00             ` Phil Auld
@ 2020-02-13 15:14               ` Mel Gorman
  2020-02-13 16:11                 ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Mel Gorman @ 2020-02-13 15:14 UTC (permalink / raw)
  To: Phil Auld
  Cc: Hillf Danton, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	linux-kernel, Parth Shah, Valentin Schneider

On Thu, Feb 13, 2020 at 10:00:26AM -0500, Phil Auld wrote:
> On Thu, Feb 13, 2020 at 01:46:55PM +0000 Mel Gorman wrote:
> > On Thu, Feb 13, 2020 at 09:16:58PM +0800, Hillf Danton wrote:
> > > > -	load = task_h_load(env->p);
> > > > -	dst_load = env->dst_stats.load + load;
> > > > -	src_load = env->src_stats.load - load;
> > > > -
> > > >  	/*
> > > > -	 * If the improvement from just moving env->p direction is better
> > > > -	 * than swapping tasks around, check if a move is possible.
> > > > +	 * If dst node has spare capacity, then check if there is an
> > > > +	 * imbalance that would be overruled by the load balancer.
> > > >  	 */
> > > > -	maymove = !load_too_imbalanced(src_load, dst_load, env);
> > > > +	if (env->dst_stats.node_type == node_has_spare) {
> > > > +		unsigned int imbalance;
> > > > +		int src_running, dst_running;
> > > > +
> > > > +		/* Would movement cause an imbalance? */
> > > > +		src_running = env->src_stats.nr_running - 1;
> > > > +		dst_running = env->src_stats.nr_running + 1;
> > > > +		imbalance = max(0, dst_running - src_running);
> > > 
> > > Have trouble working out why 2 is magician again to make your test data nicer :P
> > > 
> > 
> > This is calculating what the nr_running would be after the move and
> > checking if an imbalance exists after the move forcing the load balancer
> > to intervene.
> 
> Isn't that always going to work out to 2? 
> 

Crap, stupid cut and paste moving between source trees. Yes, this is
broken.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13 15:14               ` Mel Gorman
@ 2020-02-13 16:11                 ` Vincent Guittot
  2020-02-13 16:34                   ` Mel Gorman
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2020-02-13 16:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Hillf Danton, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Parth Shah, Valentin Schneider

On Thu, 13 Feb 2020 at 16:14, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Feb 13, 2020 at 10:00:26AM -0500, Phil Auld wrote:
> > On Thu, Feb 13, 2020 at 01:46:55PM +0000 Mel Gorman wrote:
> > > On Thu, Feb 13, 2020 at 09:16:58PM +0800, Hillf Danton wrote:
> > > > > -       load = task_h_load(env->p);
> > > > > -       dst_load = env->dst_stats.load + load;
> > > > > -       src_load = env->src_stats.load - load;
> > > > > -
> > > > >         /*
> > > > > -        * If the improvement from just moving env->p direction is better
> > > > > -        * than swapping tasks around, check if a move is possible.
> > > > > +        * If dst node has spare capacity, then check if there is an
> > > > > +        * imbalance that would be overruled by the load balancer.
> > > > >          */
> > > > > -       maymove = !load_too_imbalanced(src_load, dst_load, env);
> > > > > +       if (env->dst_stats.node_type == node_has_spare) {
> > > > > +               unsigned int imbalance;
> > > > > +               int src_running, dst_running;
> > > > > +
> > > > > +               /* Would movement cause an imbalance? */
> > > > > +               src_running = env->src_stats.nr_running - 1;
> > > > > +               dst_running = env->src_stats.nr_running + 1;
> > > > > +               imbalance = max(0, dst_running - src_running);
> > > >
> > > > Have trouble working out why 2 is magician again to make your test data nicer :P
> > > >
> > >
> > > This is calculating what the nr_running would be after the move and
> > > checking if an imbalance exists after the move forcing the load balancer
> > > to intervene.
> >
> > Isn't that always going to work out to 2?
> >
>
> Crap, stupid cut and paste moving between source trees. Yes, this is
> broken.

On the load balance side we have 2 rules when NUMA groups has spare capacity:
- ensure that the diff between src and dst nr_running < 2
- if src_nr_running is lower than 2, allow a degree of imbalance of 2
instead of 1

Your test doesn't explicitly ensure that the 1 condition is met

That being said, I'm not sure it's really a wrong thing ? I mean
load_balance will probably try to pull back some tasks on src but as
long as it is not a task with dst node as preferred node, it should
not be that harmfull

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13 16:11                 ` Vincent Guittot
@ 2020-02-13 16:34                   ` Mel Gorman
  2020-02-13 16:38                     ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Mel Gorman @ 2020-02-13 16:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Phil Auld, Hillf Danton, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Parth Shah, Valentin Schneider

On Thu, Feb 13, 2020 at 05:11:23PM +0100, Vincent Guittot wrote:
> > On Thu, Feb 13, 2020 at 10:00:26AM -0500, Phil Auld wrote:
> > > On Thu, Feb 13, 2020 at 01:46:55PM +0000 Mel Gorman wrote:
> > > > On Thu, Feb 13, 2020 at 09:16:58PM +0800, Hillf Danton wrote:
> > > > > > -       load = task_h_load(env->p);
> > > > > > -       dst_load = env->dst_stats.load + load;
> > > > > > -       src_load = env->src_stats.load - load;
> > > > > > -
> > > > > >         /*
> > > > > > -        * If the improvement from just moving env->p direction is better
> > > > > > -        * than swapping tasks around, check if a move is possible.
> > > > > > +        * If dst node has spare capacity, then check if there is an
> > > > > > +        * imbalance that would be overruled by the load balancer.
> > > > > >          */
> > > > > > -       maymove = !load_too_imbalanced(src_load, dst_load, env);
> > > > > > +       if (env->dst_stats.node_type == node_has_spare) {
> > > > > > +               unsigned int imbalance;
> > > > > > +               int src_running, dst_running;
> > > > > > +
> > > > > > +               /* Would movement cause an imbalance? */
> > > > > > +               src_running = env->src_stats.nr_running - 1;
> > > > > > +               dst_running = env->src_stats.nr_running + 1;
> > > > > > +               imbalance = max(0, dst_running - src_running);
> > > > >
> > > > > Have trouble working out why 2 is magician again to make your test data nicer :P
> > > > >
> > > >
> > > > This is calculating what the nr_running would be after the move and
> > > > checking if an imbalance exists after the move forcing the load balancer
> > > > to intervene.
> > >
> > > Isn't that always going to work out to 2?
> > >
> >
> > Crap, stupid cut and paste moving between source trees. Yes, this is
> > broken.
> 
> On the load balance side we have 2 rules when NUMA groups has spare capacity:
> - ensure that the diff between src and dst nr_running < 2
> - if src_nr_running is lower than 2, allow a degree of imbalance of 2
> instead of 1
> 
> Your test doesn't explicitly ensure that the 1 condition is met
> 
> That being said, I'm not sure it's really a wrong thing ? I mean
> load_balance will probably try to pull back some tasks on src but as
> long as it is not a task with dst node as preferred node, it should
> not be that harmfull

My thinking was that if source has as many or more running tasks than
the destination *after* the move that it's not harmful and does not add
work for the load balancer.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13 16:34                   ` Mel Gorman
@ 2020-02-13 16:38                     ` Vincent Guittot
  2020-02-13 17:02                       ` Mel Gorman
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2020-02-13 16:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Hillf Danton, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Parth Shah, Valentin Schneider

On Thu, 13 Feb 2020 at 17:34, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Feb 13, 2020 at 05:11:23PM +0100, Vincent Guittot wrote:
> > > On Thu, Feb 13, 2020 at 10:00:26AM -0500, Phil Auld wrote:
> > > > On Thu, Feb 13, 2020 at 01:46:55PM +0000 Mel Gorman wrote:
> > > > > On Thu, Feb 13, 2020 at 09:16:58PM +0800, Hillf Danton wrote:
> > > > > > > -       load = task_h_load(env->p);
> > > > > > > -       dst_load = env->dst_stats.load + load;
> > > > > > > -       src_load = env->src_stats.load - load;
> > > > > > > -
> > > > > > >         /*
> > > > > > > -        * If the improvement from just moving env->p direction is better
> > > > > > > -        * than swapping tasks around, check if a move is possible.
> > > > > > > +        * If dst node has spare capacity, then check if there is an
> > > > > > > +        * imbalance that would be overruled by the load balancer.
> > > > > > >          */
> > > > > > > -       maymove = !load_too_imbalanced(src_load, dst_load, env);
> > > > > > > +       if (env->dst_stats.node_type == node_has_spare) {
> > > > > > > +               unsigned int imbalance;
> > > > > > > +               int src_running, dst_running;
> > > > > > > +
> > > > > > > +               /* Would movement cause an imbalance? */
> > > > > > > +               src_running = env->src_stats.nr_running - 1;
> > > > > > > +               dst_running = env->src_stats.nr_running + 1;
> > > > > > > +               imbalance = max(0, dst_running - src_running);
> > > > > >
> > > > > > Have trouble working out why 2 is magician again to make your test data nicer :P
> > > > > >
> > > > >
> > > > > This is calculating what the nr_running would be after the move and
> > > > > checking if an imbalance exists after the move forcing the load balancer
> > > > > to intervene.
> > > >
> > > > Isn't that always going to work out to 2?
> > > >
> > >
> > > Crap, stupid cut and paste moving between source trees. Yes, this is
> > > broken.
> >
> > On the load balance side we have 2 rules when NUMA groups has spare capacity:
> > - ensure that the diff between src and dst nr_running < 2
> > - if src_nr_running is lower than 2, allow a degree of imbalance of 2
> > instead of 1
> >
> > Your test doesn't explicitly ensure that the 1 condition is met
> >
> > That being said, I'm not sure it's really a wrong thing ? I mean
> > load_balance will probably try to pull back some tasks on src but as
> > long as it is not a task with dst node as preferred node, it should
> > not be that harmfull
>
> My thinking was that if source has as many or more running tasks than
> the destination *after* the move that it's not harmful and does not add
> work for the load balancer.

load_balancer will see an imbalance but fbq_classify_group/queue
should be there to prevent from pulling back tasks that are on the
preferred node but only other tasks

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13 16:38                     ` Vincent Guittot
@ 2020-02-13 17:02                       ` Mel Gorman
  2020-02-13 17:15                         ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Mel Gorman @ 2020-02-13 17:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Phil Auld, Hillf Danton, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Parth Shah, Valentin Schneider

On Thu, Feb 13, 2020 at 05:38:31PM +0100, Vincent Guittot wrote:
> > > Your test doesn't explicitly ensure that the 1 condition is met
> > >
> > > That being said, I'm not sure it's really a wrong thing ? I mean
> > > load_balance will probably try to pull back some tasks on src but as
> > > long as it is not a task with dst node as preferred node, it should
> > > not be that harmfull
> >
> > My thinking was that if source has as many or more running tasks than
> > the destination *after* the move that it's not harmful and does not add
> > work for the load balancer.
> 
> load_balancer will see an imbalance but fbq_classify_group/queue
> should be there to prevent from pulling back tasks that are on the
> preferred node but only other tasks
> 

Yes, exactly. Between fbq_classify and migrate_degrades_locality, I'm
expecting that the load balancer will only override NUMA balancing when
there is no better option. When the imbalance check, I want to avoid
the situation where NUMA balancing moves a task for locality, LB pulls
it back for balance, NUMA retries the move etc because it's stupid. The
locality matters but being continually dequeue/enqueue is unhelpful.

While there might be grounds for relaxing the degree an imbalance is
allowed across SD domains, I am avoiding looking in that direction again
until the load balancer and NUMA balancer stop overriding each other for
silly reasons (or the NUMA balancer fighting itself which can happen).

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
  2020-02-13 17:02                       ` Mel Gorman
@ 2020-02-13 17:15                         ` Vincent Guittot
  0 siblings, 0 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-13 17:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Hillf Danton, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Parth Shah, Valentin Schneider

On Thu, 13 Feb 2020 at 18:02, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Feb 13, 2020 at 05:38:31PM +0100, Vincent Guittot wrote:
> > > > Your test doesn't explicitly ensure that the 1 condition is met
> > > >
> > > > That being said, I'm not sure it's really a wrong thing ? I mean
> > > > load_balance will probably try to pull back some tasks on src but as
> > > > long as it is not a task with dst node as preferred node, it should
> > > > not be that harmfull
> > >
> > > My thinking was that if source has as many or more running tasks than
> > > the destination *after* the move that it's not harmful and does not add
> > > work for the load balancer.
> >
> > load_balancer will see an imbalance but fbq_classify_group/queue
> > should be there to prevent from pulling back tasks that are on the
> > preferred node but only other tasks
> >
>
> Yes, exactly. Between fbq_classify and migrate_degrades_locality, I'm
> expecting that the load balancer will only override NUMA balancing when
> there is no better option. When the imbalance check, I want to avoid
> the situation where NUMA balancing moves a task for locality, LB pulls
> it back for balance, NUMA retries the move etc because it's stupid. The
> locality matters but being continually dequeue/enqueue is unhelpful.
>
> While there might be grounds for relaxing the degree an imbalance is
> allowed across SD domains, I am avoiding looking in that direction again
> until the load balancer and NUMA balancer stop overriding each other for
> silly reasons (or the NUMA balancer fighting itself which can happen).

make sense

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 3/4] sched/fair: replace runnable load average by runnable average
  2020-02-11 17:46 ` [RFC 3/4] sched/fair: replace runnable load average by runnable average Vincent Guittot
  2020-02-12 14:30   ` Mel Gorman
@ 2020-02-13 17:36   ` Peter Zijlstra
  2020-02-14  7:43     ` Vincent Guittot
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2020-02-13 17:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	linux-kernel, pauld, parth, valentin.schneider

On Tue, Feb 11, 2020 at 06:46:50PM +0100, Vincent Guittot wrote:
> @@ -367,6 +367,14 @@ struct util_est {
>   * For cfs_rq, it is the aggregated load_avg of all runnable and
>   * blocked sched_entities.
>   *
> + * [runnable_avg definition]
> + *
> + *   runnable_avg = runnable%

				 * SCHED_CAPACITY_SCALE

right, just like we have for util_avg.

> + *
> + * where runnable% is the time ratio that a sched_entity is runnable.
> + * For cfs_rq, it is the aggregated runnable_avg of all runnable and
> + * blocked sched_entities.

Which is a verbatim repeat of the runnable% definition for load_avg,
right? Perhaps re-arrange the text such that we only have a single
definition for each symbol?

> + *
>   * [util_avg definition]
>   *
>   *   util_avg = running% * SCHED_CAPACITY_SCALE

Can you please split this patch in two? First remove everything
runnable_load_avg, and then introduce runnable_avg.

I didn't quickly spot anything off, and you're right that runnable vs
util is an interesting signal.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 4/4] sched/fair: Take into runnable_avg to classify group
  2020-02-11 17:46 ` [RFC 4/4] sched/fair: Take into runnable_avg to classify group Vincent Guittot
@ 2020-02-13 18:32   ` Valentin Schneider
  2020-02-13 18:37     ` Valentin Schneider
  0 siblings, 1 reply; 32+ messages in thread
From: Valentin Schneider @ 2020-02-13 18:32 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, linux-kernel
  Cc: pauld, parth

On 2/11/20 5:46 PM, Vincent Guittot wrote:
> Take into account the new runnable_avg signal to classify a group and to
> mitigate the volatility of util_avg in face of intensive migration.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d7cb207be30..5f8f12c902d4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7691,6 +7691,7 @@ struct sg_lb_stats {
>  	unsigned long group_load; /* Total load over the CPUs of the group */
>  	unsigned long group_capacity;
>  	unsigned long group_util; /* Total utilization of the group */
> +	unsigned long group_runnable; /* Total utilization of the group */
                                               ^^^^^^^^^^^
"Total runnable" hurts my eyes, but in any case this shouldn't be just
"utilization".

>  	unsigned int sum_nr_running; /* Nr of tasks running in the group */
>  	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
>  	unsigned int idle_cpus;
> @@ -7911,6 +7912,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  	if (sgs->sum_nr_running < sgs->group_weight)
>  		return true;
>  
> +	if ((sgs->group_capacity * imbalance_pct) <
> +			(sgs->group_runnable * 100))
> +		return false;
> +

I haven't stared long enough at patch 2, but I'll ask anyway - with this new
condition, do we still need the next one (based on util)? AIUI
group_runnable is >= group_util, so if group_runnable is within the allowed
margin then group_util has to be as well.

>  	if ((sgs->group_capacity * 100) >
>  			(sgs->group_util * imbalance_pct))
>  		return true;
> @@ -7936,6 +7941,10 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  			(sgs->group_util * imbalance_pct))
>  		return true;
>  
> +	if ((sgs->group_capacity * imbalance_pct) <
> +			(sgs->group_runnable * 100))
> +		return true;
> +

Ditto on the group_runnable >= group_util - we could get rid of the check
above this one.

>  	return false;
>  }
>  
> @@ -8030,6 +8039,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util(i);
> +		sgs->group_runnable += cpu_runnable(rq);
>  		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
>  
>  		nr_running = rq->nr_running;
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 4/4] sched/fair: Take into runnable_avg to classify group
  2020-02-13 18:32   ` Valentin Schneider
@ 2020-02-13 18:37     ` Valentin Schneider
  2020-02-14  7:48       ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Valentin Schneider @ 2020-02-13 18:37 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, linux-kernel
  Cc: pauld, parth

On 2/13/20 6:32 PM, Valentin Schneider wrote:
>> @@ -7911,6 +7912,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>>  	if (sgs->sum_nr_running < sgs->group_weight)
>>  		return true;
>>  
>> +	if ((sgs->group_capacity * imbalance_pct) <
>> +			(sgs->group_runnable * 100))
>> +		return false;
>> +
> 
> I haven't stared long enough at patch 2, but I'll ask anyway - with this new
> condition, do we still need the next one (based on util)? AIUI
> group_runnable is >= group_util, so if group_runnable is within the allowed
> margin then group_util has to be as well.
> 

Hmph, actually util_est breaks the runnable >= util assumption I think...

>>  	if ((sgs->group_capacity * 100) >
>>  			(sgs->group_util * imbalance_pct))
>>  		return true;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 3/4] sched/fair: replace runnable load average by runnable average
  2020-02-12 14:30   ` Mel Gorman
@ 2020-02-14  7:42     ` Vincent Guittot
  0 siblings, 0 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-14  7:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Wed, 12 Feb 2020 at 15:30, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:50PM +0100, Vincent Guittot wrote:
> > Now that runnable_load_avg is not more used, we can replace it by a new
> > signal that will highlight the runnable pressure on a cfs_rq. This signal
> > track the waiting time of tasks on rq and can help to better define the
> > state of rqs.
> >
> > At now, only util_avg is used to define the state of a rq:
> >   A rq with more that around 80% of utilization and more than 1 tasks is
> >   considered as overloaded.
> >
> > But the util_avg signal of a rq can become temporaly low after that a task
> > migrated onto another rq which can bias the classification of the rq.
> >
> > When tasks compete for the same rq, their runnable average signal will be
> > higher than util_avg as it will include the waiting time and we can use
> > this signal to better classify cfs_rqs.
> >
> > The new runnable_avg will track the runnable time of a task which simply
> > adds the waiting time to the running time. The runnbale _avg of cfs_rq
> > will be the /Sum of se's runnable_avg and the runnable_avg of group entity
> > will follow the one of the rq similarly to util_avg.
> >
>
> s/runnbale/runnable/
>
> Otherwise, all I can do is give a heads-up that I will not be able to
> review this patch and the next patch properly in the short-term. While the
> new metric appears to have a sensible definition, I've not spent enough
> time comparing/contrasting the pro's and con's of PELT implementation
> details or their consequences. I am not confident I can accurately
> predict whether this is better or if there are corner cases that make
> poor placement decisions based on fast changes of runnable_avg. At least
> not within a reasonable amount of time.

ok. understood

>
> This caught my attention though
>
> > @@ -4065,8 +4018,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >        *   - Add its new weight to cfs_rq->load.weight
> >        */
> >       update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
> > +     se_update_runnable(se);
> >       update_cfs_group(se);
> > -     enqueue_runnable_load_avg(cfs_rq, se);
> >       account_entity_enqueue(cfs_rq, se);
> >
> >       if (flags & ENQUEUE_WAKEUP)
>
> I don't think the ordering matters any more because of what was removed
> from update_cfs_group. Unfortunately, I'm not 100% confident so am
> bringing it to your attention in case it does.

Yes. I have just tried to keep the same order with
se_update_runnable() just below update_load_avg() like for other
places

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 3/4] sched/fair: replace runnable load average by runnable average
  2020-02-13 17:36   ` Peter Zijlstra
@ 2020-02-14  7:43     ` Vincent Guittot
  0 siblings, 0 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-14  7:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider

On Thu, 13 Feb 2020 at 18:36, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:50PM +0100, Vincent Guittot wrote:
> > @@ -367,6 +367,14 @@ struct util_est {
> >   * For cfs_rq, it is the aggregated load_avg of all runnable and
> >   * blocked sched_entities.
> >   *
> > + * [runnable_avg definition]
> > + *
> > + *   runnable_avg = runnable%
>
>                                  * SCHED_CAPACITY_SCALE
>
> right, just like we have for util_avg.
>
> > + *
> > + * where runnable% is the time ratio that a sched_entity is runnable.
> > + * For cfs_rq, it is the aggregated runnable_avg of all runnable and
> > + * blocked sched_entities.
>
> Which is a verbatim repeat of the runnable% definition for load_avg,
> right? Perhaps re-arrange the text such that we only have a single
> definition for each symbol?

yes . will do

>
> > + *
> >   * [util_avg definition]
> >   *
> >   *   util_avg = running% * SCHED_CAPACITY_SCALE
>
> Can you please split this patch in two? First remove everything
> runnable_load_avg, and then introduce runnable_avg.

ok.

>
> I didn't quickly spot anything off, and you're right that runnable vs
> util is an interesting signal.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 4/4] sched/fair: Take into runnable_avg to classify group
  2020-02-13 18:37     ` Valentin Schneider
@ 2020-02-14  7:48       ` Vincent Guittot
  0 siblings, 0 replies; 32+ messages in thread
From: Vincent Guittot @ 2020-02-14  7:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah

On Thu, 13 Feb 2020 at 19:37, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 2/13/20 6:32 PM, Valentin Schneider wrote:
> >> @@ -7911,6 +7912,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> >>      if (sgs->sum_nr_running < sgs->group_weight)
> >>              return true;
> >>
> >> +    if ((sgs->group_capacity * imbalance_pct) <
> >> +                    (sgs->group_runnable * 100))
> >> +            return false;
> >> +
> >
> > I haven't stared long enough at patch 2, but I'll ask anyway - with this new
> > condition, do we still need the next one (based on util)? AIUI
> > group_runnable is >= group_util, so if group_runnable is within the allowed
> > margin then group_util has to be as well.
> >
>
> Hmph, actually util_est breaks the runnable >= util assumption I think...

yes, that's 1 reason

and also the 2 conditions are a bit different as  the imbalance_pct is
not on the same side of the condition.

For util_avg, the tests is true when util_avg is still below but close
to capacity
For runnable_avg, the test is true  when runnable is significantly
above capacity

>
> >>      if ((sgs->group_capacity * 100) >
> >>                      (sgs->group_util * imbalance_pct))
> >>              return true;

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2020-02-14  7:48 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-11 17:46 [PATCH 0/4] remove runnable_load_avg and improve group_classify Vincent Guittot
2020-02-11 17:46 ` [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path Vincent Guittot
2020-02-12 13:20   ` Mel Gorman
2020-02-12 14:47     ` Vincent Guittot
2020-02-12 16:11       ` Mel Gorman
2020-02-11 17:46 ` [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg Vincent Guittot
2020-02-12 13:37   ` Mel Gorman
2020-02-12 15:03     ` Vincent Guittot
2020-02-12 16:04       ` Mel Gorman
2020-02-12 19:49     ` Mel Gorman
2020-02-12 21:29       ` Mel Gorman
2020-02-13  8:05       ` Vincent Guittot
2020-02-13  9:24         ` Mel Gorman
     [not found]         ` <20200213131658.9600-1-hdanton@sina.com>
2020-02-13 13:46           ` Mel Gorman
2020-02-13 15:00             ` Phil Auld
2020-02-13 15:14               ` Mel Gorman
2020-02-13 16:11                 ` Vincent Guittot
2020-02-13 16:34                   ` Mel Gorman
2020-02-13 16:38                     ` Vincent Guittot
2020-02-13 17:02                       ` Mel Gorman
2020-02-13 17:15                         ` Vincent Guittot
2020-02-11 17:46 ` [RFC 3/4] sched/fair: replace runnable load average by runnable average Vincent Guittot
2020-02-12 14:30   ` Mel Gorman
2020-02-14  7:42     ` Vincent Guittot
2020-02-13 17:36   ` Peter Zijlstra
2020-02-14  7:43     ` Vincent Guittot
2020-02-11 17:46 ` [RFC 4/4] sched/fair: Take into runnable_avg to classify group Vincent Guittot
2020-02-13 18:32   ` Valentin Schneider
2020-02-13 18:37     ` Valentin Schneider
2020-02-14  7:48       ` Vincent Guittot
2020-02-11 21:04 ` [PATCH 0/4] remove runnable_load_avg and improve group_classify Mel Gorman
2020-02-12  8:16   ` Vincent Guittot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).