* [PATCH v2 0/5] remove runnable_load_avg and improve group_classify
@ 2020-02-14 15:27 Vincent Guittot
  2020-02-14 15:27 ` [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path Vincent Guittot
                   ` (6 more replies)
  0 siblings, 7 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-14 15:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton, Vincent Guittot

This new version stays quite close to the previous one and should
seamlessly replace the previous version that is part of Mel's patchset:
https://lkml.org/lkml/2020/2/14/156

NUMA load balancing is the last remaining piece of code that uses the
runnable_load_avg of PELT to balance tasks between nodes. The normal
load_balance has replaced it with a better description of the current
state of the group of CPUs. The same policy can be applied to NUMA
balancing.

Once unused, runnable_load_avg can be replaced by a simpler runnable_avg
signal that tracks the waiting time of tasks on the rq. Currently, the
state of a group of CPUs is defined by the number of running tasks and the
level of utilization of the rq. But the utilization can be temporarily low
after the migration of a task whereas the rq is still overloaded with
tasks. In such a case, where tasks were competing for the rq, the
runnable_avg will stay high after the migration.
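
As a toy illustration of why the two signals diverge (userspace sketch,
not kernel code; the number of tasks and the figures are made up):

#include <stdio.h>

int main(void)
{
	const unsigned long capacity = 1024;	/* SCHED_CAPACITY_SCALE */
	const unsigned int nr_tasks = 4;	/* hypothetical CPU-bound tasks on one rq */

	/*
	 * Each task runs ~1/nr_tasks of the time but is runnable all the
	 * time, so the waiting time keeps the rq-level runnable signal well
	 * above the utilization, even right after one of the tasks migrates.
	 */
	unsigned long util_avg = capacity;		  /* the CPU itself is ~100% busy */
	unsigned long runnable_avg = nr_tasks * capacity; /* running + waiting time */

	printf("util_avg ~ %lu, runnable_avg ~ %lu\n", util_avg, runnable_avg);
	return 0;
}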

Some hackbench results:

- small arm64 dual quad cores system
hackbench -l (2560/#grp) -g #grp

grp    tip/sched/core         +patchset              improvement
1       1,327(+/-10,06 %)     1,247(+/-5,45 %)       5,97 %
4       1,250(+/- 2,55 %)     1,207(+/-2,12 %)       3,42 %
8       1,189(+/- 1,47 %)     1,179(+/-1,93 %)       0,90 %
16      1,221(+/- 3,25 %)     1,219(+/-2,44 %)       0,16 %						

- large arm64 2 nodes / 224 cores system
hackbench -l (256000/#grp) -g #grp

grp    tip/sched/core         +patchset              improvement
1      14,197(+/- 2,73 %)     13,917(+/- 2,19 %)     1,98 %
4       6,817(+/- 1,27 %)      6,523(+/-11,96 %)     4,31 %
16      2,930(+/- 1,07 %)      2,911(+/- 1,08 %)     0,66 %
32      2,735(+/- 1,71 %)      2,725(+/- 1,53 %)     0,37 %
64      2,702(+/- 0,32 %)      2,717(+/- 1,07 %)    -0,53 %
128     3,533(+/-14,66 %)     3,123(+/-12,47 %)     11,59 %
256     3,918(+/-19,93 %)     3,390(+/- 5,93 %)     13,47 %

The significant improvements for 128 and 256 groups should be treated with
care because of some instabilities over iterations without the patchset.

The table below shows the classification of sched groups during load
balance (idle, newly idle or busy load balance) with the distribution
according to the number of running tasks for:
    hackbench -l 640 -g 4 on octo cores

                 tip/sched/core  +patchset
state
has spare            3973        1934	
        nr_running					
            0        1965        1858
            1         518          56
            2         369          18
            3         270           2
            4+        851           0
						
fully busy            546        1018	
        nr_running					
            0           0           0
            1         546        1018
            2           0           0
            3           0           0
            4+          0           0
						
overloaded           2109        3056	
        nr_running					
            0           0           0
            1           0           0
            2         241         483
            3         170         348
            4+       1698        2225

total                6628        6008	

Without the patchset, there is a significant number of times that a CPU
has spare capacity with more than 1 running task. Although this is a valid
case, this is not a state that should often happen when 160 tasks are
competing on 8 cores as in this test. The patchset fixes the situation by
taking into account the runnable_avg, which stays high after the migration
of a task to another CPU.
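
A minimal sketch of the classification rule this leads to (illustrative
userspace code with made-up figures; group_has_spare(), the capacities and
the 25% margin are assumptions, not the kernel implementation):

#include <stdbool.h>
#include <stdio.h>

/*
 * A group only reports spare capacity while its runnable pressure stays
 * below its compute capacity (with some margin); otherwise tasks are
 * still competing for the CPUs even if util_avg has temporarily dropped.
 */
static bool group_has_spare(unsigned long runnable, unsigned long capacity,
			    unsigned int imbalance_pct)
{
	return runnable * 100 < capacity * imbalance_pct;
}

int main(void)
{
	unsigned long capacity = 8 * 1024;	/* octo cores, SCHED_CAPACITY_SCALE each */
	unsigned long runnable = 20 * 1024;	/* stays high while 160 tasks compete */

	printf("has spare capacity: %s\n",
	       group_has_spare(runnable, capacity, 125) ? "yes" : "no");
	return 0;
}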

Changes since RFC:
- fix some comments and typos
- split the removal of runnable_load_avg and the addition of runnable_avg
  into 2 patches

Vincent Guittot (5):
  sched/fair: Reorder enqueue/dequeue_task_fair path
  sched/numa: Replace runnable_load_avg by load_avg
  sched/pelt: Remove unused runnable load average
  sched/pelt: Add a new runnable average signal
  sched/fair: Take into account runnable_avg to classify group


 include/linux/sched.h |  33 +++--
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |  17 +--
 kernel/sched/fair.c   | 337 ++++++++++++++++++++++--------------------
 kernel/sched/pelt.c   |  45 +++---
 kernel/sched/sched.h  |  29 +++-
 6 files changed, 247 insertions(+), 216 deletions(-)

-- 
2.17.1



* [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
@ 2020-02-14 15:27 ` Vincent Guittot
  2020-02-18 12:37   ` Dietmar Eggemann
  2020-02-14 15:27 ` [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg Vincent Guittot
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-14 15:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton, Vincent Guittot

The walk through the cgroup hierarchy during the enqueue/dequeue of a task
is split into 2 distinct parts for throttled cfs_rq, which adds no value
but makes the code less readable.

Change the code ordering such that everything related to a cfs_rq
(throttled or not) is done in the same loop.

In addition, the same ordering of steps is used when updating a cfs_rq:
- update_load_avg
- update_cfs_group
- update *h_nr_running

No functional or performance change is expected, and none has been noticed
during tests.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7e11b1bb64c..27450c4ddc81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5259,32 +5259,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running increment below.
-		 */
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto enqueue_throttle;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 	}
 
+enqueue_throttle:
 	if (!se) {
 		add_nr_running(rq, 1);
 		/*
@@ -5345,17 +5344,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running decrement below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto dequeue_throttle;
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5373,16 +5368,19 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 	}
 
+dequeue_throttle:
 	if (!se)
 		sub_nr_running(rq, 1);
 
-- 
2.17.1



* [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
  2020-02-14 15:27 ` [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path Vincent Guittot
@ 2020-02-14 15:27 ` Vincent Guittot
  2020-02-18 12:37   ` Dietmar Eggemann
  2020-02-18 14:54   ` Valentin Schneider
  2020-02-14 15:27 ` [PATCH v2 3/5] sched/pelt: Remove unused runnable load average Vincent Guittot
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-14 15:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton, Vincent Guittot

Similarly to what has been done for the normal load balancer, we can
replace runnable_load_avg by load_avg in NUMA load balancing and track
other statistics like the utilization and the number of running tasks to
get a better view of the current state of a node.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 102 ++++++++++++++++++++++++++++++++------------
 1 file changed, 75 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 27450c4ddc81..12266b248f52 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
-
-static unsigned long cpu_runnable_load(struct rq *rq)
-{
-	return cfs_rq_runnable_load_avg(&rq->cfs);
-}
+/*
+ * 'numa_type' describes the node at the moment of load balancing.
+ */
+enum numa_type {
+	/* The node has spare capacity that can be used to run more tasks.  */
+	node_has_spare = 0,
+	/*
+	 * The node is fully used and the tasks don't compete for more CPU
+	 * cycles. Nevertheless, some tasks might wait before running.
+	 */
+	node_fully_busy,
+	/*
+	 * The node is overloaded and can't provide expected CPU cycles to all
+	 * tasks.
+	 */
+	node_overloaded
+};
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
 	unsigned long load;
-
+	unsigned long util;
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
+	unsigned int nr_running;
+	unsigned int weight;
+	enum numa_type node_type;
 };
 
-/*
- * XXX borrowed from update_sg_lb_stats
- */
-static void update_numa_stats(struct numa_stats *ns, int nid)
-{
-	int cpu;
-
-	memset(ns, 0, sizeof(*ns));
-	for_each_cpu(cpu, cpumask_of_node(nid)) {
-		struct rq *rq = cpu_rq(cpu);
-
-		ns->load += cpu_runnable_load(rq);
-		ns->compute_capacity += capacity_of(cpu);
-	}
-
-}
-
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -1521,6 +1518,47 @@ struct task_numa_env {
 	int best_cpu;
 };
 
+static unsigned long cpu_load(struct rq *rq);
+static unsigned long cpu_util(int cpu);
+
+static inline enum
+numa_type numa_classify(unsigned int imbalance_pct,
+			 struct numa_stats *ns)
+{
+	if ((ns->nr_running > ns->weight) &&
+	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
+		return node_overloaded;
+
+	if ((ns->nr_running < ns->weight) ||
+	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
+		return node_has_spare;
+
+	return node_fully_busy;
+}
+
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct task_numa_env *env,
+			      struct numa_stats *ns, int nid)
+{
+	int cpu;
+
+	memset(ns, 0, sizeof(*ns));
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		ns->load += cpu_load(rq);
+		ns->util += cpu_util(cpu);
+		ns->nr_running += rq->cfs.h_nr_running;
+		ns->compute_capacity += capacity_of(cpu);
+	}
+
+	ns->weight = cpumask_weight(cpumask_of_node(nid));
+
+	ns->node_type = numa_classify(env->imbalance_pct, ns);
+}
+
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
@@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
+
+	/* If dst node has spare capacity, there is no real load imbalance */
+	if (env->dst_stats.node_type == node_has_spare)
+		return false;
+
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
 	taskweight = task_weight(p, env.src_nid, dist);
 	groupweight = group_weight(p, env.src_nid, dist);
-	update_numa_stats(&env.src_stats, env.src_nid);
+	update_numa_stats(&env, &env.src_stats, env.src_nid);
 	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
 	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
-	update_numa_stats(&env.dst_stats, env.dst_nid);
+	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
 	task_numa_find_cpu(&env, taskimp, groupimp);
@@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 			env.dist = dist;
 			env.dst_nid = nid;
-			update_numa_stats(&env.dst_stats, env.dst_nid);
+			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
@@ -5446,6 +5489,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
+static unsigned long cpu_runnable_load(struct rq *rq)
+{
+	return cfs_rq_runnable_load_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
-- 
2.17.1



* [PATCH v2 3/5] sched/pelt: Remove unused runnable load average
  2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
  2020-02-14 15:27 ` [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path Vincent Guittot
  2020-02-14 15:27 ` [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg Vincent Guittot
@ 2020-02-14 15:27 ` Vincent Guittot
  2020-02-21  9:57   ` Dietmar Eggemann
  2020-02-14 15:27 ` [PATCH v2 4/5] sched/pelt: Add a new runnable average signal Vincent Guittot
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-14 15:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton, Vincent Guittot

Now that runnable_load_avg is no longer used, we can remove it to make
room for a new signal.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |   5 +-
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |   8 ---
 kernel/sched/fair.c   | 136 +++++-------------------------------------
 kernel/sched/pelt.c   |  62 ++++++++-----------
 kernel/sched/sched.h  |   7 +--
 6 files changed, 41 insertions(+), 179 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 467d26046416..f1cab3df8386 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -357,7 +357,7 @@ struct util_est {
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
- * (see __update_load_avg() in kernel/sched/fair.c).
+ * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
@@ -401,11 +401,9 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
-	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
-	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -449,7 +447,6 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
-	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 377ec26e9159..6bd7686c8dc2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -761,7 +761,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	if (task_has_idle_policy(p)) {
 		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
-		p->se.runnable_weight = load->weight;
 		return;
 	}
 
@@ -774,7 +773,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	} else {
 		load->weight = scale_load(sched_prio_to_weight[prio]);
 		load->inv_weight = sched_prio_to_wmult[prio];
-		p->se.runnable_weight = load->weight;
 	}
 }
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 879d3ccf3806..cfecaad387c0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -402,11 +402,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	}
 
 	P(se->load.weight);
-	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
-	P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,11 +522,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
-	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -947,13 +942,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
-	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
-	P(se.avg.runnable_load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
-	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12266b248f52..a23110ad96e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -741,9 +741,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * nothing has been attached to the task group yet.
 	 */
 	if (entity_is_task(se))
-		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-
-	se->runnable_weight = se->load.weight;
+		sa->load_avg = scale_load_down(se->load.weight);
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -2877,25 +2875,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 } while (0)
 
 #ifdef CONFIG_SMP
-static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight += se->runnable_weight;
-
-	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
-	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
-}
-
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight -= se->runnable_weight;
-
-	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
-	sub_positive(&cfs_rq->avg.runnable_load_sum,
-		     se_runnable(se) * se->avg.runnable_load_sum);
-}
-
 static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -2911,28 +2890,22 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 #else
 static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight, unsigned long runnable)
+			    unsigned long weight)
 {
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
 		account_entity_dequeue(cfs_rq, se);
-		dequeue_runnable_load_avg(cfs_rq, se);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
@@ -2940,15 +2913,12 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-		se->avg.runnable_load_avg =
-			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
 	} while (0);
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
 		account_entity_enqueue(cfs_rq, se);
-		enqueue_runnable_load_avg(cfs_rq, se);
 	}
 }
 
@@ -2959,7 +2929,7 @@ void reweight_task(struct task_struct *p, int prio)
 	struct load_weight *load = &se->load;
 	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
 
-	reweight_entity(cfs_rq, se, weight, weight);
+	reweight_entity(cfs_rq, se, weight);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -3071,50 +3041,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 	 */
 	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
-
-/*
- * This calculates the effective runnable weight for a group entity based on
- * the group entity weight calculated above.
- *
- * Because of the above approximation (2), our group entity weight is
- * an load_avg based ratio (3). This means that it includes blocked load and
- * does not represent the runnable weight.
- *
- * Approximate the group entity's runnable weight per ratio from the group
- * runqueue:
- *
- *					     grq->avg.runnable_load_avg
- *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
- *						 grq->avg.load_avg
- *
- * However, analogous to above, since the avg numbers are slow, this leads to
- * transients in the from-idle case. Instead we use:
- *
- *   ge->runnable_weight = ge->load.weight *
- *
- *		max(grq->avg.runnable_load_avg, grq->runnable_weight)
- *		-----------------------------------------------------	(8)
- *		      max(grq->avg.load_avg, grq->load.weight)
- *
- * Where these max() serve both to use the 'instant' values to fix the slow
- * from-idle and avoid the /0 on to-idle, similar to (6).
- */
-static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
-{
-	long runnable, load_avg;
-
-	load_avg = max(cfs_rq->avg.load_avg,
-		       scale_load_down(cfs_rq->load.weight));
-
-	runnable = max(cfs_rq->avg.runnable_load_avg,
-		       scale_load_down(cfs_rq->runnable_weight));
-
-	runnable *= shares;
-	if (load_avg)
-		runnable /= load_avg;
-
-	return clamp_t(long, runnable, MIN_SHARES, shares);
-}
 #endif /* CONFIG_SMP */
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -3126,7 +3052,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long shares, runnable;
+	long shares;
 
 	if (!gcfs_rq)
 		return;
@@ -3135,16 +3061,15 @@ static void update_cfs_group(struct sched_entity *se)
 		return;
 
 #ifndef CONFIG_SMP
-	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
+	shares = READ_ONCE(gcfs_rq->tg->shares);
 
 	if (likely(se->load.weight == shares))
 		return;
 #else
 	shares   = calc_group_shares(gcfs_rq);
-	runnable = calc_group_runnable(gcfs_rq, shares);
 #endif
 
-	reweight_entity(cfs_rq_of(se), se, shares, runnable);
+	reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -3269,11 +3194,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial and simply copies the running
- * sum over (but still wrong, because the group entity and group rq do not have
- * their PELT windows aligned).
+ * Per the above update_tg_cfs_util() is trivial  * and simply copies the
+ * running sum over (but still wrong, because the group entity and group rq do
+ * not have their PELT windows aligned).
  *
- * However, update_tg_cfs_runnable() is more complex. So we have:
+ * However, update_tg_cfs_load() is more complex. So we have:
  *
  *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
  *
@@ -3354,11 +3279,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 }
 
 static inline void
-update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
-	unsigned long runnable_load_avg, load_avg;
-	u64 runnable_load_sum, load_sum = 0;
+	unsigned long load_avg;
+	u64 load_sum = 0;
 	s64 delta_sum;
 
 	if (!runnable_sum)
@@ -3406,20 +3331,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
 	se->avg.load_avg = load_avg;
 	add_positive(&cfs_rq->avg.load_avg, delta_avg);
 	add_positive(&cfs_rq->avg.load_sum, delta_sum);
-
-	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
-	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
-
-	if (se->on_rq) {
-		delta_sum = runnable_load_sum -
-				se_weight(se) * se->avg.runnable_load_sum;
-		delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
-		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
-	}
-
-	se->avg.runnable_load_sum = runnable_sum;
-	se->avg.runnable_load_avg = runnable_load_avg;
 }
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3447,7 +3358,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
-	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
 	trace_pelt_se_tp(se);
@@ -3592,8 +3503,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
 	}
 
-	se->avg.runnable_load_sum = se->avg.load_sum;
-
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
@@ -3728,11 +3637,6 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->avg.runnable_load_avg;
-}
-
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->avg.load_avg;
@@ -4059,14 +3963,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	update_cfs_group(se);
-	enqueue_runnable_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -4143,13 +4045,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Subtract its load from the cfs_rq->runnable_avg.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
-	dequeue_runnable_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -5489,11 +5389,6 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
-static unsigned long cpu_runnable_load(struct rq *rq)
-{
-	return cfs_rq_runnable_load_avg(&rq->cfs);
-}
-
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -7625,9 +7520,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index bd006b79b360..3eb0ed333dcb 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
+	       unsigned long load, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -121,8 +121,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -148,8 +146,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
-	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -186,7 +182,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
+		  unsigned long load, int running)
 {
 	u64 delta;
 
@@ -222,7 +218,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Also see the comment in accumulate_sum().
 	 */
 	if (!load)
-		runnable = running = 0;
+		running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -231,14 +227,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, runnable, running))
+	if (!accumulate_sum(delta, sa, load, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+___update_load_avg(struct sched_avg *sa, unsigned long load)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
@@ -246,7 +242,6 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -254,17 +249,13 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  * sched_entity:
  *
  *   task:
- *     se_runnable() == se_weight()
+ *     se_weight()   = se->load.weight
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
- *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
  *
- *   load_sum := runnable_sum
- *   load_avg = se_weight(se) * runnable_avg
- *
- *   runnable_load_sum := runnable_sum
- *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *   load_sum := runnable
+ *   load_avg = se_weight(se) * load_sum
  *
  * XXX collapse load_sum and runnable_load_sum
  *
@@ -272,15 +263,12 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
- *
- *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+	if (___update_load_sum(now, &se->avg, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se));
 		trace_pelt_se_tp(se);
 		return 1;
 	}
@@ -290,10 +278,9 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
-				cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
 		trace_pelt_se_tp(se);
 		return 1;
@@ -306,10 +293,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
 				cfs_rq->curr != NULL)) {
 
-		___update_load_avg(&cfs_rq->avg, 1, 1);
+		___update_load_avg(&cfs_rq->avg, 1);
 		trace_pelt_cfs_tp(cfs_rq);
 		return 1;
 	}
@@ -322,20 +308,19 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
  *
- *   load_avg and runnable_load_avg are not supported and meaningless.
+ *   load_avg is not supported and meaningless.
  *
  */
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_rt, 1, 1);
+		___update_load_avg(&rq->avg_rt, 1);
 		trace_pelt_rt_tp(rq);
 		return 1;
 	}
@@ -348,18 +333,19 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg is not supported and meaningless.
  *
  */
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_dl, 1, 1);
+		___update_load_avg(&rq->avg_dl, 1);
 		trace_pelt_dl_tp(rq);
 		return 1;
 	}
@@ -373,7 +359,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg is not supported and meaningless.
  *
  */
 
@@ -401,16 +389,14 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 * rq->clock += delta with delta >= running
 	 */
 	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
-				0,
 				0,
 				0);
 	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
-				1,
 				1,
 				1);
 
 	if (ret) {
-		___update_load_avg(&rq->avg_irq, 1, 1);
+		___update_load_avg(&rq->avg_irq, 1);
 		trace_pelt_irq_tp(rq);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 878910e8b299..8760b656a349 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -489,7 +489,6 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
-	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
@@ -688,8 +687,10 @@ struct dl_rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
+
 #else
 #define entity_is_task(se)	1
+
 #endif
 
 #ifdef CONFIG_SMP
@@ -701,10 +702,6 @@ static inline long se_weight(struct sched_entity *se)
 	return scale_load_down(se->load.weight);
 }
 
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
 
 static inline bool sched_asym_prefer(int a, int b)
 {
-- 
2.17.1



* [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
                   ` (2 preceding siblings ...)
  2020-02-14 15:27 ` [PATCH v2 3/5] sched/pelt: Remove unused runnable load average Vincent Guittot
@ 2020-02-14 15:27 ` Vincent Guittot
  2020-02-18 14:54   ` Valentin Schneider
                     ` (2 more replies)
  2020-02-14 15:27 ` [PATCH v2 5/5] sched/fair: Take into account runnable_avg to classify group Vincent Guittot
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-14 15:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton, Vincent Guittot

Now that runnable_load_avg has been removed, we can replace it by a new
signal that will highlight the runnable pressure on a cfs_rq. This signal
tracks the waiting time of tasks on the rq and can help to better define
the state of rqs.

Currently, only util_avg is used to define the state of a rq:
  A rq with more than around 80% utilization and more than 1 task is
  considered as overloaded.

But the util_avg signal of a rq can become temporarily low after a task
has migrated to another rq, which can bias the classification of the rq.

When tasks compete for the same rq, their runnable average signal will be
higher than util_avg as it will include the waiting time, and we can use
this signal to better classify cfs_rqs.

The new runnable_avg will track the runnable time of a task, which simply
adds the waiting time to the running time. The runnable_avg of a cfs_rq
will be the /Sum of the se's runnable_avg, and the runnable_avg of a group
entity will follow the one of the rq, similarly to util_avg.
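
As a made-up worked example: a task that runs 30% of the time and waits on
the rq for another 50% of the time has running% = 0.30 but runnable% = 0.80,
so with SCHED_CAPACITY_SCALE = 1024 one would expect roughly:

  util_avg     ~= 0.30 * 1024 ~= 307
  runnable_avg ~= (0.30 + 0.50) * 1024 ~= 819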

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 28 +++++++++------
 kernel/sched/debug.c  |  9 +++--
 kernel/sched/fair.c   | 79 ++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/pelt.c   | 51 +++++++++++++++++++---------
 kernel/sched/sched.h  | 22 +++++++++++-
 5 files changed, 151 insertions(+), 38 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f1cab3df8386..90cac1ba3c0b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -356,28 +356,30 @@ struct util_est {
 } __attribute__((__aligned__(sizeof(u64))));
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series
+ * The load/runnable/util_avg accumulates an infinite geometric series
  * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
  *   load_avg = runnable% * scale_load_down(load)
  *
- * where runnable% is the time ratio that a sched_entity is runnable.
- * For cfs_rq, it is the aggregated load_avg of all runnable and
- * blocked sched_entities.
+ * [runnable_avg definition]
  *
- * [util_avg definition]
+ *   runnable_avg = runnable% * SCHED_CAPACITY_SCALE
+ *
+* [util_avg definition]
  *
  *   util_avg = running% * SCHED_CAPACITY_SCALE
  *
- * where running% is the time ratio that a sched_entity is running on
- * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
- * and blocked sched_entities.
+ * where runnable% is the time ratio that a sched_entity is runnable and
+ * running% the time ratio that a sched_entity is running.
+ *
+ * For cfs_rq, they are the aggregated values of all runnable and blocked
+ * sched_entities.
  *
- * load_avg and util_avg don't direcly factor frequency scaling and CPU
- * capacity scaling. The scaling is done through the rq_clock_pelt that
- * is used for computing those signals (see update_rq_clock_pelt())
+ * The load/runnable/util_avg doesn't direcly factor frequency scaling and CPU
+ * capacity scaling. The scaling is done through the rq_clock_pelt that is used
+ * for computing those signals (see update_rq_clock_pelt())
  *
  * N.B., the above ratios (runnable% and running%) themselves are in the
  * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
@@ -401,9 +403,11 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
+	u64				runnable_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
+	unsigned long			runnable_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -467,6 +471,8 @@ struct sched_entity {
 	struct cfs_rq			*cfs_rq;
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq			*my_q;
+	/* cached value of my_q->h_nr_running */
+	unsigned long			runnable_weight;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cfecaad387c0..8331bc04aea2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -405,6 +405,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
+	P(se->avg.runnable_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,6 +525,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_avg",
+			cfs_rq->avg.runnable_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-			cfs_rq->removed.runnable_sum);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_avg",
+			cfs_rq->removed.runnable_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
@@ -944,8 +947,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
+	P(se.avg.runnable_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
+	P(se.avg.runnable_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a23110ad96e8..3b6a90d6315c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,8 +740,10 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * Group entities are initialized with zero load to reflect the fact that
 	 * nothing has been attached to the task group yet.
 	 */
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		sa->runnable_avg = SCHED_CAPACITY_SCALE;
 		sa->load_avg = scale_load_down(se->load.weight);
+	}
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -3194,9 +3196,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial  * and simply copies the
- * running sum over (but still wrong, because the group entity and group rq do
- * not have their PELT windows aligned).
+ * Per the above update_tg_cfs_util() and update_tg_cfs_runnable() are trivial
+ * and simply copies the running/runnable sum over (but still wrong, because
+ * the group entity and group rq do not have their PELT windows aligned).
  *
  * However, update_tg_cfs_load() is more complex. So we have:
  *
@@ -3278,6 +3280,32 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
+static inline void
+update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+{
+	long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
+	/* Set new sched_entity's runnable */
+	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
+	se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq runnable */
+	add_positive(&cfs_rq->avg.runnable_avg, delta);
+	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
+}
+
 static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
@@ -3358,6 +3386,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
 	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
@@ -3428,7 +3457,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
+	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3439,7 +3468,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
-		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
+		swap(cfs_rq->removed.runnable_avg, removed_runnable);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3451,7 +3480,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * divider);
 
-		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
+		r = removed_runnable;
+		sub_positive(&sa->runnable_avg, r);
+		sub_positive(&sa->runnable_sum, r * divider);
+
+		/*
+		 * removed_runnable is the unweighted version of removed_load so we
+		 * can use it to estimate removed_load_sum.
+		 */
+		add_tg_cfs_propagate(cfs_rq,
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
 
 		decayed = 1;
 	}
@@ -3497,6 +3535,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 */
 	se->avg.util_sum = se->avg.util_avg * divider;
 
+	se->avg.runnable_sum = se->avg.runnable_avg * divider;
+
 	se->avg.load_sum = divider;
 	if (se_weight(se)) {
 		se->avg.load_sum =
@@ -3506,6 +3546,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	cfs_rq->avg.runnable_avg += se->avg.runnable_avg;
+	cfs_rq->avg.runnable_sum += se->avg.runnable_sum;
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
@@ -3527,6 +3569,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	dequeue_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
+	sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
@@ -3633,10 +3677,15 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
-	cfs_rq->removed.runnable_sum	+= se->avg.load_sum; /* == runnable_sum */
+	cfs_rq->removed.runnable_avg	+= se->avg.runnable_avg;
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
+static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->avg.runnable_avg;
+}
+
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->avg.load_avg;
@@ -3963,11 +4012,13 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Add its load to cfs_rq->runnable_avg
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
+	se_update_runnable(se);
 	update_cfs_group(se);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -4045,11 +4096,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Subtract its load from the cfs_rq->runnable_avg.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
+	se_update_runnable(se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -5220,6 +5273,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running++;
@@ -5317,6 +5371,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running--;
@@ -5389,6 +5444,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
+static unsigned long cpu_runnable(struct rq *rq)
+{
+	return cfs_rq_runnable_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -7520,6 +7580,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
+	if (cfs_rq->avg.runnable_sum)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 3eb0ed333dcb..d6ec21450ffa 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, int running)
+	       unsigned long load, unsigned long runnable, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -121,6 +121,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
+		sa->runnable_sum =
+			decay_load(sa->runnable_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -146,6 +148,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -182,7 +186,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, int running)
+		  unsigned long load, unsigned long runnable, int running)
 {
 	u64 delta;
 
@@ -218,7 +222,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Also see the comment in accumulate_sum().
 	 */
 	if (!load)
-		running = 0;
+		runnable = running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -227,14 +231,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, running))
+	if (!accumulate_sum(delta, sa, load, runnable, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load)
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
@@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_avg =	div_u64(runnable * sa->runnable_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -250,9 +255,14 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
  *
  *   task:
  *     se_weight()   = se->load.weight
+ *     se_runnable() = !!on_rq
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = grq->h_nr_running
+ *
+ *   runnable_sum = se_runnable() * runnable = grq->runnable_sum
+ *   runnable_avg = runnable_sum
  *
  *   load_sum := runnable
  *   load_avg = se_weight(se) * load_sum
@@ -261,14 +271,17 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
  *
  * cfq_rq:
  *
+ *   runnable_sum = \Sum se->avg.runnable_sum
+ *   runnable_avg = \Sum se->avg.runnable_avg
+ *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se));
+	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se), 1);
 		trace_pelt_se_tp(se);
 		return 1;
 	}
@@ -278,9 +291,10 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
+				cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se));
+		___update_load_avg(&se->avg, se_weight(se), 1);
 		cfs_se_util_change(&se->avg);
 		trace_pelt_se_tp(se);
 		return 1;
@@ -293,9 +307,10 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
+				cfs_rq->h_nr_running,
 				cfs_rq->curr != NULL)) {
 
-		___update_load_avg(&cfs_rq->avg, 1);
+		___update_load_avg(&cfs_rq->avg, 1, 1);
 		trace_pelt_cfs_tp(cfs_rq);
 		return 1;
 	}
@@ -310,17 +325,18 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
+				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_rt, 1);
+		___update_load_avg(&rq->avg_rt, 1, 1);
 		trace_pelt_rt_tp(rq);
 		return 1;
 	}
@@ -335,17 +351,18 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
+				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_dl, 1);
+		___update_load_avg(&rq->avg_dl, 1, 1);
 		trace_pelt_dl_tp(rq);
 		return 1;
 	}
@@ -361,7 +378,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
@@ -389,14 +406,16 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 * rq->clock += delta with delta >= running
 	 */
 	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
+				0,
 				0,
 				0);
 	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
+				1,
 				1,
 				1);
 
 	if (ret) {
-		___update_load_avg(&rq->avg_irq, 1);
+		___update_load_avg(&rq->avg_irq, 1, 1);
 		trace_pelt_irq_tp(rq);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8760b656a349..5a88d7665f63 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -527,7 +527,7 @@ struct cfs_rq {
 		int		nr;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
-		unsigned long	runnable_sum;
+		unsigned long	runnable_avg;
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -688,9 +688,29 @@ struct dl_rq {
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
 
+static inline void se_update_runnable(struct sched_entity *se)
+{
+	if (!entity_is_task(se))
+		se->runnable_weight = se->my_q->h_nr_running;
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return !!se->on_rq;
+	else
+		return se->runnable_weight;
+}
+
 #else
 #define entity_is_task(se)	1
 
+static inline void se_update_runnable(struct sched_entity *se) {}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	return !!se->on_rq;
+}
 #endif
 
 #ifdef CONFIG_SMP
-- 
2.17.1
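
For a feel of what the new signal converges to, here is a minimal standalone
sketch (plain userspace C, not kernel code) of the Step 2 arithmetic in
___update_load_avg() above. LOAD_AVG_MAX = 47742 matches the kernel's
sched-pelt.h, but the period_contrib and the saturated runnable_sum below are
assumptions picked only for illustration. With ___update_load_avg(&cfs_rq->avg,
1, 1) as in the cfs_rq hunk above, a cfs_rq whose two tasks are runnable all of
the time settles at a runnable_avg of roughly 2 * 1024:

/*
 * Toy illustration of ___update_load_avg()'s Step 2 for the new runnable
 * signal.  Assumptions (not taken from the patch): period_contrib = 1023
 * and a permanently runnable contribution saturates its *_sum at about
 * LOAD_AVG_MAX * 1024 per runnable task.
 */
#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_MAX 47742ULL

int main(void)
{
	uint32_t period_contrib = 1023;			/* hypothetical */
	uint64_t divider = LOAD_AVG_MAX - 1024 + period_contrib;

	/* cfs_rq with h_nr_running == 2, both tasks runnable all the time */
	uint64_t runnable_sum = 2 * LOAD_AVG_MAX * 1024;
	unsigned long runnable = 1;			/* cfs_rq passes 1 */

	unsigned long runnable_avg = runnable * runnable_sum / divider;

	printf("runnable_avg ~= %lu (about 1024 per runnable task)\n",
	       runnable_avg);
	return 0;
}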


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 5/5] sched/fair: Take into account runnable_avg to classify group
  2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
                   ` (3 preceding siblings ...)
  2020-02-14 15:27 ` [PATCH v2 4/5] sched/pelt: Add a new runnable average signal Vincent Guittot
@ 2020-02-14 15:27 ` Vincent Guittot
  2020-02-15 21:58 ` [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Mel Gorman
  2020-02-21  9:58 ` Dietmar Eggemann
  6 siblings, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-14 15:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton, Vincent Guittot

Take into account the new runnable_avg signal to classify a group and to
mitigate the volatility of util_avg in face of intensive migration.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b6a90d6315c..99249a2484b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7718,7 +7718,8 @@ struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long group_capacity;
-	unsigned long group_util; /* Total utilization of the group */
+	unsigned long group_util; /* Total utilization over the CPUs of the group */
+	unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
 	unsigned int sum_nr_running; /* Nr of tasks running in the group */
 	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
 	unsigned int idle_cpus;
@@ -7939,6 +7940,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 	if (sgs->sum_nr_running < sgs->group_weight)
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return false;
+
 	if ((sgs->group_capacity * 100) >
 			(sgs->group_util * imbalance_pct))
 		return true;
@@ -7964,6 +7969,10 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 			(sgs->group_util * imbalance_pct))
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return true;
+
 	return false;
 }
 
@@ -8058,6 +8067,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
+		sgs->group_runnable += cpu_runnable(rq);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
-- 
2.17.1
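
To see why the extra check matters, here is a standalone sketch (not kernel
code) of the two comparisons used above. The imbalance_pct of 117 and all of
the capacity/util/runnable numbers are made up purely for illustration (the
real value comes from the sched_domain). It mirrors the scenario from the
cover letter: right after a task migrates away, util looks low while runnable
still records the time tasks spent waiting, so the new test keeps the group
from being classified as having spare capacity:

/*
 * Standalone sketch of the group_has_capacity()/group_is_overloaded()
 * checks added above; the numbers are invented for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

struct sg_stats {
	unsigned long group_capacity;
	unsigned long group_util;
	unsigned long group_runnable;
};

static bool runnable_pressured(unsigned int imbalance_pct, struct sg_stats *sgs)
{
	/* same form as the new test: capacity * imbalance_pct < runnable * 100 */
	return (sgs->group_capacity * imbalance_pct) <
	       (sgs->group_runnable * 100);
}

int main(void)
{
	/*
	 * A 4-CPU group right after a migration: utilization has dropped,
	 * but the remaining tasks were queuing behind each other, so the
	 * runnable sum is still well above the group capacity.
	 */
	struct sg_stats sgs = {
		.group_capacity	= 4 * 1024,
		.group_util	= 2 * 1024,	/* looks like spare capacity */
		.group_runnable	= 6 * 1024,	/* waiting time is still high */
	};
	unsigned int imbalance_pct = 117;

	printf("util check alone says has_capacity: %d\n",
	       (sgs.group_capacity * 100) > (sgs.group_util * imbalance_pct));
	printf("runnable check says pressured:      %d\n",
	       runnable_pressured(imbalance_pct, &sgs));
	return 0;
}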


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 0/5] remove runnable_load_avg and improve group_classify
  2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
                   ` (4 preceding siblings ...)
  2020-02-14 15:27 ` [PATCH v2 5/5] sched/fair: Take into account runnable_avg to classify group Vincent Guittot
@ 2020-02-15 21:58 ` Mel Gorman
  2020-02-21  9:58 ` Dietmar Eggemann
  6 siblings, 0 replies; 49+ messages in thread
From: Mel Gorman @ 2020-02-15 21:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider, hdanton

On Fri, Feb 14, 2020 at 04:27:24PM +0100, Vincent Guittot wrote:
> This new version stays quite close to the previous one and should 
> replace without problems the previous one that part of Mel's patchset:
> https://lkml.org/lkml/2020/2/14/156
> 

As far as I can see, the differences are harmless and look sane. I do think
that an additional fix is mandatory, as I see no reason why the regression
would otherwise be fixed. As such, I'll release a v3 of the series that
includes your new patches with the minimal fix inserted where appropriate.
I'll have tests running over the rest of the weekend.

> Some hackbench results:
> 
> - small arm64 dual quad cores system
> hackbench -l (2560/#grp) -g #grp
> 
> grp    tip/sched/core         +patchset              improvement
> 1       1,327(+/-10,06 %)     1,247(+/-5,45 %)       5,97 %
> 4       1,250(+/- 2,55 %)     1,207(+/-2,12 %)       3,42 %
> 8       1,189(+/- 1,47 %)     1,179(+/-1,93 %)       0,90 %
> 16      1,221(+/- 3,25 %)     1,219(+/-2,44 %)       0,16 %						
> 
> - large arm64 2 nodes / 224 cores system
> hackbench -l (256000/#grp) -g #grp
> 
> grp    tip/sched/core         +patchset              improvement
> 1      14,197(+/- 2,73 %)     13,917(+/- 2,19 %)     1,98 %
> 4       6,817(+/- 1,27 %)      6,523(+/-11,96 %)     4,31 %
> 16      2,930(+/- 1,07 %)      2,911(+/- 1,08 %)     0,66 %
> 32      2,735(+/- 1,71 %)      2,725(+/- 1,53 %)     0,37 %
> 64      2,702(+/- 0,32 %)      2,717(+/- 1,07 %)    -0,53 %
> 128     3,533(+/-14,66 %)     3,123(+/-12,47 %)     11,59 %
> 256     3,918(+/-19,93 %)     3,390(+/- 5,93 %)     13,47 %
> 
> The significant improvement for 128 and 256 should be taken with care
> because of some instabilities over iterations without the patchset.
> 

For the most part I do not see similar results to this with hackbench with
one exception -- EPYC first generation. I don't have results for EPYC 2
yet but I'm curious if the machine you have has multiple L3 caches per
NUMA domain? Various Intel CPU generations show improvements but they're
not as dramatic.  Tests will tell me for sure but I have some confidence
that it'll look like

Small tracing patches -- no difference
Vincent Patches 1-2   -- regressions
Fix from Mel          -- small overall improvement on baseline
Vincent patches 3-5   -- small improvements mostly, sometimes big ones
                         on hackbench depending on the machine
Rest of Mel series    -- generally ok across machines and CPU generations

Even if the improvements are not dramatic, I think it'll be worth it to
have the NUMA and CPU balancers using similarly sane logic, and overall
I find the load balancer easier to understand with the new logic, so yey!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-14 15:27 ` [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path Vincent Guittot
@ 2020-02-18 12:37   ` Dietmar Eggemann
  2020-02-18 13:22     ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-18 12:37 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton

On 14/02/2020 16:27, Vincent Guittot wrote:
> The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> is split in 2 distinct parts for throttled cfs_rq without any added value
> but making code less readable.
> 
> Change the code ordering such that everything related to a cfs_rq
> (throttled or not) will be done in the same loop.
> 
> In addition, the same steps ordering is used when updating a cfs_rq:
> - update_load_avg
> - update_cfs_group
> - update *h_nr_running

Is this code change really necessary? You pay with two extra goto's. We
still have the two for_each_sched_entity(se)'s because of 'if
(se->on_rq); break;'.

[...]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-14 15:27 ` [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg Vincent Guittot
@ 2020-02-18 12:37   ` Dietmar Eggemann
  2020-02-18 13:50     ` Mel Gorman
  2020-02-18 14:54   ` Valentin Schneider
  1 sibling, 1 reply; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-18 12:37 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton

On 14/02/2020 16:27, Vincent Guittot wrote:

[...]

>  	/*
>  	 * The load is corrected for the CPU capacity available on each node.
>  	 *
> @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
>  	taskweight = task_weight(p, env.src_nid, dist);
>  	groupweight = group_weight(p, env.src_nid, dist);
> -	update_numa_stats(&env.src_stats, env.src_nid);
> +	update_numa_stats(&env, &env.src_stats, env.src_nid);

This looks strange. Can you do:

-static void update_numa_stats(struct task_numa_env *env,
+static void update_numa_stats(unsigned int imbalance_pct,
                              struct numa_stats *ns, int nid)

-    update_numa_stats(&env, &env.src_stats, env.src_nid);
+    update_numa_stats(env.imbalance_pct, &env.src_stats, env.src_nid);

[...]

> +static unsigned long cpu_runnable_load(struct rq *rq)
> +{
> +	return cfs_rq_runnable_load_avg(&rq->cfs);
> +}
> +

Why not remove cpu_runnable_load() in this patch rather moving it?

kernel/sched/fair.c:5492:22: warning: ‘cpu_runnable_load’ defined but
not used [-Wunused-function]
 static unsigned long cpu_runnable_load(struct rq *rq)



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-18 12:37   ` Dietmar Eggemann
@ 2020-02-18 13:22     ` Peter Zijlstra
  2020-02-18 14:15       ` Vincent Guittot
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2020-02-18 13:22 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, juri.lelli, rostedt, bsegall, mgorman,
	linux-kernel, pauld, parth, valentin.schneider, hdanton

On Tue, Feb 18, 2020 at 01:37:37PM +0100, Dietmar Eggemann wrote:
> On 14/02/2020 16:27, Vincent Guittot wrote:
> > The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> > is split in 2 distinct parts for throttled cfs_rq without any added value
> > but making code less readable.
> > 
> > Change the code ordering such that everything related to a cfs_rq
> > (throttled or not) will be done in the same loop.
> > 
> > In addition, the same steps ordering is used when updating a cfs_rq:
> > - update_load_avg
> > - update_cfs_group
> > - update *h_nr_running
> 
> Is this code change really necessary? You pay with two extra goto's. We
> still have the two for_each_sched_entity(se)'s because of 'if
> (se->on_rq); break;'.

IIRC he relies on the presented ordering in patch #5 -- adding the
runnable_avg metric.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 12:37   ` Dietmar Eggemann
@ 2020-02-18 13:50     ` Mel Gorman
  2020-02-18 14:17       ` Vincent Guittot
  0 siblings, 1 reply; 49+ messages in thread
From: Mel Gorman @ 2020-02-18 13:50 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	linux-kernel, pauld, parth, valentin.schneider, hdanton

On Tue, Feb 18, 2020 at 01:37:45PM +0100, Dietmar Eggemann wrote:
> On 14/02/2020 16:27, Vincent Guittot wrote:
> 
> [...]
> 
> >  	/*
> >  	 * The load is corrected for the CPU capacity available on each node.
> >  	 *
> > @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
> >  	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
> >  	taskweight = task_weight(p, env.src_nid, dist);
> >  	groupweight = group_weight(p, env.src_nid, dist);
> > -	update_numa_stats(&env.src_stats, env.src_nid);
> > +	update_numa_stats(&env, &env.src_stats, env.src_nid);
> 
> This looks strange. Can you do:
> 
> -static void update_numa_stats(struct task_numa_env *env,
> +static void update_numa_stats(unsigned int imbalance_pct,
>                               struct numa_stats *ns, int nid)
> 
> -    update_numa_stats(&env, &env.src_stats, env.src_nid);
> +    update_numa_stats(env.imbalance_pct, &env.src_stats, env.src_nid);
> 

You'd also have to pass in env->p and while it could be done, I do not
think it's worthwhile.

> [...]
> 
> > +static unsigned long cpu_runnable_load(struct rq *rq)
> > +{
> > +	return cfs_rq_runnable_load_avg(&rq->cfs);
> > +}
> > +
> 
> Why not remove cpu_runnable_load() in this patch rather moving it?
> 
> kernel/sched/fair.c:5492:22: warning: ‘cpu_runnable_load’ defined but
> not used [-Wunused-function]
>  static unsigned long cpu_runnable_load(struct rq *rq)
> 

I took the liberty of addressing that when I picked up Vincent's patches
for "Reconcile NUMA balancing decisions with the load balancer v3" to fix
a build warning. I did not highlight it when I posted because it was such
a trivial change.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-18 13:22     ` Peter Zijlstra
@ 2020-02-18 14:15       ` Vincent Guittot
  2020-02-19 11:07         ` Dietmar Eggemann
  0 siblings, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-18 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Ingo Molnar, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider, Hillf Danton

On Tue, 18 Feb 2020 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 18, 2020 at 01:37:37PM +0100, Dietmar Eggemann wrote:
> > On 14/02/2020 16:27, Vincent Guittot wrote:
> > > The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> > > is split in 2 distinct parts for throttled cfs_rq without any added value
> > > but making code less readable.
> > >
> > > Change the code ordering such that everything related to a cfs_rq
> > > (throttled or not) will be done in the same loop.
> > >
> > > In addition, the same steps ordering is used when updating a cfs_rq:
> > > - update_load_avg
> > > - update_cfs_group
> > > - update *h_nr_running
> >
> > Is this code change really necessary? You pay with two extra goto's. We
> > still have the two for_each_sched_entity(se)'s because of 'if
> > (se->on_rq); break;'.
>
> IIRC he relies on the presented ordering in patch #5 -- adding the
> runnable_avg metric.

Yes, that's the main reason, updating load_avg before h_nr_running
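
A shape-only sketch (stand-in types, not the actual enqueue path; the
function names mirror the kernel but the bodies are placeholders) of the
per-level ordering being discussed; the counts are arbitrary. The point is
simply that the PELT update in step 1 reads the old h_nr_running, which is
only bumped in step 3:

#include <stdio.h>

struct cfs_rq_stub { unsigned int h_nr_running; };

static void update_load_avg(struct cfs_rq_stub *cfs_rq)
{
	/* decays/accumulates the signals against the *current* count */
	printf("PELT update sees h_nr_running=%u\n", cfs_rq->h_nr_running);
}

static void update_cfs_group(struct cfs_rq_stub *cfs_rq)
{
	(void)cfs_rq;	/* re-weights the group entity; elided here */
}

int main(void)
{
	/* one stub cfs_rq per level of the cgroup hierarchy */
	struct cfs_rq_stub levels[3] = { {1}, {2}, {4} };

	for (int i = 0; i < 3; i++) {
		update_load_avg(&levels[i]);	/* 1) update_load_avg      */
		update_cfs_group(&levels[i]);	/* 2) update_cfs_group     */
		levels[i].h_nr_running++;	/* 3) update *h_nr_running */
	}
	return 0;
}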

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 13:50     ` Mel Gorman
@ 2020-02-18 14:17       ` Vincent Guittot
  2020-02-18 14:42         ` Dietmar Eggemann
  0 siblings, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-18 14:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider, Hillf Danton

On Tue, 18 Feb 2020 at 14:51, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Feb 18, 2020 at 01:37:45PM +0100, Dietmar Eggemann wrote:
> > On 14/02/2020 16:27, Vincent Guittot wrote:
> >
> > [...]
> >
> > >     /*
> > >      * The load is corrected for the CPU capacity available on each node.
> > >      *
> > > @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
> > >     dist = env.dist = node_distance(env.src_nid, env.dst_nid);
> > >     taskweight = task_weight(p, env.src_nid, dist);
> > >     groupweight = group_weight(p, env.src_nid, dist);
> > > -   update_numa_stats(&env.src_stats, env.src_nid);
> > > +   update_numa_stats(&env, &env.src_stats, env.src_nid);
> >
> > This looks strange. Can you do:
> >
> > -static void update_numa_stats(struct task_numa_env *env,
> > +static void update_numa_stats(unsigned int imbalance_pct,
> >                               struct numa_stats *ns, int nid)
> >
> > -    update_numa_stats(&env, &env.src_stats, env.src_nid);
> > +    update_numa_stats(env.imbalance_pct, &env.src_stats, env.src_nid);
> >
>
> You'd also have to pass in env->p and while it could be done, I do not
> think it's worthwhile.

I agree

>
> > [...]
> >
> > > +static unsigned long cpu_runnable_load(struct rq *rq)
> > > +{
> > > +   return cfs_rq_runnable_load_avg(&rq->cfs);
> > > +}
> > > +
> >
> > Why not remove cpu_runnable_load() in this patch rather moving it?
> >
> > kernel/sched/fair.c:5492:22: warning: ‘cpu_runnable_load’ defined but
> > not used [-Wunused-function]
> >  static unsigned long cpu_runnable_load(struct rq *rq)
> >
>
> I took the liberty of addressing that when I picked up Vincent's patches
> for "Reconcile NUMA balancing decisions with the load balancer v3" to fix
> a build warning. I did not highlight it when I posted because it was such
> a trivial change.

yes I have noticed that.
Thanks

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 14:17       ` Vincent Guittot
@ 2020-02-18 14:42         ` Dietmar Eggemann
  0 siblings, 0 replies; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-18 14:42 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider, Hillf Danton

On 18/02/2020 15:17, Vincent Guittot wrote:
> On Tue, 18 Feb 2020 at 14:51, Mel Gorman <mgorman@suse.de> wrote:
>>
>> On Tue, Feb 18, 2020 at 01:37:45PM +0100, Dietmar Eggemann wrote:
>>> On 14/02/2020 16:27, Vincent Guittot wrote:

[...]

>>> -static void update_numa_stats(struct task_numa_env *env,
>>> +static void update_numa_stats(unsigned int imbalance_pct,
>>>                               struct numa_stats *ns, int nid)
>>>
>>> -    update_numa_stats(&env, &env.src_stats, env.src_nid);
>>> +    update_numa_stats(env.imbalance_pct, &env.src_stats, env.src_nid);
>>>
>>
>> You'd also have to pass in env->p and while it could be done, I do not
>> think its worthwhile.
> 
> I agree

Ah, another patch in Mel's patch-set:
https://lore.kernel.org/r/20200217104402.11643-11-mgorman@techsingularity.net

I see.

[...]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-14 15:27 ` [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg Vincent Guittot
  2020-02-18 12:37   ` Dietmar Eggemann
@ 2020-02-18 14:54   ` Valentin Schneider
  2020-02-18 15:33     ` Vincent Guittot
  2020-02-18 15:38     ` Mel Gorman
  1 sibling, 2 replies; 49+ messages in thread
From: Valentin Schneider @ 2020-02-18 14:54 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, linux-kernel
  Cc: pauld, parth, hdanton

On 2/14/20 3:27 PM, Vincent Guittot wrote:
> @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
>  	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
>  }
>  
> -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> -
> -static unsigned long cpu_runnable_load(struct rq *rq)
> -{
> -	return cfs_rq_runnable_load_avg(&rq->cfs);
> -}
> +/*
> + * 'numa_type' describes the node at the moment of load balancing.
> + */
> +enum numa_type {
> +	/* The node has spare capacity that can be used to run more tasks.  */
> +	node_has_spare = 0,
> +	/*
> +	 * The node is fully used and the tasks don't compete for more CPU
> +	 * cycles. Nevertheless, some tasks might wait before running.
> +	 */
> +	node_fully_busy,
> +	/*
> +	 * The node is overloaded and can't provide expected CPU cycles to all
> +	 * tasks.
> +	 */
> +	node_overloaded
> +};

Could we reuse group_type instead? The definitions are the same modulo
s/group/node/.

>  
>  /* Cached statistics for all CPUs within a node */
>  struct numa_stats {
>  	unsigned long load;
> -
> +	unsigned long util;
>  	/* Total compute capacity of CPUs on a node */
>  	unsigned long compute_capacity;
> +	unsigned int nr_running;
> +	unsigned int weight;
> +	enum numa_type node_type;
>  };
>  
> -/*
> - * XXX borrowed from update_sg_lb_stats
> - */
> -static void update_numa_stats(struct numa_stats *ns, int nid)
> -{
> -	int cpu;
> -
> -	memset(ns, 0, sizeof(*ns));
> -	for_each_cpu(cpu, cpumask_of_node(nid)) {
> -		struct rq *rq = cpu_rq(cpu);
> -
> -		ns->load += cpu_runnable_load(rq);
> -		ns->compute_capacity += capacity_of(cpu);
> -	}
> -
> -}
> -
>  struct task_numa_env {
>  	struct task_struct *p;
>  
> @@ -1521,6 +1518,47 @@ struct task_numa_env {
>  	int best_cpu;
>  };
>  
> +static unsigned long cpu_load(struct rq *rq);
> +static unsigned long cpu_util(int cpu);
> +
> +static inline enum
> +numa_type numa_classify(unsigned int imbalance_pct,
> +			 struct numa_stats *ns)
> +{
> +	if ((ns->nr_running > ns->weight) &&
> +	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> +		return node_overloaded;
> +
> +	if ((ns->nr_running < ns->weight) ||
> +	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> +		return node_has_spare;
> +
> +	return node_fully_busy;
> +}
> +

As Mel pointed out, this is group_is_overloaded() and group_has_capacity().
@Mel, you mentioned having a common helper, do you have that laying around?
I haven't seen it in your reconciliation series.

What I'm naively thinking here is that we could either move the whole
thing to just sg_lb_stats (AFAICT the fields of numa_stats are a subset of it),
or if we really care about the stack we could tweak the ordering to ensure
we can cast one into the other (not too enticed by that one though).

> +/*
> + * XXX borrowed from update_sg_lb_stats
> + */
> +static void update_numa_stats(struct task_numa_env *env,
> +			      struct numa_stats *ns, int nid)
> +{
> +	int cpu;
> +
> +	memset(ns, 0, sizeof(*ns));
> +	for_each_cpu(cpu, cpumask_of_node(nid)) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		ns->load += cpu_load(rq);
> +		ns->util += cpu_util(cpu);
> +		ns->nr_running += rq->cfs.h_nr_running;
> +		ns->compute_capacity += capacity_of(cpu);
> +	}
> +
> +	ns->weight = cpumask_weight(cpumask_of_node(nid));
> +
> +	ns->node_type = numa_classify(env->imbalance_pct, ns);
> +}
> +
>  static void task_numa_assign(struct task_numa_env *env,
>  			     struct task_struct *p, long imp)
>  {

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-14 15:27 ` [PATCH v2 4/5] sched/pelt: Add a new runnable average signal Vincent Guittot
@ 2020-02-18 14:54   ` Valentin Schneider
  2020-02-18 15:12     ` Peter Zijlstra
  2020-02-18 15:28     ` Vincent Guittot
  2020-02-18 21:19   ` Valentin Schneider
  2020-02-19 12:55   ` [PATCH v3 " Vincent Guittot
  2 siblings, 2 replies; 49+ messages in thread
From: Valentin Schneider @ 2020-02-18 14:54 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, linux-kernel
  Cc: pauld, parth, hdanton

On 2/14/20 3:27 PM, Vincent Guittot wrote:
> @@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  			cfs_rq->removed.load_avg);
>  	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
>  			cfs_rq->removed.util_avg);
> -	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> -			cfs_rq->removed.runnable_sum);

Shouldn't that have been part of patch 3?

> +	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_avg",
> +			cfs_rq->removed.runnable_avg);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
>  			cfs_rq->tg_load_avg_contrib);
> @@ -3278,6 +3280,32 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
>  	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
>  }
>  
> +static inline void
> +update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> +{
> +	long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
> +
> +	/* Nothing to update */
> +	if (!delta)
> +		return;
> +
> +	/*
> +	 * The relation between sum and avg is:
> +	 *
> +	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
> +	 *
> +	 * however, the PELT windows are not aligned between grq and gse.
> +	 */
> +
> +	/* Set new sched_entity's runnable */
> +	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
> +	se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
> +
> +	/* Update parent cfs_rq runnable */
> +	add_positive(&cfs_rq->avg.runnable_avg, delta);
> +	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
> +}
> +

Humph, that's an exact copy of update_tg_cfs_util(). FWIW the following
eldritch horror compiles...

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99249a2484b4..be796532a2d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3254,57 +3254,34 @@ void set_task_rq_fair(struct sched_entity *se,
  *
  */
 
-static inline void
-update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
-{
-	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
-
-	/* Nothing to update */
-	if (!delta)
-		return;
-
-	/*
-	 * The relation between sum and avg is:
-	 *
-	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
-	 *
-	 * however, the PELT windows are not aligned between grq and gse.
-	 */
-
-	/* Set new sched_entity's utilization */
-	se->avg.util_avg = gcfs_rq->avg.util_avg;
-	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
-
-	/* Update parent cfs_rq utilization */
-	add_positive(&cfs_rq->avg.util_avg, delta);
-	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
-}
-
-static inline void
-update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
-{
-	long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
-
-	/* Nothing to update */
-	if (!delta)
-		return;
-
-	/*
-	 * The relation between sum and avg is:
-	 *
-	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
-	 *
-	 * however, the PELT windows are not aligned between grq and gse.
-	 */
-
-	/* Set new sched_entity's runnable */
-	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
-	se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
+#define DECLARE_UPDATE_TG_CFS_SIGNAL(signal)				\
+static inline void						\
+update_tg_cfs_##signal(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) \
+{								\
+	long delta = gcfs_rq->avg.signal##_avg - se->avg.signal##_avg; \
+								\
+	/* Nothing to update */					\
+	if (!delta)						\
+		return;						\
+								\
+	/*									\
+	 * The relation between sum and avg is:					\
+	 *									\
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib				\
+	 *									\
+		* however, the PELT windows are not aligned between grq and gse.	\
+	*/									\
+	/* Set new sched_entity's runnable */			\
+	se->avg.signal##_avg = gcfs_rq->avg.signal##_avg;	\
+	se->avg.signal##_sum = se->avg.signal##_avg * LOAD_AVG_MAX; \
+								\
+	/* Update parent cfs_rq signal## */			\
+	add_positive(&cfs_rq->avg.signal##_avg, delta);		\
+	cfs_rq->avg.signal##_sum = cfs_rq->avg.signal##_avg * LOAD_AVG_MAX; \
+}								\
 
-	/* Update parent cfs_rq runnable */
-	add_positive(&cfs_rq->avg.runnable_avg, delta);
-	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
-}
+DECLARE_UPDATE_TG_CFS_SIGNAL(util);
+DECLARE_UPDATE_TG_CFS_SIGNAL(runnable);
 
 static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
---

>  static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
>  {
> @@ -3358,6 +3386,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>  	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
>  
>  	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
> +	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
>  	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
>  
>  	trace_pelt_cfs_tp(cfs_rq);
> @@ -3439,7 +3468,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  		raw_spin_lock(&cfs_rq->removed.lock);
>  		swap(cfs_rq->removed.util_avg, removed_util);
>  		swap(cfs_rq->removed.load_avg, removed_load);
> -		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);

Ditto on the stray from patch 3?

> +		swap(cfs_rq->removed.runnable_avg, removed_runnable);
>  		cfs_rq->removed.nr = 0;
>  		raw_spin_unlock(&cfs_rq->removed.lock);
>  
> @@ -3451,7 +3480,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-18 14:54   ` Valentin Schneider
@ 2020-02-18 15:12     ` Peter Zijlstra
  2020-02-18 15:28     ` Vincent Guittot
  1 sibling, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2020-02-18 15:12 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, linux-kernel, pauld, parth, hdanton

On Tue, Feb 18, 2020 at 02:54:40PM +0000, Valentin Schneider wrote:
> Humph, that's an exact copy of update_tg_cfs_util(). FWIW the following
> eldritch horror compiles...
> 

> +#define DECLARE_UPDATE_TG_CFS_SIGNAL(signal)				\
> +static inline void						\
> +update_tg_cfs_##signal(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) \
> +{								\
> +	long delta = gcfs_rq->avg.signal##_avg - se->avg.signal##_avg; \
> +								\
> +	/* Nothing to update */					\
> +	if (!delta)						\
> +		return;						\
> +								\
> +	/*									\
> +	 * The relation between sum and avg is:					\
> +	 *									\
> +	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib				\
> +	 *									\
> +		* however, the PELT windows are not aligned between grq and gse.	\
> +	*/									\
> +	/* Set new sched_entity's runnable */			\
> +	se->avg.signal##_avg = gcfs_rq->avg.signal##_avg;	\
> +	se->avg.signal##_sum = se->avg.signal##_avg * LOAD_AVG_MAX; \
> +								\
> +	/* Update parent cfs_rq signal## */			\
> +	add_positive(&cfs_rq->avg.signal##_avg, delta);		\
> +	cfs_rq->avg.signal##_sum = cfs_rq->avg.signal##_avg * LOAD_AVG_MAX; \
> +}								\
>  
> -	/* Update parent cfs_rq runnable */
> -	add_positive(&cfs_rq->avg.runnable_avg, delta);
> -	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
> -}
> +DECLARE_UPDATE_TG_CFS_SIGNAL(util);
> +DECLARE_UPDATE_TG_CFS_SIGNAL(runnable);

I'm not sure that's actually better though... :-)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-18 14:54   ` Valentin Schneider
  2020-02-18 15:12     ` Peter Zijlstra
@ 2020-02-18 15:28     ` Vincent Guittot
  2020-02-18 16:30       ` Valentin Schneider
  1 sibling, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-18 15:28 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

On Tue, 18 Feb 2020 at 15:54, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 2/14/20 3:27 PM, Vincent Guittot wrote:
> > @@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> >                       cfs_rq->removed.load_avg);
> >       SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
> >                       cfs_rq->removed.util_avg);
> > -     SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> > -                     cfs_rq->removed.runnable_sum);
>
> Shouldn't that have been part of patch 3?

No this was used to propagate load_avg

>
> > +     SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_avg",
> > +                     cfs_rq->removed.runnable_avg);
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >       SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
> >                       cfs_rq->tg_load_avg_contrib);
> > @@ -3278,6 +3280,32 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
> >       cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
> >  }
> >
> > +static inline void
> > +update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> > +{
> > +     long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
> > +
> > +     /* Nothing to update */
> > +     if (!delta)
> > +             return;
> > +
> > +     /*
> > +      * The relation between sum and avg is:
> > +      *
> > +      *   LOAD_AVG_MAX - 1024 + sa->period_contrib
> > +      *
> > +      * however, the PELT windows are not aligned between grq and gse.
> > +      */
> > +
> > +     /* Set new sched_entity's runnable */
> > +     se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
> > +     se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
> > +
> > +     /* Update parent cfs_rq runnable */
> > +     add_positive(&cfs_rq->avg.runnable_avg, delta);
> > +     cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
> > +}
> > +
>
> Humph, that's an exact copy of update_tg_cfs_util(). FWIW the following
> eldritch horror compiles...
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99249a2484b4..be796532a2d3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3254,57 +3254,34 @@ void set_task_rq_fair(struct sched_entity *se,
>   *
>   */
>
> -static inline void
> -update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> -{
> -       long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
> -
> -       /* Nothing to update */
> -       if (!delta)
> -               return;
> -
> -       /*
> -        * The relation between sum and avg is:
> -        *
> -        *   LOAD_AVG_MAX - 1024 + sa->period_contrib
> -        *
> -        * however, the PELT windows are not aligned between grq and gse.
> -        */
> -
> -       /* Set new sched_entity's utilization */
> -       se->avg.util_avg = gcfs_rq->avg.util_avg;
> -       se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
> -
> -       /* Update parent cfs_rq utilization */
> -       add_positive(&cfs_rq->avg.util_avg, delta);
> -       cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
> -}
> -
> -static inline void
> -update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> -{
> -       long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
> -
> -       /* Nothing to update */
> -       if (!delta)
> -               return;
> -
> -       /*
> -        * The relation between sum and avg is:
> -        *
> -        *   LOAD_AVG_MAX - 1024 + sa->period_contrib
> -        *
> -        * however, the PELT windows are not aligned between grq and gse.
> -        */
> -
> -       /* Set new sched_entity's runnable */
> -       se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
> -       se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
> +#define DECLARE_UPDATE_TG_CFS_SIGNAL(signal)                           \
> +static inline void                                             \
> +update_tg_cfs_##signal(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) \
> +{                                                              \
> +       long delta = gcfs_rq->avg.signal##_avg - se->avg.signal##_avg; \
> +                                                               \
> +       /* Nothing to update */                                 \
> +       if (!delta)                                             \
> +               return;                                         \
> +                                                               \
> +       /*                                                                      \
> +        * The relation between sum and avg is:                                 \
> +        *                                                                      \
> +        *   LOAD_AVG_MAX - 1024 + sa->period_contrib                           \
> +        *                                                                      \
> +               * however, the PELT windows are not aligned between grq and gse.        \
> +       */                                                                      \
> +       /* Set new sched_entity's runnable */                   \
> +       se->avg.signal##_avg = gcfs_rq->avg.signal##_avg;       \
> +       se->avg.signal##_sum = se->avg.signal##_avg * LOAD_AVG_MAX; \
> +                                                               \
> +       /* Update parent cfs_rq signal## */                     \
> +       add_positive(&cfs_rq->avg.signal##_avg, delta);         \
> +       cfs_rq->avg.signal##_sum = cfs_rq->avg.signal##_avg * LOAD_AVG_MAX; \
> +}                                                              \
>
> -       /* Update parent cfs_rq runnable */
> -       add_positive(&cfs_rq->avg.runnable_avg, delta);
> -       cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
> -}
> +DECLARE_UPDATE_TG_CFS_SIGNAL(util);
> +DECLARE_UPDATE_TG_CFS_SIGNAL(runnable);

TBH, I prefer keeping the current version which is easier to read IMO

>
>  static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> ---
>
> >  static inline void
> >  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> >  {
> > @@ -3358,6 +3386,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> >       add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
> >
> >       update_tg_cfs_util(cfs_rq, se, gcfs_rq);
> > +     update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
> >       update_tg_cfs_load(cfs_rq, se, gcfs_rq);
> >
> >       trace_pelt_cfs_tp(cfs_rq);
> > @@ -3439,7 +3468,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >               raw_spin_lock(&cfs_rq->removed.lock);
> >               swap(cfs_rq->removed.util_avg, removed_util);
> >               swap(cfs_rq->removed.load_avg, removed_load);
> > -             swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
>
> Ditto on the stray from patch 3?

same as 1st point above
>
> > +             swap(cfs_rq->removed.runnable_avg, removed_runnable);
> >               cfs_rq->removed.nr = 0;
> >               raw_spin_unlock(&cfs_rq->removed.lock);
> >
> > @@ -3451,7 +3480,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 14:54   ` Valentin Schneider
@ 2020-02-18 15:33     ` Vincent Guittot
  2020-02-18 15:38     ` Mel Gorman
  1 sibling, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-18 15:33 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

On Tue, 18 Feb 2020 at 15:54, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 2/14/20 3:27 PM, Vincent Guittot wrote:
> > @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> >              group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
> >  }
> >
> > -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> > -
> > -static unsigned long cpu_runnable_load(struct rq *rq)
> > -{
> > -     return cfs_rq_runnable_load_avg(&rq->cfs);
> > -}
> > +/*
> > + * 'numa_type' describes the node at the moment of load balancing.
> > + */
> > +enum numa_type {
> > +     /* The node has spare capacity that can be used to run more tasks.  */
> > +     node_has_spare = 0,
> > +     /*
> > +      * The node is fully used and the tasks don't compete for more CPU
> > +      * cycles. Nevertheless, some tasks might wait before running.
> > +      */
> > +     node_fully_busy,
> > +     /*
> > +      * The node is overloaded and can't provide expected CPU cycles to all
> > +      * tasks.
> > +      */
> > +     node_overloaded
> > +};
>
> Could we reuse group_type instead? The definitions are the same modulo
> s/group/node/.

Also, imbalanced might have a meaning, but misfit and asym have no
meaning at the NUMA level.

For now I prefer to keep them separate to ease code readability. Also,
more changes will come on top of this for NUMA balancing, which could
also end up with NUMA-dedicated states.

>
> >
> >  /* Cached statistics for all CPUs within a node */
> >  struct numa_stats {
> >       unsigned long load;
> > -
> > +     unsigned long util;
> >       /* Total compute capacity of CPUs on a node */
> >       unsigned long compute_capacity;
> > +     unsigned int nr_running;
> > +     unsigned int weight;
> > +     enum numa_type node_type;
> >  };
> >
> > -/*
> > - * XXX borrowed from update_sg_lb_stats
> > - */
> > -static void update_numa_stats(struct numa_stats *ns, int nid)
> > -{
> > -     int cpu;
> > -
> > -     memset(ns, 0, sizeof(*ns));
> > -     for_each_cpu(cpu, cpumask_of_node(nid)) {
> > -             struct rq *rq = cpu_rq(cpu);
> > -
> > -             ns->load += cpu_runnable_load(rq);
> > -             ns->compute_capacity += capacity_of(cpu);
> > -     }
> > -
> > -}
> > -
> >  struct task_numa_env {
> >       struct task_struct *p;
> >
> > @@ -1521,6 +1518,47 @@ struct task_numa_env {
> >       int best_cpu;
> >  };
> >
> > +static unsigned long cpu_load(struct rq *rq);
> > +static unsigned long cpu_util(int cpu);
> > +
> > +static inline enum
> > +numa_type numa_classify(unsigned int imbalance_pct,
> > +                      struct numa_stats *ns)
> > +{
> > +     if ((ns->nr_running > ns->weight) &&
> > +         ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> > +             return node_overloaded;
> > +
> > +     if ((ns->nr_running < ns->weight) ||
> > +         ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> > +             return node_has_spare;
> > +
> > +     return node_fully_busy;
> > +}
> > +
>
> As Mel pointed out, this is group_is_overloaded() and group_has_capacity().
> @Mel, you mentioned having a common helper, do you have that laying around?
> I haven't seen it in your reconciliation series.
>
> What I'm naively thinking here is that we could either move the whole
> thing to just sg_lb_stats (AFAICT the fields of numa_stats are a subset of it),
> or if we really care about the stack we could tweak the ordering to ensure
> we can cast one into the other (not too enticed by that one though).
>
> > +/*
> > + * XXX borrowed from update_sg_lb_stats
> > + */
> > +static void update_numa_stats(struct task_numa_env *env,
> > +                           struct numa_stats *ns, int nid)
> > +{
> > +     int cpu;
> > +
> > +     memset(ns, 0, sizeof(*ns));
> > +     for_each_cpu(cpu, cpumask_of_node(nid)) {
> > +             struct rq *rq = cpu_rq(cpu);
> > +
> > +             ns->load += cpu_load(rq);
> > +             ns->util += cpu_util(cpu);
> > +             ns->nr_running += rq->cfs.h_nr_running;
> > +             ns->compute_capacity += capacity_of(cpu);
> > +     }
> > +
> > +     ns->weight = cpumask_weight(cpumask_of_node(nid));
> > +
> > +     ns->node_type = numa_classify(env->imbalance_pct, ns);
> > +}
> > +
> >  static void task_numa_assign(struct task_numa_env *env,
> >                            struct task_struct *p, long imp)
> >  {

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 14:54   ` Valentin Schneider
  2020-02-18 15:33     ` Vincent Guittot
@ 2020-02-18 15:38     ` Mel Gorman
  2020-02-18 16:50       ` Valentin Schneider
  2020-02-18 16:51       ` Vincent Guittot
  1 sibling, 2 replies; 49+ messages in thread
From: Mel Gorman @ 2020-02-18 15:38 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, linux-kernel, pauld, parth, hdanton

On Tue, Feb 18, 2020 at 02:54:14PM +0000, Valentin Schneider wrote:
> On 2/14/20 3:27 PM, Vincent Guittot wrote:
> > @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> >  	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
> >  }
> >  
> > -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> > -
> > -static unsigned long cpu_runnable_load(struct rq *rq)
> > -{
> > -	return cfs_rq_runnable_load_avg(&rq->cfs);
> > -}
> > +/*
> > + * 'numa_type' describes the node at the moment of load balancing.
> > + */
> > +enum numa_type {
> > +	/* The node has spare capacity that can be used to run more tasks.  */
> > +	node_has_spare = 0,
> > +	/*
> > +	 * The node is fully used and the tasks don't compete for more CPU
> > +	 * cycles. Nevertheless, some tasks might wait before running.
> > +	 */
> > +	node_fully_busy,
> > +	/*
> > +	 * The node is overloaded and can't provide expected CPU cycles to all
> > +	 * tasks.
> > +	 */
> > +	node_overloaded
> > +};
> 
> Could we reuse group_type instead? The definitions are the same modulo
> s/group/node/.
> 

I kept the naming because there is the remote possibility that NUMA
balancing will deviate in some fashion. Right now, it's harmless.

> >  
> >  /* Cached statistics for all CPUs within a node */
> >  struct numa_stats {
> >  	unsigned long load;
> > -
> > +	unsigned long util;
> >  	/* Total compute capacity of CPUs on a node */
> >  	unsigned long compute_capacity;
> > +	unsigned int nr_running;
> > +	unsigned int weight;
> > +	enum numa_type node_type;
> >  };
> >  
> > -/*
> > - * XXX borrowed from update_sg_lb_stats
> > - */
> > -static void update_numa_stats(struct numa_stats *ns, int nid)
> > -{
> > -	int cpu;
> > -
> > -	memset(ns, 0, sizeof(*ns));
> > -	for_each_cpu(cpu, cpumask_of_node(nid)) {
> > -		struct rq *rq = cpu_rq(cpu);
> > -
> > -		ns->load += cpu_runnable_load(rq);
> > -		ns->compute_capacity += capacity_of(cpu);
> > -	}
> > -
> > -}
> > -
> >  struct task_numa_env {
> >  	struct task_struct *p;
> >  
> > @@ -1521,6 +1518,47 @@ struct task_numa_env {
> >  	int best_cpu;
> >  };
> >  
> > +static unsigned long cpu_load(struct rq *rq);
> > +static unsigned long cpu_util(int cpu);
> > +
> > +static inline enum
> > +numa_type numa_classify(unsigned int imbalance_pct,
> > +			 struct numa_stats *ns)
> > +{
> > +	if ((ns->nr_running > ns->weight) &&
> > +	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> > +		return node_overloaded;
> > +
> > +	if ((ns->nr_running < ns->weight) ||
> > +	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> > +		return node_has_spare;
> > +
> > +	return node_fully_busy;
> > +}
> > +
> 
> As Mel pointed out, this is group_is_overloaded() and group_has_capacity().
> @Mel, you mentioned having a common helper, do you have that laying around?
> I haven't seen it in your reconciliation series.
> 

I didn't merge that part of the first version of my series. I was
waiting to see how the implementation for allowing a small degree of
imbalance looks like. If it's entirely confined in adjust_numa_balance
then I'll create the common helper at the same time. For now, I left the
possibility open that numa_classify would use something different than
group_is_overloaded or group_has_capacity even if I find that hard to
imagine at the moment.

> What I'm naively thinking here is that we could either move the whole
> thing to just sg_lb_stats (AFAICT the fields of numa_stats are a subset of it),
> or if we really care about the stack we could tweak the ordering to ensure
> we can cast one into the other (not too enticed by that one though).
> 

Yikes, no I'd rather not do that. Basically all I did before was create
a common helper like __lb_has_capacity that only took basic types as
parameters. group_has_capacity and numa_has_capacity were simple wrappers
that read the correct fields from their respective stats structures.
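
A minimal sketch of that shape, purely for illustration: __lb_has_capacity()
and the numa_has_capacity() wrapper are taken from the description above, not
from a posted patch, and the helper body only covers the nr_running/util part
of the existing group_has_capacity() check:

#include <stdbool.h>

struct sg_lb_stats_sketch {
	unsigned long group_capacity;
	unsigned long group_util;
	unsigned int  sum_nr_running;
	unsigned int  group_weight;
};

struct numa_stats_sketch {
	unsigned long compute_capacity;
	unsigned long util;
	unsigned int  nr_running;
	unsigned int  weight;
};

static bool __lb_has_capacity(unsigned int imbalance_pct, unsigned int nr_running,
			      unsigned int weight, unsigned long capacity,
			      unsigned long util)
{
	if (nr_running < weight)
		return true;

	return (capacity * 100) > (util * imbalance_pct);
}

/* the wrappers only read the right fields out of their stats structure */
static bool group_has_capacity(unsigned int imbalance_pct,
			       struct sg_lb_stats_sketch *sgs)
{
	return __lb_has_capacity(imbalance_pct, sgs->sum_nr_running,
				 sgs->group_weight, sgs->group_capacity,
				 sgs->group_util);
}

static bool numa_has_capacity(unsigned int imbalance_pct,
			      struct numa_stats_sketch *ns)
{
	return __lb_has_capacity(imbalance_pct, ns->nr_running, ns->weight,
				 ns->compute_capacity, ns->util);
}

int main(void)
{
	struct sg_lb_stats_sketch sgs = { .group_capacity = 2048, .group_util = 1024,
					  .sum_nr_running = 3, .group_weight = 2 };
	struct numa_stats_sketch ns = { .compute_capacity = 8192, .util = 2048,
					.nr_running = 4, .weight = 8 };

	return !(group_has_capacity(117, &sgs) && numa_has_capacity(117, &ns));
}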

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-18 15:28     ` Vincent Guittot
@ 2020-02-18 16:30       ` Valentin Schneider
  0 siblings, 0 replies; 49+ messages in thread
From: Valentin Schneider @ 2020-02-18 16:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

On 18/02/2020 15:28, Vincent Guittot wrote:
>>> @@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>>>                       cfs_rq->removed.load_avg);
>>>       SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
>>>                       cfs_rq->removed.util_avg);
>>> -     SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
>>> -                     cfs_rq->removed.runnable_sum);
>>
>> Shouldn't that have been part of patch 3?
> 
> No this was used to propagate load_avg
> 

Right, sorry about that.

>> <fugly here>
>> +DECLARE_UPDATE_TG_CFS_SIGNAL(util);
>> +DECLARE_UPDATE_TG_CFS_SIGNAL(runnable);
> 
> TBH, I prefer keeping the current version which is easier to read IMO
> 

I did call it an eldritch horror :-) I agree with you here, it's a shame we
don't really have better ways to factorize this (I don't think splitting the
inputs is really an alternative).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 15:38     ` Mel Gorman
@ 2020-02-18 16:50       ` Valentin Schneider
  2020-02-18 17:41         ` Mel Gorman
  2020-02-18 16:51       ` Vincent Guittot
  1 sibling, 1 reply; 49+ messages in thread
From: Valentin Schneider @ 2020-02-18 16:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, linux-kernel, pauld, parth, hdanton

On 18/02/2020 15:38, Mel Gorman wrote:
>>
>> Could we reuse group_type instead? The definitions are the same modulo
>> s/group/node/.
>>
> 
> I kept the naming because there is the remote possibility that NUMA
> balancing will deviate in some fashion. Right now, it's harmless.
> 

Since it's just a subset ATM I'd go for the reuse and change that later if
it turns out a split is required, but fair enough.

> I didn't merge that part of the first version of my series. I was
> waiting to see how the implementation for allowing a small degree of
> imbalance looks like. If it's entirely confined in adjust_numa_balance
                                                     ^^^^^^^^^^^^^^^^^^^
Apologies if that's a newbie question, but I'm not familiar with that one.
Would that be added in your reconciliation series? I've only had a brief
look at it so far (it's next on the chopping block).

> then I'll create the common helper at the same time. For now, I left the
> possibility open that numa_classify would use something different than
> group_is_overloaded or group_has_capacity even if I find that hard to
> imagine at the moment.
> 
>> What I'm naively thinking here is that we could either move the whole
>> thing to just sg_lb_stats (AFAICT the fields of numa_stats are a subset of it),
>> or if we really care about the stack we could tweak the ordering to ensure
>> we can cast one into the other (not too enticed by that one though).
>>
> 
> Yikes, no I'd rather not do that. Basically all I did before was create
> a common helper like __lb_has_capacity that only took basic types as
> parameters. group_has_capacity and numa_has_capacity were simple wrappers
> that read the correct fields from their respective stats structures.
> 

That's more sensible indeed. It'd definitely be nice to actually reconcile
the two balancers with these common helpers, though I guess it doesn't
*have* to happen with this very patch.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 15:38     ` Mel Gorman
  2020-02-18 16:50       ` Valentin Schneider
@ 2020-02-18 16:51       ` Vincent Guittot
  1 sibling, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-18 16:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Phil Auld, Parth Shah, Hillf Danton

On Tue, 18 Feb 2020 at 16:38, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Feb 18, 2020 at 02:54:14PM +0000, Valentin Schneider wrote:
> > On 2/14/20 3:27 PM, Vincent Guittot wrote:
> > > @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> > >            group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
> > >  }
> > >
> > > -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> > > -
> > > -static unsigned long cpu_runnable_load(struct rq *rq)
> > > -{
> > > -   return cfs_rq_runnable_load_avg(&rq->cfs);
> > > -}
> > > +/*
> > > + * 'numa_type' describes the node at the moment of load balancing.
> > > + */
> > > +enum numa_type {
> > > +   /* The node has spare capacity that can be used to run more tasks.  */
> > > +   node_has_spare = 0,
> > > +   /*
> > > +    * The node is fully used and the tasks don't compete for more CPU
> > > +    * cycles. Nevertheless, some tasks might wait before running.
> > > +    */
> > > +   node_fully_busy,
> > > +   /*
> > > +    * The node is overloaded and can't provide expected CPU cycles to all
> > > +    * tasks.
> > > +    */
> > > +   node_overloaded
> > > +};
> >
> > Could we reuse group_type instead? The definitions are the same modulo
> > s/group/node/.
> >
>
> I kept the naming because there is the remote possibility that NUMA
> balancing will deviate in some fashion. Right now, it's harmless.

+1
This 1st round mainly aims to align the NUMA way of working with the
load balancer, but we can imagine using some NUMA- or memory-specific
information to create new states.

>
> > >
> > >  /* Cached statistics for all CPUs within a node */
> > >  struct numa_stats {
> > >     unsigned long load;
> > > -
> > > +   unsigned long util;
> > >     /* Total compute capacity of CPUs on a node */
> > >     unsigned long compute_capacity;
> > > +   unsigned int nr_running;
> > > +   unsigned int weight;
> > > +   enum numa_type node_type;
> > >  };
> > >
> > > -/*
> > > - * XXX borrowed from update_sg_lb_stats
> > > - */
> > > -static void update_numa_stats(struct numa_stats *ns, int nid)
> > > -{
> > > -   int cpu;
> > > -
> > > -   memset(ns, 0, sizeof(*ns));
> > > -   for_each_cpu(cpu, cpumask_of_node(nid)) {
> > > -           struct rq *rq = cpu_rq(cpu);
> > > -
> > > -           ns->load += cpu_runnable_load(rq);
> > > -           ns->compute_capacity += capacity_of(cpu);
> > > -   }
> > > -
> > > -}
> > > -
> > >  struct task_numa_env {
> > >     struct task_struct *p;
> > >
> > > @@ -1521,6 +1518,47 @@ struct task_numa_env {
> > >     int best_cpu;
> > >  };
> > >
> > > +static unsigned long cpu_load(struct rq *rq);
> > > +static unsigned long cpu_util(int cpu);
> > > +
> > > +static inline enum
> > > +numa_type numa_classify(unsigned int imbalance_pct,
> > > +                    struct numa_stats *ns)
> > > +{
> > > +   if ((ns->nr_running > ns->weight) &&
> > > +       ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> > > +           return node_overloaded;
> > > +
> > > +   if ((ns->nr_running < ns->weight) ||
> > > +       ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> > > +           return node_has_spare;
> > > +
> > > +   return node_fully_busy;
> > > +}
> > > +
> >
> > As Mel pointed out, this is group_is_overloaded() and group_has_capacity().
> > @Mel, you mentioned having a common helper, do you have that laying around?
> > I haven't seen it in your reconciliation series.
> >
>
> I didn't merge that part of the first version of my series. I was
> waiting to see how the implementation for allowing a small degree of
> imbalance looks like. If it's entirely confined in adjust_numa_balance
> then I'll create the common helper at the same time. For now, I left the
> possibility open that numa_classify would use something different than
> group_is_overloaded or group_has_capacity even if I find that hard to
> imagine at the moment.
>
> > What I'm naively thinking here is that we could have either move the whole
> > thing to just sg_lb_stats (AFAICT the fields of numa_stats are a subset of it),
> > or if we really care about the stack we could tweak the ordering to ensure
> > we can cast one into the other (not too enticed by that one though).
> >
>
> Yikes, no I'd rather not do that. Basically all I did before was create
> a common helper like __lb_has_capacity that only took basic types as
> parameters. group_has_capacity and numa_has_capacity were simple wrappers
> that read the correct fields from their respective stats structures.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 16:50       ` Valentin Schneider
@ 2020-02-18 17:41         ` Mel Gorman
  2020-02-18 17:54           ` Valentin Schneider
  0 siblings, 1 reply; 49+ messages in thread
From: Mel Gorman @ 2020-02-18 17:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, linux-kernel, pauld, parth, hdanton

On Tue, Feb 18, 2020 at 04:50:48PM +0000, Valentin Schneider wrote:
> On 18/02/2020 15:38, Mel Gorman wrote:
> >>
> >> Could we reuse group_type instead? The definitions are the same modulo
> >> s/group/node/.
> >>
> > 
> > I kept the naming because there is the remote possibility that NUMA
> > balancing will deviate in some fashion. Right now, it's harmless.
> > 
> 
> Since it's just a subset ATM I'd go for the reuse and change that later if
> shown a split is required, but fair enough.
> 

I would feel that I was churning code for the sake of it.

> > I didn't merge that part of the first version of my series. I was
> > waiting to see how the implementation for allowing a small degree of
> > imbalance looks like. If it's entirely confined in adjust_numa_balance
>                                                      ^^^^^^^^^^^^^^^^^^^
> Apologies if that's a newbie question, but I'm not familiar with that one.
> Would that be added in your reconciliation series? I've only had a brief
> look at it yet (it's next on the chopping block).
> 

I should have written adjust_numa_imbalance but yes, it's part of the
reconciled series so that NUMA balancing and the load balancer use the
same helper.

> > Yikes, no I'd rather not do that. Basically all I did before was create
> > a common helper like __lb_has_capacity that only took basic types as
> > parameters. group_has_capacity and numa_has_capacity were simple wrappers
> > that read the correct fields from their respective stats structures.
> > 
> 
> That's more sensible indeed. It'd definitely be nice to actually reconcile
> the two balancers with these common helpers, though I guess it doesn't
> *have* to happen with this very patch.

Yep, it can happen at a later time.

As it stands, I think the reconciled series stands on its own even
though there are further improvements that could be built on top.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-18 17:41         ` Mel Gorman
@ 2020-02-18 17:54           ` Valentin Schneider
  0 siblings, 0 replies; 49+ messages in thread
From: Valentin Schneider @ 2020-02-18 17:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, linux-kernel, pauld, parth, hdanton

On 18/02/2020 17:41, Mel Gorman wrote:
>>> I didn't merge that part of the first version of my series. I was
>>> waiting to see how the implementation for allowing a small degree of
>>> imbalance looks like. If it's entirely confined in adjust_numa_balance
>>                                                      ^^^^^^^^^^^^^^^^^^^
>> Apologies if that's a newbie question, but I'm not familiar with that one.
>> Would that be added in your reconciliation series? I've only had a brief
>> look at it yet (it's next on the chopping block).
>>
> 
> I should have wrote adjust_numa_imbalance but yes, it's part of the
> reconciled series so that NUMA balancing and the load balancer use the
> same helper.
> 

Okay, thanks, then I guess I'll forget about any reconciliation for now
and whinge about it when I get to your series ;)


* Re: [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-14 15:27 ` [PATCH v2 4/5] sched/pelt: Add a new runnable average signal Vincent Guittot
  2020-02-18 14:54   ` Valentin Schneider
@ 2020-02-18 21:19   ` Valentin Schneider
  2020-02-19  9:02     ` Vincent Guittot
  2020-02-19  9:08     ` Mel Gorman
  2020-02-19 12:55   ` [PATCH v3 " Vincent Guittot
  2 siblings, 2 replies; 49+ messages in thread
From: Valentin Schneider @ 2020-02-18 21:19 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, linux-kernel
  Cc: pauld, parth, hdanton

On 14/02/2020 15:27, Vincent Guittot wrote:
> Now that runnable_load_avg has been removed, we can replace it by a new
> signal that will highlight the runnable pressure on a cfs_rq. This signal
> track the waiting time of tasks on rq and can help to better define the
> state of rqs.
> 
> At now, only util_avg is used to define the state of a rq:
>   A rq with more that around 80% of utilization and more than 1 tasks is
>   considered as overloaded.
> 
> But the util_avg signal of a rq can become temporaly low after that a task
> migrated onto another rq which can bias the classification of the rq.
> 
> When tasks compete for the same rq, their runnable average signal will be
> higher than util_avg as it will include the waiting time and we can use
> this signal to better classify cfs_rqs.
> 
> The new runnable_avg will track the runnable time of a task which simply
> adds the waiting time to the running time. The runnable _avg of cfs_rq
> will be the /Sum of se's runnable_avg and the runnable_avg of group entity
> will follow the one of the rq similarly to util_avg.
> 

I did a bit of playing around with tracepoints and it seems to be behaving
fine. For instance, if I spawn 12 always runnable tasks (sysbench --test=cpu)
on my Juno (6 CPUs), I get to a system-wide runnable value (\Sum cpu_runnable())
of about 12K. I've only eyeballed them, but migration of the signal values
seems fine too.
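
(Rough sanity check of that figure: each always-runnable task should
converge to a runnable_avg of about SCHED_CAPACITY_SCALE = 1024, since it
is runnable ~100% of the time, so \Sum cpu_runnable() ~= 12 * 1024 ~= 12K
even though only 6 of the tasks can actually run at any instant.)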

I have a slight worry that the rq-wide runnable signal might be too easy to
inflate, since we aggregate for *all* runnable tasks, and that may not play
well with your group_is_overloaded() change (despite having the imbalance_pct
on the "right" side).

In any case I'll need to convince myself of it with some messing around, and
this concerns patch 5 more than patch 4. So FWIW for this one:

Tested-by: Valentin Schneider <valentin.schneider@arm.com>

I also have one (two) more nit(s) below.

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> @@ -227,14 +231,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
>  	 * Step 1: accumulate *_sum since last_update_time. If we haven't
>  	 * crossed period boundaries, finish.
>  	 */
> -	if (!accumulate_sum(delta, sa, load, running))
> +	if (!accumulate_sum(delta, sa, load, runnable, running))
>  		return 0;
>  
>  	return 1;
>  }
>  
>  static __always_inline void
> -___update_load_avg(struct sched_avg *sa, unsigned long load)
> +___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
>  {
>  	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
>  
> @@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
>  	 * Step 2: update *_avg.
>  	 */
>  	sa->load_avg = div_u64(load * sa->load_sum, divider);
> +	sa->runnable_avg =	div _u64(runnable * sa->runnable_sum, divider);
                          ^^^^^^        ^^^^^^^^
                            a)             b)
a) That's a tab

b) The value being passed is always 1, do we really need to expose it as a
   parameter?


* Re: [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-18 21:19   ` Valentin Schneider
@ 2020-02-19  9:02     ` Vincent Guittot
  2020-02-19  9:08     ` Mel Gorman
  1 sibling, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-19  9:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

On Tue, 18 Feb 2020 at 22:19, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 14/02/2020 15:27, Vincent Guittot wrote:
> > Now that runnable_load_avg has been removed, we can replace it by a new
> > signal that will highlight the runnable pressure on a cfs_rq. This signal
> > track the waiting time of tasks on rq and can help to better define the
> > state of rqs.
> >
> > At now, only util_avg is used to define the state of a rq:
> >   A rq with more that around 80% of utilization and more than 1 tasks is
> >   considered as overloaded.
> >
> > But the util_avg signal of a rq can become temporaly low after that a task
> > migrated onto another rq which can bias the classification of the rq.
> >
> > When tasks compete for the same rq, their runnable average signal will be
> > higher than util_avg as it will include the waiting time and we can use
> > this signal to better classify cfs_rqs.
> >
> > The new runnable_avg will track the runnable time of a task which simply
> > adds the waiting time to the running time. The runnable _avg of cfs_rq
> > will be the /Sum of se's runnable_avg and the runnable_avg of group entity
> > will follow the one of the rq similarly to util_avg.
> >
>
> I did a bit of playing around with tracepoints and it seems to be behaving
> fine. For instance, if I spawn 12 always runnable tasks (sysbench --test=cpu)
> on my Juno (6 CPUs), I get to a system-wide runnable value (\Sum cpu_runnable())
> of about 12K. I've only eyeballed them, but migration of the signal values
> seem fine too.
>
> I have a slight worry that the rq-wide runnable signal might be too easy to
> inflate, since we aggregate for *all* runnable tasks, and that may not play
> well with your group_is_overloaded() change (despite having the imbalance_pct
> on the "right" side).
>
> In any case I'll need to convince myself of it with some messing around, and
> this concerns patch 5 more than patch 4. So FWIW for this one:
>
> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
>
> I also have one (two) more nit(s) below.
>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > @@ -227,14 +231,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
> >        * Step 1: accumulate *_sum since last_update_time. If we haven't
> >        * crossed period boundaries, finish.
> >        */
> > -     if (!accumulate_sum(delta, sa, load, running))
> > +     if (!accumulate_sum(delta, sa, load, runnable, running))
> >               return 0;
> >
> >       return 1;
> >  }
> >
> >  static __always_inline void
> > -___update_load_avg(struct sched_avg *sa, unsigned long load)
> > +___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
> >  {
> >       u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
> >
> > @@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
> >        * Step 2: update *_avg.
> >        */
> >       sa->load_avg = div_u64(load * sa->load_sum, divider);
> > +     sa->runnable_avg =      div _u64(runnable * sa->runnable_sum, divider);
>                           ^^^^^^        ^^^^^^^^
>                             a)             b)
> a) That's a tab
>
> b) The value being passed is always 1, do we really need it to expose it as a
>    parameter?

In fact, I haven't been able to convince myself whether it was better to
add the SCHED_CAPACITY_SCALE range in the _sum or only in the _avg.
That's the reason this parameter is still there.
On one side we do a shift at every PELT update but the
attach/detach/propagate paths stay quite straightforward. On the other side
the shift is done only during attach/detach/propagate, but that complicates
things. Having it in the _sum doesn't seem to be a concern, so I will keep
it there and remove the parameter.
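
Roughly, the two alternatives being weighed here look like this
(illustrative sketch, not the posted code; (a) is what ends up in v3):

        /* (a) apply the SCHED_CAPACITY_SCALE range in the _sum:
         *     a shift at every PELT update, but attach/detach/propagate
         *     can then handle runnable like util
         */
        sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
        ...
        sa->runnable_avg = div_u64(sa->runnable_sum, divider);

        /* (b) keep the _sum unscaled and apply the range only when
         *     computing the _avg: no shift per update, but the
         *     attach/detach/propagate paths must then deal with the
         *     scaling themselves
         */
        sa->runnable_sum += runnable * contrib;
        ...
        sa->runnable_avg = div_u64(sa->runnable_sum << SCHED_CAPACITY_SHIFT, divider);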


* Re: [PATCH v2 4/5] sched/pelt: Add a new runnable average signal
  2020-02-18 21:19   ` Valentin Schneider
  2020-02-19  9:02     ` Vincent Guittot
@ 2020-02-19  9:08     ` Mel Gorman
  1 sibling, 0 replies; 49+ messages in thread
From: Mel Gorman @ 2020-02-19  9:08 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, linux-kernel, pauld, parth, hdanton

On Tue, Feb 18, 2020 at 09:19:16PM +0000, Valentin Schneider wrote:
> On 14/02/2020 15:27, Vincent Guittot wrote:
> > Now that runnable_load_avg has been removed, we can replace it by a new
> > signal that will highlight the runnable pressure on a cfs_rq. This signal
> > track the waiting time of tasks on rq and can help to better define the
> > state of rqs.
> > 
> > At now, only util_avg is used to define the state of a rq:
> >   A rq with more that around 80% of utilization and more than 1 tasks is
> >   considered as overloaded.
> > 
> > But the util_avg signal of a rq can become temporaly low after that a task
> > migrated onto another rq which can bias the classification of the rq.
> > 
> > When tasks compete for the same rq, their runnable average signal will be
> > higher than util_avg as it will include the waiting time and we can use
> > this signal to better classify cfs_rqs.
> > 
> > The new runnable_avg will track the runnable time of a task which simply
> > adds the waiting time to the running time. The runnable _avg of cfs_rq
> > will be the /Sum of se's runnable_avg and the runnable_avg of group entity
> > will follow the one of the rq similarly to util_avg.
> > 
> 
> I did a bit of playing around with tracepoints and it seems to be behaving
> fine. For instance, if I spawn 12 always runnable tasks (sysbench --test=cpu)
> on my Juno (6 CPUs), I get to a system-wide runnable value (\Sum cpu_runnable())
> of about 12K. I've only eyeballed them, but migration of the signal values
> seem fine too.
> 
> I have a slight worry that the rq-wide runnable signal might be too easy to
> inflate, since we aggregate for *all* runnable tasks, and that may not play
> well with your group_is_overloaded() change (despite having the imbalance_pct
> on the "right" side).
> 
> In any case I'll need to convince myself of it with some messing around, and
> this concerns patch 5 more than patch 4. So FWIW for this one:
> 
> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
> 
> I also have one (two) more nit(s) below.
> 

Thanks.

> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > @@ -227,14 +231,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
> >  	 * Step 1: accumulate *_sum since last_update_time. If we haven't
> >  	 * crossed period boundaries, finish.
> >  	 */
> > -	if (!accumulate_sum(delta, sa, load, running))
> > +	if (!accumulate_sum(delta, sa, load, runnable, running))
> >  		return 0;
> >  
> >  	return 1;
> >  }
> >  
> >  static __always_inline void
> > -___update_load_avg(struct sched_avg *sa, unsigned long load)
> > +___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
> >  {
> >  	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
> >  
> > @@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
> >  	 * Step 2: update *_avg.
> >  	 */
> >  	sa->load_avg = div_u64(load * sa->load_sum, divider);
> > +	sa->runnable_avg =	div _u64(runnable * sa->runnable_sum, divider);
>                           ^^^^^^        ^^^^^^^^
>                             a)             b)
> a) That's a tab
> 

Fixed and I'll post a v4 of my own series with Vincent's included.

> b) The value being passed is always 1, do we really need it to expose it as a
>    parameter?

This does appear to be an oversight but I'm not familiar enough with
pelt to be sure.

___update_load_avg() is called when the sum of the load has changed because
a pelt period has passed; by that point it has lost sight of, and does not
care, whether an individual sched entity is runnable or not. The parameter
was added by this patch but I cannot find any useful meaning for it.

Vincent, what was your thinking here? Should the parameter be removed?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-18 14:15       ` Vincent Guittot
@ 2020-02-19 11:07         ` Dietmar Eggemann
  2020-02-19 16:26           ` Vincent Guittot
  0 siblings, 1 reply; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-19 11:07 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
	linux-kernel, Phil Auld, Parth Shah, Valentin Schneider,
	Hillf Danton

On 18/02/2020 15:15, Vincent Guittot wrote:
> On Tue, 18 Feb 2020 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Tue, Feb 18, 2020 at 01:37:37PM +0100, Dietmar Eggemann wrote:
>>> On 14/02/2020 16:27, Vincent Guittot wrote:
>>>> The walk through the cgroup hierarchy during the enqueue/dequeue of a task
>>>> is split in 2 distinct parts for throttled cfs_rq without any added value
>>>> but making code less readable.
>>>>
>>>> Change the code ordering such that everything related to a cfs_rq
>>>> (throttled or not) will be done in the same loop.
>>>>
>>>> In addition, the same steps ordering is used when updating a cfs_rq:
>>>> - update_load_avg
>>>> - update_cfs_group
>>>> - update *h_nr_running
>>>
>>> Is this code change really necessary? You pay with two extra goto's. We
>>> still have the two for_each_sched_entity(se)'s because of 'if
>>> (se->on_rq); break;'.
>>
>> IIRC he relies on the presented ordering in patch #5 -- adding the
>> running_avg metric.
> 
> Yes, that's the main reason, updating load_avg before h_nr_running

My hunch is that you are referring to the new function:

static inline void se_update_runnable(struct sched_entity *se)
{
        if (!entity_is_task(se))
                se->runnable_weight = se->my_q->h_nr_running;
}

I don't see the dependency on the 'update_load_avg -> h_nr_running'
order since it operates on se->my_q, not cfs_rq = cfs_rq_of(se), i.e.
se->cfs_rq.

What am I missing here?


* [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-14 15:27 ` [PATCH v2 4/5] sched/pelt: Add a new runnable average signal Vincent Guittot
  2020-02-18 14:54   ` Valentin Schneider
  2020-02-18 21:19   ` Valentin Schneider
@ 2020-02-19 12:55   ` Vincent Guittot
  2020-02-19 14:02     ` Mel Gorman
                       ` (3 more replies)
  2 siblings, 4 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-19 12:55 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton, Vincent Guittot

Now that runnable_load_avg has been removed, we can replace it by a new
signal that will highlight the runnable pressure on a cfs_rq. This signal
tracks the waiting time of tasks on a rq and can help to better define the
state of rqs.

Currently, only util_avg is used to define the state of a rq:
  A rq with more than around 80% of utilization and more than 1 task is
  considered as overloaded.

But the util_avg signal of a rq can become temporarily low after a task has
migrated onto another rq, which can bias the classification of the rq.

When tasks compete for the same rq, their runnable average signal will be
higher than util_avg as it will include the waiting time and we can use
this signal to better classify cfs_rqs.

The new runnable_avg will track the runnable time of a task, which simply
adds the waiting time to the running time. The runnable_avg of a cfs_rq
will be the /Sum of the se's runnable_avg, and the runnable_avg of a group
entity will follow the one of its rq similarly to util_avg.

Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Changes for v3:
- Fixed tab typo
- Removed runnable parameter of ___update_load_avg()

 include/linux/sched.h | 26 ++++++++------
 kernel/sched/debug.c  |  9 +++--
 kernel/sched/fair.c   | 79 ++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/pelt.c   | 37 +++++++++++++++-----
 kernel/sched/sched.h  | 22 +++++++++++-
 5 files changed, 143 insertions(+), 30 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f1cab3df8386..1c37f38d66d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -356,28 +356,30 @@ struct util_est {
 } __attribute__((__aligned__(sizeof(u64))));
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series
+ * The load/runnable/util_avg accumulates an infinite geometric series
  * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
  *   load_avg = runnable% * scale_load_down(load)
  *
- * where runnable% is the time ratio that a sched_entity is runnable.
- * For cfs_rq, it is the aggregated load_avg of all runnable and
- * blocked sched_entities.
+ * [runnable_avg definition]
+ *
+ *   runnable_avg = runnable% * SCHED_CAPACITY_SCALE
  *
  * [util_avg definition]
  *
  *   util_avg = running% * SCHED_CAPACITY_SCALE
  *
- * where running% is the time ratio that a sched_entity is running on
- * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
- * and blocked sched_entities.
+ * where runnable% is the time ratio that a sched_entity is runnable and
+ * running% the time ratio that a sched_entity is running.
+ *
+ * For cfs_rq, they are the aggregated values of all runnable and blocked
+ * sched_entities.
  *
- * load_avg and util_avg don't direcly factor frequency scaling and CPU
- * capacity scaling. The scaling is done through the rq_clock_pelt that
- * is used for computing those signals (see update_rq_clock_pelt())
+ * The load/runnable/util_avg doesn't direcly factor frequency scaling and CPU
+ * capacity scaling. The scaling is done through the rq_clock_pelt that is used
+ * for computing those signals (see update_rq_clock_pelt())
  *
  * N.B., the above ratios (runnable% and running%) themselves are in the
  * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
@@ -401,9 +403,11 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
+	u64				runnable_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
+	unsigned long			runnable_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -467,6 +471,8 @@ struct sched_entity {
 	struct cfs_rq			*cfs_rq;
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq			*my_q;
+	/* cached value of my_q->h_nr_running */
+	unsigned long			runnable_weight;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cfecaad387c0..8331bc04aea2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -405,6 +405,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
+	P(se->avg.runnable_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,6 +525,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_avg",
+			cfs_rq->avg.runnable_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-			cfs_rq->removed.runnable_sum);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_avg",
+			cfs_rq->removed.runnable_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
@@ -944,8 +947,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
+	P(se.avg.runnable_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
+	P(se.avg.runnable_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a23110ad96e8..3b6a90d6315c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,8 +740,10 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * Group entities are initialized with zero load to reflect the fact that
 	 * nothing has been attached to the task group yet.
 	 */
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		sa->runnable_avg = SCHED_CAPACITY_SCALE;
 		sa->load_avg = scale_load_down(se->load.weight);
+	}
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -3194,9 +3196,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial  * and simply copies the
- * running sum over (but still wrong, because the group entity and group rq do
- * not have their PELT windows aligned).
+ * Per the above update_tg_cfs_util() and update_tg_cfs_runnable() are trivial
+ * and simply copies the running/runnable sum over (but still wrong, because
+ * the group entity and group rq do not have their PELT windows aligned).
  *
  * However, update_tg_cfs_load() is more complex. So we have:
  *
@@ -3278,6 +3280,32 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
+static inline void
+update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+{
+	long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
+	/* Set new sched_entity's runnable */
+	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
+	se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq runnable */
+	add_positive(&cfs_rq->avg.runnable_avg, delta);
+	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
+}
+
 static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
@@ -3358,6 +3386,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
 	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
@@ -3428,7 +3457,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
+	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3439,7 +3468,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
-		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
+		swap(cfs_rq->removed.runnable_avg, removed_runnable);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3451,7 +3480,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * divider);
 
-		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
+		r = removed_runnable;
+		sub_positive(&sa->runnable_avg, r);
+		sub_positive(&sa->runnable_sum, r * divider);
+
+		/*
+		 * removed_runnable is the unweighted version of removed_load so we
+		 * can use it to estimate removed_load_sum.
+		 */
+		add_tg_cfs_propagate(cfs_rq,
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
 
 		decayed = 1;
 	}
@@ -3497,6 +3535,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 */
 	se->avg.util_sum = se->avg.util_avg * divider;
 
+	se->avg.runnable_sum = se->avg.runnable_avg * divider;
+
 	se->avg.load_sum = divider;
 	if (se_weight(se)) {
 		se->avg.load_sum =
@@ -3506,6 +3546,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	cfs_rq->avg.runnable_avg += se->avg.runnable_avg;
+	cfs_rq->avg.runnable_sum += se->avg.runnable_sum;
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
@@ -3527,6 +3569,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	dequeue_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
+	sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
@@ -3633,10 +3677,15 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
-	cfs_rq->removed.runnable_sum	+= se->avg.load_sum; /* == runnable_sum */
+	cfs_rq->removed.runnable_avg	+= se->avg.runnable_avg;
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
+static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->avg.runnable_avg;
+}
+
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->avg.load_avg;
@@ -3963,11 +4012,13 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Add its load to cfs_rq->runnable_avg
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
+	se_update_runnable(se);
 	update_cfs_group(se);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -4045,11 +4096,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Subtract its load from the cfs_rq->runnable_avg.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
+	se_update_runnable(se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -5220,6 +5273,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running++;
@@ -5317,6 +5371,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running--;
@@ -5389,6 +5444,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
+static unsigned long cpu_runnable(struct rq *rq)
+{
+	return cfs_rq_runnable_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -7520,6 +7580,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
+	if (cfs_rq->avg.runnable_sum)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 3eb0ed333dcb..2cc88d9e3b38 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, int running)
+	       unsigned long load, unsigned long runnable, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -121,6 +121,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
+		sa->runnable_sum =
+			decay_load(sa->runnable_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -146,6 +148,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -182,7 +186,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, int running)
+		  unsigned long load, unsigned long runnable, int running)
 {
 	u64 delta;
 
@@ -218,7 +222,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Also see the comment in accumulate_sum().
 	 */
 	if (!load)
-		running = 0;
+		runnable = running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -227,7 +231,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, running))
+	if (!accumulate_sum(delta, sa, load, runnable, running))
 		return 0;
 
 	return 1;
@@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_avg = div_u64(sa->runnable_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -250,9 +255,14 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
  *
  *   task:
  *     se_weight()   = se->load.weight
+ *     se_runnable() = !!on_rq
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = grq->h_nr_running
+ *
+ *   runnable_sum = se_runnable() * runnable = grq->runnable_sum
+ *   runnable_avg = runnable_sum
  *
  *   load_sum := runnable
  *   load_avg = se_weight(se) * load_sum
@@ -261,13 +271,16 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
  *
  * cfq_rq:
  *
+ *   runnable_sum = \Sum se->avg.runnable_sum
+ *   runnable_avg = \Sum se->avg.runnable_avg
+ *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0)) {
+	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
 		___update_load_avg(&se->avg, se_weight(se));
 		trace_pelt_se_tp(se);
 		return 1;
@@ -278,7 +291,8 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
+				cfs_rq->curr == se)) {
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
@@ -293,6 +307,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
+				cfs_rq->h_nr_running,
 				cfs_rq->curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
@@ -310,13 +325,14 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
+				running,
 				running,
 				running)) {
 
@@ -335,13 +351,14 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
+				running,
 				running,
 				running)) {
 
@@ -361,7 +378,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_load_avg are not supported and meaningless.
  *
  */
 
@@ -389,9 +406,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 * rq->clock += delta with delta >= running
 	 */
 	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
+				0,
 				0,
 				0);
 	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
+				1,
 				1,
 				1);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8760b656a349..5a88d7665f63 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -527,7 +527,7 @@ struct cfs_rq {
 		int		nr;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
-		unsigned long	runnable_sum;
+		unsigned long	runnable_avg;
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -688,9 +688,29 @@ struct dl_rq {
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
 
+static inline void se_update_runnable(struct sched_entity *se)
+{
+	if (!entity_is_task(se))
+		se->runnable_weight = se->my_q->h_nr_running;
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return !!se->on_rq;
+	else
+		return se->runnable_weight;
+}
+
 #else
 #define entity_is_task(se)	1
 
+static inline void se_update_runnable(struct sched_entity *se) {}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	return !!se->on_rq;
+}
 #endif
 
 #ifdef CONFIG_SMP
-- 
2.17.1



* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-19 12:55   ` [PATCH v3 " Vincent Guittot
@ 2020-02-19 14:02     ` Mel Gorman
  2020-02-19 20:10     ` Valentin Schneider
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: Mel Gorman @ 2020-02-19 14:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel, pauld, parth, valentin.schneider, hdanton

On Wed, Feb 19, 2020 at 01:55:13PM +0100, Vincent Guittot wrote:
> Now that runnable_load_avg has been removed, we can replace it by a new
> signal that will highlight the runnable pressure on a cfs_rq. This signal
> track the waiting time of tasks on rq and can help to better define the
> state of rqs.
> 
> At now, only util_avg is used to define the state of a rq:
>   A rq with more that around 80% of utilization and more than 1 tasks is
>   considered as overloaded.
> 
> But the util_avg signal of a rq can become temporaly low after that a task
> migrated onto another rq which can bias the classification of the rq.
> 
> When tasks compete for the same rq, their runnable average signal will be
> higher than util_avg as it will include the waiting time and we can use
> this signal to better classify cfs_rqs.
> 
> The new runnable_avg will track the runnable time of a task which simply
> adds the waiting time to the running time. The runnable _avg of cfs_rq
> will be the /Sum of se's runnable_avg and the runnable_avg of group entity
> will follow the one of the rq similarly to util_avg.
> 
> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks.

Picked up and included in "Reconcile NUMA balancing decisions with the
load balancer v4". The only change I made to your patches was move the
reintroduction of cpu_runnable to the patch that requires it to avoid a
build warning.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-19 11:07         ` Dietmar Eggemann
@ 2020-02-19 16:26           ` Vincent Guittot
  2020-02-20 13:38             ` Dietmar Eggemann
  0 siblings, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-19 16:26 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider, Hillf Danton

On Wed, 19 Feb 2020 at 12:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 18/02/2020 15:15, Vincent Guittot wrote:
> > On Tue, 18 Feb 2020 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> On Tue, Feb 18, 2020 at 01:37:37PM +0100, Dietmar Eggemann wrote:
> >>> On 14/02/2020 16:27, Vincent Guittot wrote:
> >>>> The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> >>>> is split in 2 distinct parts for throttled cfs_rq without any added value
> >>>> but making code less readable.
> >>>>
> >>>> Change the code ordering such that everything related to a cfs_rq
> >>>> (throttled or not) will be done in the same loop.
> >>>>
> >>>> In addition, the same steps ordering is used when updating a cfs_rq:
> >>>> - update_load_avg
> >>>> - update_cfs_group
> >>>> - update *h_nr_running
> >>>
> >>> Is this code change really necessary? You pay with two extra goto's. We
> >>> still have the two for_each_sched_entity(se)'s because of 'if
> >>> (se->on_rq); break;'.
> >>
> >> IIRC he relies on the presented ordering in patch #5 -- adding the
> >> running_avg metric.
> >
> > Yes, that's the main reason, updating load_avg before h_nr_running
>
> My hunch is you refer to the new function:
>
> static inline void se_update_runnable(struct sched_entity *se)
> {
>         if (!entity_is_task(se))
>                 se->runnable_weight = se->my_q->h_nr_running;
> }
>
> I don't see the dependency to the 'update_load_avg -> h_nr_running'
> order since it operates on se->my_q, not cfs_rq = cfs_rq_of(se), i.e.
> se->cfs_rq.
>
> What do I miss here?

update_load_avg() updates both the se and the cfs_rq, so if you update
cfs_rq->h_nr_running before calling update_load_avg(), like in the 2nd
for_each_sched_entity loop, you will update the cfs_rq runnable_avg for the
past time slot with the new h_nr_running value instead of the previous
value.
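
Put differently, the ordering the patches end up with in the enqueue loop
is roughly (simplified sketch, throttling and cgroup details omitted):

for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);

        /* close the elapsed PELT window first, while
         * cfs_rq->h_nr_running still holds the previous value ...
         */
        update_load_avg(cfs_rq, se, UPDATE_TG);
        se_update_runnable(se);
        update_cfs_group(se);

        /* ... and only then account the newly enqueued task */
        cfs_rq->h_nr_running++;
}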


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-19 12:55   ` [PATCH v3 " Vincent Guittot
  2020-02-19 14:02     ` Mel Gorman
@ 2020-02-19 20:10     ` Valentin Schneider
  2020-02-20 14:36       ` Vincent Guittot
  2020-02-20 15:04     ` Dietmar Eggemann
  2020-02-21  9:44     ` Dietmar Eggemann
  3 siblings, 1 reply; 49+ messages in thread
From: Valentin Schneider @ 2020-02-19 20:10 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, linux-kernel
  Cc: pauld, parth, hdanton

On 19/02/2020 12:55, Vincent Guittot wrote:
> @@ -740,8 +740,10 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * Group entities are initialized with zero load to reflect the fact that
>  	 * nothing has been attached to the task group yet.
>  	 */
> -	if (entity_is_task(se))
> +	if (entity_is_task(se)) {
> +		sa->runnable_avg = SCHED_CAPACITY_SCALE;

So this is a comment that's more related to patch 5, but the relevant bit is
here. I'm thinking this initialization might be too aggressive wrt load
balance. This will also give different results between symmetric vs
asymmetric topologies - a single fork() will make a LITTLE CPU group (at the
base domain level) overloaded straight away. That won't happen for bigs or on
symmetric topologies because

  // group_is_overloaded()
  sgs->group_capacity * imbalance_pct) < (sgs->group_runnable * 100)

will be false - it would take more than one task for that to happen (due to
the imbalance_pct).
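
To put rough numbers on the capacity check above (illustrative capacity
values, and assuming the nr_running > group_weight condition is also met):
with imbalance_pct = 117, a 1024-capacity CPU only trips it once
group_runnable exceeds 1024 * 117 / 100 ~= 1198, so a single freshly forked
task at runnable_avg = 1024 is not enough. A LITTLE CPU of, say, ~440
capacity has a threshold of only about 440 * 117 / 100 ~= 515, which a
single fresh task already exceeds.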

So maybe what we want here instead is to mimic what we have for utilization,
i.e. initialize to half the spare capacity of the local CPU. IOW, 
conceptually something like this:

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99249a2484b4..762717092235 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,10 +740,8 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * Group entities are initialized with zero load to reflect the fact that
 	 * nothing has been attached to the task group yet.
 	 */
-	if (entity_is_task(se)) {
-		sa->runnable_avg = SCHED_CAPACITY_SCALE;
+	if (entity_is_task(se))
 		sa->load_avg = scale_load_down(se->load.weight);
-	}
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -796,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
 		}
 	}
 
+	sa->runnable_avg = sa->util_avg;
+
 	if (p->sched_class != &fair_sched_class) {
 		/*
 		 * For !fair tasks do:
---

The current approach has the merit of giving some sort of hint to the LB
that there is a bunch of new tasks that it could spread out, but I fear it
is too aggressive.

>  		sa->load_avg = scale_load_down(se->load.weight);
> +	}
>  
>  	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }


* Re: [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-19 16:26           ` Vincent Guittot
@ 2020-02-20 13:38             ` Dietmar Eggemann
       [not found]               ` <20200222152541.GA11669@geo.homenetwork>
  0 siblings, 1 reply; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-20 13:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider, Hillf Danton

On 19/02/2020 17:26, Vincent Guittot wrote:
> On Wed, 19 Feb 2020 at 12:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 18/02/2020 15:15, Vincent Guittot wrote:
>>> On Tue, 18 Feb 2020 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>> On Tue, Feb 18, 2020 at 01:37:37PM +0100, Dietmar Eggemann wrote:
>>>>> On 14/02/2020 16:27, Vincent Guittot wrote:
>>>>>> The walk through the cgroup hierarchy during the enqueue/dequeue of a task
>>>>>> is split in 2 distinct parts for throttled cfs_rq without any added value
>>>>>> but making code less readable.
>>>>>>
>>>>>> Change the code ordering such that everything related to a cfs_rq
>>>>>> (throttled or not) will be done in the same loop.
>>>>>>
>>>>>> In addition, the same steps ordering is used when updating a cfs_rq:
>>>>>> - update_load_avg
>>>>>> - update_cfs_group
>>>>>> - update *h_nr_running
>>>>>
>>>>> Is this code change really necessary? You pay with two extra goto's. We
>>>>> still have the two for_each_sched_entity(se)'s because of 'if
>>>>> (se->on_rq); break;'.
>>>>
>>>> IIRC he relies on the presented ordering in patch #5 -- adding the
>>>> running_avg metric.
>>>
>>> Yes, that's the main reason, updating load_avg before h_nr_running
>>
>> My hunch is you refer to the new function:
>>
>> static inline void se_update_runnable(struct sched_entity *se)
>> {
>>         if (!entity_is_task(se))
>>                 se->runnable_weight = se->my_q->h_nr_running;
>> }
>>
>> I don't see the dependency to the 'update_load_avg -> h_nr_running'
>> order since it operates on se->my_q, not cfs_rq = cfs_rq_of(se), i.e.
>> se->cfs_rq.
>>
>> What do I miss here?
> 
> update_load_avg() updates both se and cfs_rq so if you update
> cfs_rq->h_nr_running before calling update_load_avg() like in the 2nd
> for_each_sched_entity, you will update cfs_rq runnable_avg for the
> past time slot with the new h_nr_running value instead of the previous
> value.

Ah, now I see:

update_load_avg()
  update_cfs_rq_load_avg()
    __update_load_avg_cfs_rq()
       ___update_load_sum(..., cfs_rq->h_nr_running, ...)

                               ^^^^^^^^^^^^^^^^^^^^

Not really obvious IMHO, since the code is introduced only in 4/5.

Could you add a comment to this patch header?

I see you mentioned this dependency already in v1 discussion

https://lore.kernel.org/r/CAKfTPtAM=kgF7Fz-JKFY+s_k5KFirs-8Bub3s1Eqtq7P0NMa0w@mail.gmail.com

"... But the following patches make PELT using h_nr_running ...".

IMHO it would be helpful to have this explanation in the 1/5 patch
header so people stop wondering why this is necessary.


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-19 20:10     ` Valentin Schneider
@ 2020-02-20 14:36       ` Vincent Guittot
  2020-02-20 16:11         ` Valentin Schneider
  0 siblings, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-20 14:36 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

On Wed, 19 Feb 2020 at 21:10, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 19/02/2020 12:55, Vincent Guittot wrote:
> > @@ -740,8 +740,10 @@ void init_entity_runnable_average(struct sched_entity *se)
> >        * Group entities are initialized with zero load to reflect the fact that
> >        * nothing has been attached to the task group yet.
> >        */
> > -     if (entity_is_task(se))
> > +     if (entity_is_task(se)) {
> > +             sa->runnable_avg = SCHED_CAPACITY_SCALE;
>
> So this is a comment that's more related to patch 5, but the relevant bit is
> here. I'm thinking this initialization might be too aggressive wrt load
> balance. This will also give different results between symmetric vs
> asymmetric topologies - a single fork() will make a LITTLE CPU group (at the
> base domain level) overloaded straight away. That won't happen for bigs or on
> symmetric topologies because
>
>   // group_is_overloaded()
>   sgs->group_capacity * imbalance_pct) < (sgs->group_runnable * 100)
>
> will be false - it would take more than one task for that to happen (due to
> the imbalance_pct).
>
> So maybe what we want here instead is to mimic what he have for utilization,
> i.e. initialize to half the spare capacity of the local CPU. IOW,
> conceptually something like this:
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99249a2484b4..762717092235 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -740,10 +740,8 @@ void init_entity_runnable_average(struct sched_entity *se)
>          * Group entities are initialized with zero load to reflect the fact that
>          * nothing has been attached to the task group yet.
>          */
> -       if (entity_is_task(se)) {
> -               sa->runnable_avg = SCHED_CAPACITY_SCALE;
> +       if (entity_is_task(se))
>                 sa->load_avg = scale_load_down(se->load.weight);
> -       }
>
>         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }
> @@ -796,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
>                 }
>         }
>
> +       sa->runnable_avg = sa->util_avg;
> +
>         if (p->sched_class != &fair_sched_class) {
>                 /*
>                  * For !fair tasks do:
> ---
>
> The current approach has the merit of giving some sort of hint to the LB
> that there is a bunch of new tasks that it could spread out, but I fear it
> is too aggressive.

I agree that setting it by default to SCHED_CAPACITY_SCALE is too much
for a little core.
The problem for little cores can be fixed by using the cpu capacity instead:

@@ -796,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
                }
        }

+       sa->runnable_avg = cpu_scale;
+
        if (p->sched_class != &fair_sched_class) {
                /*
                 * For !fair tasks do:
>
> >               sa->load_avg = scale_load_down(se->load.weight);
> > +     }
> >
> >       /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
> >  }


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-19 12:55   ` [PATCH v3 " Vincent Guittot
  2020-02-19 14:02     ` Mel Gorman
  2020-02-19 20:10     ` Valentin Schneider
@ 2020-02-20 15:04     ` Dietmar Eggemann
  2020-02-21  9:44     ` Dietmar Eggemann
  3 siblings, 0 replies; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-20 15:04 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton

On 19/02/2020 13:55, Vincent Guittot wrote:
> Now that runnable_load_avg has been removed, we can replace it by a new
> signal that will highlight the runnable pressure on a cfs_rq. This signal
> track the waiting time of tasks on rq and can help to better define the
> state of rqs.
> 
> At now, only util_avg is used to define the state of a rq:
>   A rq with more that around 80% of utilization and more than 1 tasks is
>   considered as overloaded.

Don't we make a distinction between overutilized and overloaded?

update_sg_lb_stats(), SG_OVERUTILIZED and SG_OVERLOAD
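
For readers without the code at hand, the distinction is roughly the
following (heavily simplified sketch of the per-CPU part of
update_sg_lb_stats(), not the exact kernel code; the ~80% figure assumes
the usual ~25% capacity margin):

	if (nr_running > 1)
		*sg_status |= SG_OVERLOAD;	/* more than one runnable task on this CPU */

	if (cpu_util(i) * 1280 >= capacity_of(i) * 1024)
		*sg_status |= SG_OVERUTILIZED;	/* utilization above ~80% of capacity */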

[...]

> @@ -310,13 +325,14 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
>   *   util_sum = cpu_scale * load_sum
>   *   runnable_sum = util_sum
>   *
> - *   load_avg is not supported and meaningless.
> + *   load_avg and runnable_load_avg are not supported and meaningless.

Nit pick:

s/runnable_load_avg/runnable_avg

[...]

> + *   load_avg and runnable_load_avg are not supported and meaningless.
>   *
>   */
>  
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)

[...]

> - *   load_avg is not supported and meaningless.
> + *   load_avg and runnable_load_avg are not supported and meaningless.
>   *
>   */
>  
> @@ -389,9 +406,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)

[...]


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-20 14:36       ` Vincent Guittot
@ 2020-02-20 16:11         ` Valentin Schneider
  2020-02-21  8:56           ` Vincent Guittot
  2020-02-21  9:04           ` Mel Gorman
  0 siblings, 2 replies; 49+ messages in thread
From: Valentin Schneider @ 2020-02-20 16:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

On 20/02/2020 14:36, Vincent Guittot wrote:
> I agree that setting by default to SCHED_CAPACITY_SCALE is too much
> for little core.
> The problem for little core can be fixed by using the cpu capacity instead
> 

So that's indeed better for big.LITTLE & co. Any reason however for not
aligning with the initialization of util_avg ?

With the default MC imbalance_pct (117), it takes 875 utilization to make
a single CPU group (with 1024 capacity) overloaded (group_is_overloaded()).
For a completely idle CPU, that means forking at least 3 tasks (512 + 256 +
128 util_avg)
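
To make those numbers concrete, here is a tiny standalone sketch
(illustration only, not kernel code; it follows the arithmetic of this
paragraph, capacity * 100 / imbalance_pct, and ignores the additional
"more tasks than CPUs" condition, which is met here anyway):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long capacity = 1024;	/* single 1024-capacity CPU */
		const unsigned int imbalance_pct = 117;	/* default MC value */
		unsigned long threshold = capacity * 100 / imbalance_pct; /* ~875 */
		unsigned long util = 0, spare = capacity;

		for (int forks = 1; forks <= 3; forks++) {
			spare /= 2;	/* each fork gets half the remaining spare */
			util += spare;	/* 512, then 768, then 896 */
			printf("%d fork(s): util=%lu -> overloaded: %s\n",
			       forks, util, util > threshold ? "yes" : "no");
		}
		return 0;
	}

Only the third fork pushes the group past the ~875 threshold, which is
where the "at least 3 tasks" figure comes from.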

With your change, it only takes 2 tasks. I know I'm being nitpicky here, but
I feel like those should be aligned, unless we have a proper argument against
it - in which case this should also appear in the changelog with so far only
mentions issues with util_avg migration, not the fork time initialization.

> @@ -796,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
>                 }
>         }
> 
> +       sa->runnable_avg = cpu_scale;
> +
>         if (p->sched_class != &fair_sched_class) {
>                 /*
>                  * For !fair tasks do:
>>
>>>               sa->load_avg = scale_load_down(se->load.weight);
>>> +     }
>>>
>>>       /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>>>  }


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-20 16:11         ` Valentin Schneider
@ 2020-02-21  8:56           ` Vincent Guittot
  2020-02-24 15:57             ` Valentin Schneider
  2020-02-21  9:04           ` Mel Gorman
  1 sibling, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-21  8:56 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

On Thu, 20 Feb 2020 at 17:11, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 20/02/2020 14:36, Vincent Guittot wrote:
> > I agree that setting by default to SCHED_CAPACITY_SCALE is too much
> > for little core.
> > The problem for little core can be fixed by using the cpu capacity instead
> >
>
> So that's indeed better for big.LITTLE & co. Any reason however for not
> aligning with the initialization of util_avg ?

The runnable_avg is the unweighted version of the load_avg so they
should both be sync at init and SCHED_CAPACITY_SCALE is in fact the
right value. Using cpu_scale is the same for smp and big core so we
can use it instead.

Then, the initial value of util_avg has never reflected some kind of
realistic value for the utilization of a new task, especially if those
tasks will become big ones. Runnable_avg now balances this effect to
say that we don't know what will be the behavior of the new task,
which might end up using all spare capacity although current
utilization is low and CPU is not "fully used". In fact, this is
exactly the purpose of runnable: highlight that there is maybe no
spare capacity even if CPU's utilization is low because of external
event like task migration or having new tasks with most probably wrong
utilization.

That being said, there is a bigger problem with the current version of
this patch, which is that I forgot to use runnable in
update_sg_wakeup_stats(). I have a patch that fixes this problem.

Also, I have tested both proposals with hackbench on my octo cores and
using cpu_scale gives slightly better results than util_avg, which
probably reflects the case I mentioned above.

grp     cpu_scale            util_avg               improvement
1       1,191(+/-0.77 %)     1,204(+/-1.16 %)       -1.07 %
4       1,147(+/-1.14 %)     1,195(+/-0.52 %)       -4.21 %
8       1,112(+/-1,52 %)     1,124(+/-1,45 %)       -1.12 %
16      1,163(+/-1.72 %)     1,169(+/-1.58 %)       -0,45 %

>
> With the default MC imbalance_pct (117), it takes 875 utilization to make
> a single CPU group (with 1024 capacity) overloaded (group_is_overloaded()).
> For a completely idle CPU, that means forking at least 3 tasks (512 + 256 +
> 128 util_avg)
>
> With your change, it only takes 2 tasks. I know I'm being nitpicky here, but
> I feel like those should be aligned, unless we have a proper argument against

I don't see any reason for keeping them aligned

> it - in which case this should also appear in the changelog with so far only
> mentions issues with util_avg migration, not the fork time initialization.
>
> > @@ -796,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
> >                 }
> >         }
> >
> > +       sa->runnable_avg = cpu_scale;
> > +
> >         if (p->sched_class != &fair_sched_class) {
> >                 /*
> >                  * For !fair tasks do:
> >>
> >>>               sa->load_avg = scale_load_down(se->load.weight);
> >>> +     }
> >>>
> >>>       /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
> >>>  }


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-20 16:11         ` Valentin Schneider
  2020-02-21  8:56           ` Vincent Guittot
@ 2020-02-21  9:04           ` Mel Gorman
  2020-02-21  9:25             ` Vincent Guittot
  1 sibling, 1 reply; 49+ messages in thread
From: Mel Gorman @ 2020-02-21  9:04 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Phil Auld, Parth Shah, Hillf Danton

On Thu, Feb 20, 2020 at 04:11:18PM +0000, Valentin Schneider wrote:
> On 20/02/2020 14:36, Vincent Guittot wrote:
> > I agree that setting by default to SCHED_CAPACITY_SCALE is too much
> > for little core.
> > The problem for little core can be fixed by using the cpu capacity instead
> > 
> 
> So that's indeed better for big.LITTLE & co. Any reason however for not
> aligning with the initialization of util_avg ?
> 
> With the default MC imbalance_pct (117), it takes 875 utilization to make
> a single CPU group (with 1024 capacity) overloaded (group_is_overloaded()).
> For a completely idle CPU, that means forking at least 3 tasks (512 + 256 +
> 128 util_avg)
> 
> With your change, it only takes 2 tasks. I know I'm being nitpicky here, but
> I feel like those should be aligned, unless we have a proper argument against
> it - in which case this should also appear in the changelog with so far only
> mentions issues with util_avg migration, not the fork time initialization.
> 

So, what is the way forward here? Should this patch be modified now,
a patch be placed on top or go with what we have for the moment that
works for symmetric CPUs and deal with the asym case later?

I do not have any asym systems at all so I've no means of checking
whether there is a problem or not.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-21  9:04           ` Mel Gorman
@ 2020-02-21  9:25             ` Vincent Guittot
  2020-02-21 10:40               ` Mel Gorman
  0 siblings, 1 reply; 49+ messages in thread
From: Vincent Guittot @ 2020-02-21  9:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Phil Auld, Parth Shah, Hillf Danton

On Fri, 21 Feb 2020 at 10:04, Mel Gorman <mgorman@suse.de> wrote:
>
> On Thu, Feb 20, 2020 at 04:11:18PM +0000, Valentin Schneider wrote:
> > On 20/02/2020 14:36, Vincent Guittot wrote:
> > > I agree that setting by default to SCHED_CAPACITY_SCALE is too much
> > > for little core.
> > > The problem for little core can be fixed by using the cpu capacity instead
> > >
> >
> > So that's indeed better for big.LITTLE & co. Any reason however for not
> > aligning with the initialization of util_avg ?
> >
> > With the default MC imbalance_pct (117), it takes 875 utilization to make
> > a single CPU group (with 1024 capacity) overloaded (group_is_overloaded()).
> > For a completely idle CPU, that means forking at least 3 tasks (512 + 256 +
> > 128 util_avg)
> >
> > With your change, it only takes 2 tasks. I know I'm being nitpicky here, but
> > I feel like those should be aligned, unless we have a proper argument against
> > it - in which case this should also appear in the changelog with so far only
> > mentions issues with util_avg migration, not the fork time initialization.
> >
>
> So, what is the way forward here? Should this patch be modified now,
> a patch be placed on top or go with what we have for the moment that
> works for symmetric CPUs and deal with the asym case later?
>
> I do not have any asym systems at all so I've no means of checking
> whether there is a problem or not.

I'm going to send a new version at least for patch 4 and 5 using
cpu_scale as initial value and fixing update_sg_wakeup_stats()

>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-19 12:55   ` [PATCH v3 " Vincent Guittot
                       ` (2 preceding siblings ...)
  2020-02-20 15:04     ` Dietmar Eggemann
@ 2020-02-21  9:44     ` Dietmar Eggemann
  2020-02-21 11:47       ` Vincent Guittot
  3 siblings, 1 reply; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-21  9:44 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton

On 19/02/2020 13:55, Vincent Guittot wrote:

[...]

> +static inline long se_runnable(struct sched_entity *se)
> +{

Why return long here? sched_entity::runnable_weight is unsigned long
but could be unsigned int (cfs_rq::h_nr_running is unsigned int).

___update_load_sum() takes 'unsigned long runnable' as a parameter.
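
For reference, a hypothetical sketch of the helper with that suggestion
applied (not the patch as posted; the task branch is an assumption, the
group branch mirrors se_update_runnable() keeping runnable_weight in sync
with my_q->h_nr_running):

	static inline unsigned int se_runnable(struct sched_entity *se)
	{
		/* a runnable task counts as 1, a sleeping one as 0 */
		if (entity_is_task(se))
			return !!se->on_rq;

		/* group entity: number of runnable tasks in its cfs_rq */
		return se->runnable_weight;
	}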


* Re: [PATCH v2 3/5] sched/pelt: Remove unused runnable load average
  2020-02-14 15:27 ` [PATCH v2 3/5] sched/pelt: Remove unused runnable load average Vincent Guittot
@ 2020-02-21  9:57   ` Dietmar Eggemann
  2020-02-21 11:56     ` Vincent Guittot
  0 siblings, 1 reply; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-21  9:57 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton

On 14/02/2020 16:27, Vincent Guittot wrote:

[...]

> #ifdef CONFIG_SMP
> @@ -2940,15 +2913,12 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>  		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
>  
>  		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> -		se->avg.runnable_load_avg =
> -			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
>  	} while (0);
>  #endif
>  
>  	enqueue_load_avg(cfs_rq, se);
>  	if (se->on_rq) {
>  		account_entity_enqueue(cfs_rq, se);
> -		enqueue_runnable_load_avg(cfs_rq, se);
>  	}

Nit pick: No curly brackets needed anymore.
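
i.e. with the enqueue_runnable_load_avg() call gone, the tail of the
function presumably collapses to:

	enqueue_load_avg(cfs_rq, se);
	if (se->on_rq)
		account_entity_enqueue(cfs_rq, se);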

[...]


* Re: [PATCH v2 0/5] remove runnable_load_avg and improve group_classify
  2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
                   ` (5 preceding siblings ...)
  2020-02-15 21:58 ` [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Mel Gorman
@ 2020-02-21  9:58 ` Dietmar Eggemann
  6 siblings, 0 replies; 49+ messages in thread
From: Dietmar Eggemann @ 2020-02-21  9:58 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: pauld, parth, valentin.schneider, hdanton

On 14/02/2020 16:27, Vincent Guittot wrote:
> This new version stays quite close to the previous one and should 
> replace without problems the previous one that part of Mel's patchset:
> https://lkml.org/lkml/2020/2/14/156
> 
> NUMA load balancing is the last remaining piece of code that uses the 
> runnable_load_avg of PELT to balance tasks between nodes. The normal
> load_balance has replaced it by a better description of the current state
> of the group of cpus.  The same policy can be applied to the numa
> balancing.
> 
> Once unused, runnable_load_avg can be replaced by a simpler runnable_avg
> signal that tracks the waiting time of tasks on rq. Currently, the state
> of a group of CPUs is defined thanks to the number of running task and the
> level of utilization of rq. But the utilization can be temporarly low
> after the migration of a task whereas the rq is still overloaded with
> tasks. In such case where tasks were competing for the rq, the
> runnable_avg will stay high after the migration.

With those small things I commented on:

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-21  9:25             ` Vincent Guittot
@ 2020-02-21 10:40               ` Mel Gorman
  2020-02-21 13:28                 ` Vincent Guittot
  0 siblings, 1 reply; 49+ messages in thread
From: Mel Gorman @ 2020-02-21 10:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Phil Auld, Parth Shah, Hillf Danton

On Fri, Feb 21, 2020 at 10:25:27AM +0100, Vincent Guittot wrote:
> On Fri, 21 Feb 2020 at 10:04, Mel Gorman <mgorman@suse.de> wrote:
> >
> > On Thu, Feb 20, 2020 at 04:11:18PM +0000, Valentin Schneider wrote:
> > > On 20/02/2020 14:36, Vincent Guittot wrote:
> > > > I agree that setting by default to SCHED_CAPACITY_SCALE is too much
> > > > for little core.
> > > > The problem for little core can be fixed by using the cpu capacity instead
> > > >
> > >
> > > So that's indeed better for big.LITTLE & co. Any reason however for not
> > > aligning with the initialization of util_avg ?
> > >
> > > With the default MC imbalance_pct (117), it takes 875 utilization to make
> > > a single CPU group (with 1024 capacity) overloaded (group_is_overloaded()).
> > > For a completely idle CPU, that means forking at least 3 tasks (512 + 256 +
> > > 128 util_avg)
> > >
> > > With your change, it only takes 2 tasks. I know I'm being nitpicky here, but
> > > I feel like those should be aligned, unless we have a proper argument against
> > > it - in which case this should also appear in the changelog with so far only
> > > mentions issues with util_avg migration, not the fork time initialization.
> > >
> >
> > So, what is the way forward here? Should this patch be modified now,
> > a patch be placed on top or go with what we have for the moment that
> > works for symmetric CPUs and deal with the asym case later?
> >
> > I do not have any asym systems at all so I've no means of checking
> > whether there is a problem or not.
> 
> I'm going to send a new version at least for patch 4 and 5 using
> cpu_scale as initial value and fixing update_sg_wakeup_stats()
> 

No problem. FWIW, when I see them, I'll slot them in and rerun the tests
as the previous results will be invalidated. Obviously the asym case will
not be tested by me but I imagine you or Valentin have that covered.

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-21  9:44     ` Dietmar Eggemann
@ 2020-02-21 11:47       ` Vincent Guittot
  0 siblings, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-21 11:47 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider, Hillf Danton

On Fri, 21 Feb 2020 at 10:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 19/02/2020 13:55, Vincent Guittot wrote:
>
> [...]
>
> > +static inline long se_runnable(struct sched_entity *se)
> > +{
>
> Why return long here? sched_entity::runnable_weight is unsigned long
> but could be unsigned int (cfs_rq::h_nr_running is unsigned int).

I have reused the same prototype as for runnable_load_avg.

>
> ___update_load_sum() takes 'unsigned long runnable' as a parameter.


* Re: [PATCH v2 3/5] sched/pelt: Remove unused runnable load average
  2020-02-21  9:57   ` Dietmar Eggemann
@ 2020-02-21 11:56     ` Vincent Guittot
  0 siblings, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-21 11:56 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Phil Auld, Parth Shah,
	Valentin Schneider, Hillf Danton

On Fri, 21 Feb 2020 at 10:58, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 14/02/2020 16:27, Vincent Guittot wrote:
>
> [...]
>
> > #ifdef CONFIG_SMP
> > @@ -2940,15 +2913,12 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >               u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
> >
> >               se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> > -             se->avg.runnable_load_avg =
> > -                     div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
> >       } while (0);
> >  #endif
> >
> >       enqueue_load_avg(cfs_rq, se);
> >       if (se->on_rq) {
> >               account_entity_enqueue(cfs_rq, se);
> > -             enqueue_runnable_load_avg(cfs_rq, se);
> >       }
>
> Nit pick: No curly brackets needed anymore.
>

good catch

> [...]


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-21 10:40               ` Mel Gorman
@ 2020-02-21 13:28                 ` Vincent Guittot
  0 siblings, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-21 13:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel,
	Phil Auld, Parth Shah, Hillf Danton

On Fri, 21 Feb 2020 at 11:40, Mel Gorman <mgorman@suse.de> wrote:
>
> On Fri, Feb 21, 2020 at 10:25:27AM +0100, Vincent Guittot wrote:
> > On Fri, 21 Feb 2020 at 10:04, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > On Thu, Feb 20, 2020 at 04:11:18PM +0000, Valentin Schneider wrote:
> > > > On 20/02/2020 14:36, Vincent Guittot wrote:
> > > > > I agree that setting by default to SCHED_CAPACITY_SCALE is too much
> > > > > for little core.
> > > > > The problem for little core can be fixed by using the cpu capacity instead
> > > > >
> > > >
> > > > So that's indeed better for big.LITTLE & co. Any reason however for not
> > > > aligning with the initialization of util_avg ?
> > > >
> > > > With the default MC imbalance_pct (117), it takes 875 utilization to make
> > > > a single CPU group (with 1024 capacity) overloaded (group_is_overloaded()).
> > > > For a completely idle CPU, that means forking at least 3 tasks (512 + 256 +
> > > > 128 util_avg)
> > > >
> > > > With your change, it only takes 2 tasks. I know I'm being nitpicky here, but
> > > > I feel like those should be aligned, unless we have a proper argument against
> > > > it - in which case this should also appear in the changelog with so far only
> > > > mentions issues with util_avg migration, not the fork time initialization.
> > > >
> > >
> > > So, what is the way forward here? Should this patch be modified now,
> > > a patch be placed on top or go with what we have for the moment that
> > > works for symmetric CPUs and deal with the asym case later?
> > >
> > > I do not have any asym systems at all so I've no means of checking
> > > whether there is a problem or not.
> >
> > I'm going to send a new version at least for patch 4 and 5 using
> > cpu_scale as initial value and fixing update_sg_wakeup_stats()
> >
>
> No problem. FWIW, when I see them, I'll slot them in and rerun the tests
> as the previous results will be invalidated. Obviously the asym case will

I have just sent the new version.
Thanks for testing

> not be tested by me but I imagine you or Valentin have that covered.
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH v3 4/5] sched/pelt: Add a new runnable average signal
  2020-02-21  8:56           ` Vincent Guittot
@ 2020-02-24 15:57             ` Valentin Schneider
  0 siblings, 0 replies; 49+ messages in thread
From: Valentin Schneider @ 2020-02-24 15:57 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Hillf Danton

I somehow lost track of that email, sorry for the delayed response.

On 2/21/20 8:56 AM, Vincent Guittot wrote:
> On Thu, 20 Feb 2020 at 17:11, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> On 20/02/2020 14:36, Vincent Guittot wrote:
>>> I agree that setting by default to SCHED_CAPACITY_SCALE is too much
>>> for little core.
>>> The problem for little core can be fixed by using the cpu capacity instead
>>>
>>
>> So that's indeed better for big.LITTLE & co. Any reason however for not
>> aligning with the initialization of util_avg ?
> 
> The runnable_avg is the unweighted version of the load_avg so they
> should both be sync at init and SCHED_CAPACITY_SCALE is in fact the
> right value. Using cpu_scale is the same for smp and big core so we
> can use it instead.
> 
> Then, the initial value of util_avg has never reflected some kind of
> realistic value for the utilization of a new task, especially if those
> tasks will become big ones. Runnable_avg now balances this effect to
> say that we don't know what will be the behavior of the new task,
> which might end up using all spare capacity although current
> utilization is low and CPU is not "fully used".

I'd argue that the init values we pick for either runnable_avg or util_avg
are both equally bogus.

> In fact, this is
> exactly the purpose of runnable: highlight that there is maybe no
> spare capacity even if CPU's utilization is low because of external
> event like task migration or having new tasks with most probably wrong
> utilization.
> 
> That being said, there is a bigger problem with the current version of
> this patch, which is that I forgot to use runnable in
> update_sg_wakeup_stats(). I have a patch that fixes this problem.
> 
> Also, I have tested both proposals with hackbench on my octo cores and
> using cpu_scale gives slightly better results than util_avg, which
> probably reflects the case I mentioned above.
> 
> grp     cpu_scale            util_avg               improvement
> 1       1,191(+/-0.77 %)     1,204(+/-1.16 %)       -1.07 %
> 4       1,147(+/-1.14 %)     1,195(+/-0.52 %)       -4.21 %
> 8       1,112(+/-1,52 %)     1,124(+/-1,45 %)       -1.12 %
> 16      1,163(+/-1.72 %)     1,169(+/-1.58 %)       -0,45 %
> 

Interesting, thanks for providing the numbers. I'd be curious to figure out
where the difference really stems from, but in the meantime consider me
convinced ;)


* Re: [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path
       [not found]               ` <20200222152541.GA11669@geo.homenetwork>
@ 2020-02-26 16:30                 ` Vincent Guittot
  0 siblings, 0 replies; 49+ messages in thread
From: Vincent Guittot @ 2020-02-26 16:30 UTC (permalink / raw)
  To: Tao Zhou
  Cc: Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, Phil Auld,
	Parth Shah, Valentin Schneider, Hillf Danton, T. Zhou

Hi Tao,

On Sat, 22 Feb 2020 at 16:23, Tao Zhou <zhout@vivaldi.net> wrote:
>
> Hi,
>
> On Thu, Feb 20, 2020 at 02:38:59PM +0100, Dietmar Eggemann wrote:
> > On 19/02/2020 17:26, Vincent Guittot wrote:
> > > On Wed, 19 Feb 2020 at 12:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >>
> > >> On 18/02/2020 15:15, Vincent Guittot wrote:
> > >>> On Tue, 18 Feb 2020 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > >>>>
> > >>>> On Tue, Feb 18, 2020 at 01:37:37PM +0100, Dietmar Eggemann wrote:
> > >>>>> On 14/02/2020 16:27, Vincent Guittot wrote:
> > >>>>>> The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> > >>>>>> is split in 2 distinct parts for throttled cfs_rq without any added value
> > >>>>>> but making code less readable.
> > >>>>>>
> > >>>>>> Change the code ordering such that everything related to a cfs_rq
> > >>>>>> (throttled or not) will be done in the same loop.
> > >>>>>>
> > >>>>>> In addition, the same steps ordering is used when updating a cfs_rq:
> > >>>>>> - update_load_avg
> > >>>>>> - update_cfs_group
> > >>>>>> - update *h_nr_running
> > >>>>>
> > >>>>> Is this code change really necessary? You pay with two extra goto's. We
> > >>>>> still have the two for_each_sched_entity(se)'s because of 'if
> > >>>>> (se->on_rq); break;'.
> > >>>>
> > >>>> IIRC he relies on the presented ordering in patch #5 -- adding the
> > >>>> running_avg metric.
> > >>>
> > >>> Yes, that's the main reason, updating load_avg before h_nr_running
> > >>
> > >> My hunch is you refer to the new function:
> > >>
> > >> static inline void se_update_runnable(struct sched_entity *se)
> > >> {
> > >>         if (!entity_is_task(se))
> > >>                 se->runnable_weight = se->my_q->h_nr_running;
> > >> }
> > >>
> > >> I don't see the dependency to the 'update_load_avg -> h_nr_running'
> > >> order since it operates on se->my_q, not cfs_rq = cfs_rq_of(se), i.e.
> > >> se->cfs_rq.
> > >>
> > >> What do I miss here?
> > >
> > > update_load_avg() updates both se and cfs_rq so if you update
> > > cfs_rq->h_nr_running before calling update_load_avg() like in the 2nd
> > > for_each_sched_entity, you will update cfs_rq runnable_avg for the
> > > past time slot with the new h_nr_running value instead of the previous
> > > value.
> >
> > Ah, now I see:
> >
> > update_load_avg()
> >   update_cfs_rq_load_avg()
> >     __update_load_avg_cfs_rq()
> >        ___update_load_sum(..., cfs_rq->h_nr_running, ...)
> >
> >                                ^^^^^^^^^^^^^^^^^^^^
>
> throttle/unthrottle_cfs_rq() update h_nr_running also. Maybe need
> to consider call se_update_runnable there.

Yes, you're right: group entities are not enqueued/dequeued during
throttling/unthrottling, which is the path that updates runnable_avg
and se->runnable_weight, so those values are not refreshed there. I'm
going to send a fix on top of sched/core.

> And the name update_se_runnable seems to be consistent with others.

In fact, I have reused the same name that was used for runnable_load_avg.

Thanks and sorry for the late reply

>
> Thanks,
> Tao
>
> > Not really obvious IMHO, since the code is introduced only in 4/5.
> >
> > Could you add a comment to this patch header?
> >
> > I see you mentioned this dependency already in v1 discussion
> >
> > https://lore.kernel.org/r/CAKfTPtAM=kgF7Fz-JKFY+s_k5KFirs-8Bub3s1Eqtq7P0NMa0w@mail.gmail.com
> >
> > "... But the following patches make PELT using h_nr_running ...".
> >
> > IMHO it would be helpful to have this explanation in the 1/5 patch
> > header so people stop wondering why this is necessary.
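
For readers following the ordering argument in the quotes above
(___update_load_sum() consumes cfs_rq->h_nr_running for the time that has
already elapsed), a minimal sketch of the invariant being discussed,
simplified and not the actual enqueue_task_fair() code (the real loop also
performs the enqueue itself and the throttling checks):

	for_each_sched_entity(se) {
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		/* fold the elapsed window while h_nr_running still holds the old value */
		update_load_avg(cfs_rq, se, UPDATE_TG);
		se_update_runnable(se);
		update_cfs_group(se);

		/* only now account the newly enqueued task */
		cfs_rq->h_nr_running++;
	}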



Thread overview: 49+ messages
2020-02-14 15:27 [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Vincent Guittot
2020-02-14 15:27 ` [PATCH v2 1/5] sched/fair: Reorder enqueue/dequeue_task_fair path Vincent Guittot
2020-02-18 12:37   ` Dietmar Eggemann
2020-02-18 13:22     ` Peter Zijlstra
2020-02-18 14:15       ` Vincent Guittot
2020-02-19 11:07         ` Dietmar Eggemann
2020-02-19 16:26           ` Vincent Guittot
2020-02-20 13:38             ` Dietmar Eggemann
     [not found]               ` <20200222152541.GA11669@geo.homenetwork>
2020-02-26 16:30                 ` Vincent Guittot
2020-02-14 15:27 ` [PATCH v2 2/5] sched/numa: Replace runnable_load_avg by load_avg Vincent Guittot
2020-02-18 12:37   ` Dietmar Eggemann
2020-02-18 13:50     ` Mel Gorman
2020-02-18 14:17       ` Vincent Guittot
2020-02-18 14:42         ` Dietmar Eggemann
2020-02-18 14:54   ` Valentin Schneider
2020-02-18 15:33     ` Vincent Guittot
2020-02-18 15:38     ` Mel Gorman
2020-02-18 16:50       ` Valentin Schneider
2020-02-18 17:41         ` Mel Gorman
2020-02-18 17:54           ` Valentin Schneider
2020-02-18 16:51       ` Vincent Guittot
2020-02-14 15:27 ` [PATCH v2 3/5] sched/pelt: Remove unused runnable load average Vincent Guittot
2020-02-21  9:57   ` Dietmar Eggemann
2020-02-21 11:56     ` Vincent Guittot
2020-02-14 15:27 ` [PATCH v2 4/5] sched/pelt: Add a new runnable average signal Vincent Guittot
2020-02-18 14:54   ` Valentin Schneider
2020-02-18 15:12     ` Peter Zijlstra
2020-02-18 15:28     ` Vincent Guittot
2020-02-18 16:30       ` Valentin Schneider
2020-02-18 21:19   ` Valentin Schneider
2020-02-19  9:02     ` Vincent Guittot
2020-02-19  9:08     ` Mel Gorman
2020-02-19 12:55   ` [PATCH v3 " Vincent Guittot
2020-02-19 14:02     ` Mel Gorman
2020-02-19 20:10     ` Valentin Schneider
2020-02-20 14:36       ` Vincent Guittot
2020-02-20 16:11         ` Valentin Schneider
2020-02-21  8:56           ` Vincent Guittot
2020-02-24 15:57             ` Valentin Schneider
2020-02-21  9:04           ` Mel Gorman
2020-02-21  9:25             ` Vincent Guittot
2020-02-21 10:40               ` Mel Gorman
2020-02-21 13:28                 ` Vincent Guittot
2020-02-20 15:04     ` Dietmar Eggemann
2020-02-21  9:44     ` Dietmar Eggemann
2020-02-21 11:47       ` Vincent Guittot
2020-02-14 15:27 ` [PATCH v2 5/5] sched/fair: Take into account runnable_avg to classify group Vincent Guittot
2020-02-15 21:58 ` [PATCH v2 0/5] remove runnable_load_avg and improve group_classify Mel Gorman
2020-02-21  9:58 ` Dietmar Eggemann
