linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
@ 2016-08-10  0:14 Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 01/10] sched/fair: Change LOAD_AVG_MAX_N from 345 to 347 Yuyang Du
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

Hi Peter,

I should have sent out my flat util hierarchy implementation a long time
ago; the code was actually there but not rebased. I finally have time to
do this, so here it is. There are also other proposals to solve the
migrated tasks' util mobility problem, such as the ones from Dietmar and
Vincent.

The sched avgs computation optimization was initiated for the flat util
work, so I am sending them out together.

Following Morten's and Ben's feedback, I removed the 32-bit upper bound
on the number of periods. Thanks a lot to them.

Thanks,
Yuyang

--

Yuyang Du (10):
  sched/fair: Change LOAD_AVG_MAX_N from 345 to 347
  documentation: Add scheduler/sched-avg.txt
  sched/fair: Add static to remove_entity_load_avg()
  sched/fair: Rename variable names for sched averages
  sched/fair: Add __always_inline compiler attribute to
    __accumulate_sum()
  sched/fair: Optimize __update_sched_avg()
  sched/fair: Remove useless 64-bit to 32-bit variable conversion
  sched/fair: Remove scale_load_down() for load_avg
  sched/fair: Rename scale_load() and scale_load_down()
  sched/fair: Implement flat hierarchical structure for util_avg

 Documentation/scheduler/sched-avg.txt |   94 ++++++
 include/linux/sched.h                 |   21 +-
 kernel/sched/core.c                   |   13 +-
 kernel/sched/debug.c                  |   13 +-
 kernel/sched/fair.c                   |  557 ++++++++++++++++++---------------
 kernel/sched/sched.h                  |   27 +-
 6 files changed, 436 insertions(+), 289 deletions(-)
 create mode 100644 Documentation/scheduler/sched-avg.txt

-- 
1.7.9.5

* [PATCH v1 01/10] sched/fair: Change LOAD_AVG_MAX_N from 345 to 347
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-24 16:01   ` Vincent Guittot
  2016-08-10  0:14 ` [PATCH v1 02/10] documentation: Add scheduler/sched-avg.txt Yuyang Du
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

In commit 5b51f2f80b3b906ce59bd4dce6eca3c7f34cb1b9
Author: Paul Turner <pjt@google.com>
Date:   Thu Oct 4 13:18:32 2012 +0200

    sched: Make __update_entity_runnable_avg() fast

Paul has a program to compute LOAD_AVG_MAX_N, which is basically the
minimum number of periods needed to produce LOAD_AVG_MAX; the result
of calc_conv(1024) is 345:

  long mult_inv(long c, int n) {
  	return (c * runnable_avg_yN_inv[n]) >>  WMULT_SHIFT;
  }

  void calc_conv(long n) {
  	long old_n;
  	int i = -1;

  	printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n");
  	do {
  		old_n = n;
  		n = mult_inv(n, 1) + 1024;
  		i++;
  	} while (n != old_n);
  	printf("%d> %ld\n", i - 1, n);
  	printf("\n");
  }

The initial value of i is -1, but it should be 1 as far as I can tell
(note that n already holds the first period's sum, 1024, before the loop
even starts). Accordingly, the final result, LOAD_AVG_MAX_N, should be
changed from 345 to 347.
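
For a cross-check, the generator program added in patch 02 of this series
counts full periods explicitly; a condensed, standalone version of its
calc_converged_max() (illustrative sketch only, compile with -lm) is:

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
  	double y = pow(0.5, 1 / 32.0);		/* y^32 = 0.5 */
  	long max = 1024, last = 0, y_inv = ((1UL << 32) - 1) * y;
  	int n;

  	for (n = 1; ; n++) {
  		if (n > 1)			/* max = max*y + 1024 */
  			max = ((max * y_inv) >> 32) + 1024;
  		if (last == max)		/* converged */
  			break;
  		last = max;
  	}
  	n--;
  	/* n is what the patch 02 program prints as SCHED_AVG_MAX_N */
  	printf("max = %ld, full periods = %d\n", max, n);
  	return 0;
  }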

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4088eed..997297b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -666,7 +666,7 @@ static unsigned long task_h_load(struct task_struct *p);
  */
 #define LOAD_AVG_PERIOD 32
 #define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+#define LOAD_AVG_MAX_N 347 /* number of full periods to produce LOAD_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
-- 
1.7.9.5

* [PATCH v1 02/10] documentation: Add scheduler/sched-avg.txt
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 01/10] sched/fair: Change LOAD_AVG_MAX_N from 345 to 347 Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-22 16:48   ` Randy Dunlap
  2016-08-10  0:14 ` [PATCH v1 03/10] sched/fair: Add static to remove_entity_load_avg() Yuyang Du
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

This doc file contains the program used to generate the constants for
computing the sched averages.
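
For example, assuming the program is saved locally as sched-avg.c, it can
be built and run with "gcc -o sched-avg sched-avg.c -lm && ./sched-avg",
which prints the tables and #defines used in kernel/sched/fair.c.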

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 Documentation/scheduler/sched-avg.txt |   94 +++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 Documentation/scheduler/sched-avg.txt

diff --git a/Documentation/scheduler/sched-avg.txt b/Documentation/scheduler/sched-avg.txt
new file mode 100644
index 0000000..45be4bd
--- /dev/null
+++ b/Documentation/scheduler/sched-avg.txt
@@ -0,0 +1,94 @@
+The following program is used to generate the constants for
+computing sched averages.
+
+==============================================================
+		C program (compile with -lm)
+==============================================================
+
+#include <math.h>
+#include <stdio.h>
+
+#define HALFLIFE 32
+#define SHIFT 32
+
+double y;
+
+void calc_decay_inv_multiply() {
+	int i;
+	unsigned int x;
+
+	printf("static const u32 __decay_inv_multiply_N[] = {");
+	for(i = 0; i < HALFLIFE; i++) {
+		x = ((1UL<<32)-1)*pow(y, i);
+
+		if (i % 6 == 0) printf("\n\t");
+		printf("0x%8x, ", x);
+	}
+	printf("\n};\n\n");
+}
+
+int sum = 1024;
+void calc_accumulated_sum() {
+	int i;
+
+	printf("static const u32 __accumulated_sum_N[] = {\n\t    0,");
+	for(i = 1; i <= HALFLIFE; i++) {
+		if (i == 1)
+			sum *= y;
+		else
+			sum = sum*y + 1024*y;
+
+		if (i % 11 == 0) printf("\n\t");
+		printf("%5d,", sum);
+	}
+	printf("\n};\n\n");
+}
+
+int n = 1;
+/* first period */
+long max = 1024;
+
+void calc_converged_max() {
+	long last = 0, y_inv = ((1UL<<32)-1)*y;
+
+	for (; ; n++) {
+		if (n > 1)
+			max = ((max*y_inv)>>SHIFT) + 1024;
+			/*
+			 * This is the same as:
+			 * max = max*y + 1024;
+			 */
+
+		if (last == max)
+			break;
+
+		last = max;
+	}
+	n--;
+	printf("#define SCHED_AVG_HALFLIFE %d\n", HALFLIFE);
+	printf("#define SCHED_AVG_MAX %ld\n", max);
+	printf("#define SCHED_AVG_MAX_N %d\n\n", n);
+}
+
+void calc_accumulated_sum_32() {
+	int i, x = sum;
+
+	printf("static const u32 __accumulated_sum_N32[] = {\n\t     0,");
+	for(i = 1; i <= n/HALFLIFE+1; i++) {
+		if (i > 1)
+			x = x/2 + sum;
+
+		if (i % 6 == 0) printf("\n\t");
+		printf("%6d,", x);
+	}
+	printf("\n};\n\n");
+}
+
+void main() {
+	y = pow(0.5, 1/(double)HALFLIFE);
+
+	calc_decay_inv_multiply();
+	calc_accumulated_sum();
+	calc_converged_max();
+	calc_accumulated_sum_32();
+}
-- 
1.7.9.5

* [PATCH v1 03/10] sched/fair: Add static to remove_entity_load_avg()
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 01/10] sched/fair: Change LOAD_AVG_MAX_N from 345 to 347 Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 02/10] documentation: Add scheduler/sched-avg.txt Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 04/10] sched/fair: Rename variable names for sched averages Yuyang Du
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

remove_entity_load_avg() is only called in fair.c, so add static to it.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 997297b..dad1ba5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3115,7 +3115,7 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
  */
-void remove_entity_load_avg(struct sched_entity *se)
+static void remove_entity_load_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 last_update_time;
-- 
1.7.9.5

* [PATCH v1 04/10] sched/fair: Rename variable names for sched averages
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (2 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 03/10] sched/fair: Add static to remove_entity_load_avg() Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 05/10] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

The names of the sched averages (including load_avg and util_avg) have
been changed and added over the past couple of years; some of them are a
bit confusing, especially to people reading them for the first time. This
patch attempts to make the names more self-explanatory and also updates
some comments.

The renames are listed as follows:

 - init_entity_runnable_average() to init_entity_sched_avg()

 - post_init_entity_util_avg() to post_init_entity_sched_avg()

 - update_load_avg() to update_sched_avg()

 - enqueue_entity_load_avg() to enqueue_entity_sched_avg()

 - dequeue_entity_load_avg() to dequeue_entity_sched_avg()

 - detach_entity_load_avg() to detach_entity_sched_avg()

 - attach_entity_load_avg() to attach_entity_sched_avg()

 - remove_entity_load_avg() to remove_entity_sched_avg()

 - LOAD_AVG_PERIOD to SCHED_AVG_HALFLIFE

 - LOAD_AVG_MAX_N to SCHED_AVG_MAX_N

 - LOAD_AVG_MAX to SCHED_AVG_MAX

 - runnable_avg_yN_sum[] to __accumulated_sum_N[]

 - runnable_avg_yN_inv[] to __decay_inv_multiply_N[]

 - __compute_runnable_contrib() to __accumulate_sum()

 - decay_load() to __decay_sum()

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |    2 +-
 kernel/sched/core.c   |    4 +-
 kernel/sched/fair.c   |  275 ++++++++++++++++++++++++-------------------------
 kernel/sched/sched.h  |    6 +-
 4 files changed, 143 insertions(+), 144 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 67323aa..912830c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1229,7 +1229,7 @@ struct load_weight {
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
- * (see __update_load_avg() in kernel/sched/fair.c).
+ * (see __update_sched_avg() in kernel/sched/fair.c).
  *
  * [load_avg definition]
  *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5c883fe..30d429b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2383,7 +2383,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 		p->sched_class = &fair_sched_class;
 	}
 
-	init_entity_runnable_average(&p->se);
+	init_entity_sched_avg(&p->se);
 
 	/*
 	 * The child is not yet in the pid-hash so no cgroup attach races,
@@ -2545,7 +2545,7 @@ void wake_up_new_task(struct task_struct *p)
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
-	post_init_entity_util_avg(&p->se);
+	post_init_entity_sched_avg(&p->se);
 
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dad1ba5..06819bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -660,16 +660,18 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
- * We choose a half-life close to 1 scheduling period.
- * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
- * dependent on this value.
+ * Note: everything in sched average calculation, including
+ * __decay_inv_multiply_N, __accumulated_sum_N, __accumulated_sum_N32,
+ * SCHED_AVG_MAX, and SCHED_AVG_MAX_N, is dependent on and only on
+ * (1) exponential decay, (2) a period of 1024*1024ns (~1ms), and (3)
+ * a half-life of 32 periods.
  */
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 347 /* number of full periods to produce LOAD_AVG_MAX */
+#define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
+#define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
+#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
-void init_entity_runnable_average(struct sched_entity *se)
+void init_entity_sched_avg(struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
 
@@ -681,7 +683,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 */
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
-	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+	sa->load_sum = sa->load_avg * SCHED_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
@@ -691,9 +693,9 @@ void init_entity_runnable_average(struct sched_entity *se)
 }
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
-static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
+static int update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
 static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
+static void attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 /*
  * With new tasks being created, their initial util_avgs are extrapolated
@@ -720,7 +722,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
  * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
  * if util_avg > util_avg_cap.
  */
-void post_init_entity_util_avg(struct sched_entity *se)
+void post_init_entity_sched_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
@@ -738,7 +740,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		} else {
 			sa->util_avg = cap;
 		}
-		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+		sa->util_sum = sa->util_avg * SCHED_AVG_MAX;
 	}
 
 	if (entity_is_task(se)) {
@@ -747,8 +749,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
 			/*
 			 * For !fair tasks do:
 			 *
-			update_cfs_rq_load_avg(now, cfs_rq, false);
-			attach_entity_load_avg(cfs_rq, se);
+			update_cfs_rq_sched_avg(now, cfs_rq, false);
+			attach_entity_sched_avg(cfs_rq, se);
 			switched_from_fair(rq, p);
 			 *
 			 * such that the next switched_to_fair() has the
@@ -759,22 +761,16 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		}
 	}
 
-	tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
-	attach_entity_load_avg(cfs_rq, se);
+	tg_update = update_cfs_rq_sched_avg(now, cfs_rq, false);
+	attach_entity_sched_avg(cfs_rq, se);
 	if (tg_update)
 		update_tg_load_avg(cfs_rq, false);
 }
 
 #else /* !CONFIG_SMP */
-void init_entity_runnable_average(struct sched_entity *se)
-{
-}
-void post_init_entity_util_avg(struct sched_entity *se)
-{
-}
-static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
-{
-}
+void init_entity_sched_avg(struct sched_entity *se) { }
+void post_init_entity_sched_avg(struct sched_entity *se) { }
+static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) { }
 #endif /* CONFIG_SMP */
 
 /*
@@ -1839,7 +1835,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.load_sum / p->se.load.weight;
-		*period = LOAD_AVG_MAX;
+		*period = SCHED_AVG_MAX;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2583,7 +2579,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 #ifdef CONFIG_SMP
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
-static const u32 runnable_avg_yN_inv[] = {
+static const u32 __decay_inv_multiply_N[] = {
 	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
 	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
 	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
@@ -2596,7 +2592,7 @@ static const u32 runnable_avg_yN_inv[] = {
  * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
  * over-estimates when re-combining.
  */
-static const u32 runnable_avg_yN_sum[] = {
+static const u32 __accumulated_sum_N[] = {
 	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
 	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
 	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
@@ -2613,93 +2609,95 @@ static const u32 __accumulated_sum_N32[] = {
 };
 
 /*
- * Approximate:
- *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ * val * y^n, where y^m ~= 0.5
+ *
+ * n is the number of past periods; a period is ~1ms
+ * m is half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
  */
-static __always_inline u64 decay_load(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u64 n)
 {
 	unsigned int local_n;
 
 	if (!n)
 		return val;
-	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+	else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
 		return 0;
 
 	/* after bounds checking we can collapse to 32-bit */
 	local_n = n;
 
 	/*
-	 * As y^PERIOD = 1/2, we can combine
-	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
-	 * With a look-up table which covers y^n (n<PERIOD)
+	 * As y^HALFLIFE = 1/2, we can combine
+	 *    y^n = 1/2^(n/HALFLIFE) * y^(n%HALFLIFE)
+	 * With a look-up table which covers y^n (n<HALFLIFE)
 	 *
-	 * To achieve constant time decay_load.
+	 * To achieve constant time __decay_sum.
 	 */
-	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
-		val >>= local_n / LOAD_AVG_PERIOD;
-		local_n %= LOAD_AVG_PERIOD;
+	if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
+		val >>= local_n / SCHED_AVG_HALFLIFE;
+		local_n %= SCHED_AVG_HALFLIFE;
 	}
 
-	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
 	return val;
 }
 
 /*
- * For updates fully spanning n periods, the contribution to runnable
- * average will be: \Sum 1024*y^n
+ * For updates fully spanning n periods, the accumulated contribution
+ * will be: \Sum 1024*y^n.
  *
- * We can compute this reasonably efficiently by combining:
- *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
+ * We can compute this efficiently by combining:
+ * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __compute_runnable_contrib(u64 n)
+static u32 __accumulate_sum(u64 n)
 {
 	u32 contrib = 0;
 
-	if (likely(n <= LOAD_AVG_PERIOD))
-		return runnable_avg_yN_sum[n];
-	else if (unlikely(n >= LOAD_AVG_MAX_N))
-		return LOAD_AVG_MAX;
+	if (likely(n <= SCHED_AVG_HALFLIFE))
+		return __accumulated_sum_N[n];
+	else if (unlikely(n >= SCHED_AVG_MAX_N))
+		return SCHED_AVG_MAX;
 
-	/* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
-	contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
-	n %= LOAD_AVG_PERIOD;
-	contrib = decay_load(contrib, n);
-	return contrib + runnable_avg_yN_sum[n];
+	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
+	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
+	n %= SCHED_AVG_HALFLIFE;
+	contrib = __decay_sum(contrib, n);
+	return contrib + __accumulated_sum_N[n];
 }
 
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series.  To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ * We can represent the historical contribution to sched average as the
+ * coefficients of a geometric series.  To do this we divide the history
+ * into segments of approximately 1ms (1024*1024ns); label the segment that
+ * occurred N-1024us ago p_N, with p_0 corresponding to the current period, e.g.
  *
  * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
  *      p0            p1           p2
  *     (now)       (~1ms ago)  (~2ms ago)
  *
- * Let u_i denote the fraction of p_i that the entity was runnable.
+ * Let u_i denote the fraction of p_i whose state (runnable/running) we count.
  *
  * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
+ * following representation of a sched metric:
  *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
  *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- *   y^32 = 0.5
+ * We choose y based on a half-life of 32 periods (which is ~32ms):
+ *   y^32 = 0.5 => y = (0.5)^(1/32)
  *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
+ * where 32 is the number of periods after which a past period's contribution
+ * is halved. This means that a period's contribution is weighted only half
+ * as much for every additional ~32ms it ages.
  *
  * When a period "rolls over" and we have new u_0`, multiplying the previous
  * sum again by y is sufficient to update:
- *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ *   avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *       = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
+__update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
+		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta, scaled_delta, periods;
 	u32 contrib;
@@ -2759,15 +2757,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		periods = delta / 1024;
 		delta %= 1024;
 
-		sa->load_sum = decay_load(sa->load_sum, periods + 1);
+		sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
 		if (cfs_rq) {
 			cfs_rq->runnable_load_sum =
-				decay_load(cfs_rq->runnable_load_sum, periods + 1);
+				__decay_sum(cfs_rq->runnable_load_sum, periods + 1);
 		}
-		sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
+		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
-		contrib = __compute_runnable_contrib(periods);
+		contrib = __accumulate_sum(periods);
 		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
@@ -2791,12 +2789,12 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->period_contrib += delta;
 
 	if (decayed) {
-		sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+		sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
 		if (cfs_rq) {
 			cfs_rq->runnable_load_avg =
-				div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+				div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
 		}
-		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
 	}
 
 	return decayed;
@@ -2864,8 +2862,8 @@ void set_task_rq_fair(struct sched_entity *se,
 		p_last_update_time = prev->avg.last_update_time;
 		n_last_update_time = next->avg.last_update_time;
 #endif
-		__update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
-				  &se->avg, 0, 0, NULL);
+		__update_sched_avg(p_last_update_time, cpu_of(rq_of(prev)),
+				   &se->avg, 0, 0, NULL);
 		se->avg.last_update_time = n_last_update_time;
 	}
 }
@@ -2920,14 +2918,14 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 } while (0)
 
 /**
- * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
+ * update_cfs_rq_sched_avg - update the cfs_rq's load/util averages
  * @now: current time, as per cfs_rq_clock_task()
  * @cfs_rq: cfs_rq to update
  * @update_freq: should we call cfs_rq_util_change() or will the call do so
  *
  * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
  * avg. The immediate corollary is that all (fair) tasks must be attached, see
- * post_init_entity_util_avg().
+ * post_init_entity_sched_avg().
  *
  * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
  *
@@ -2937,7 +2935,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
  * avg up.
  */
 static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
+update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed, removed_load = 0, removed_util = 0;
@@ -2945,18 +2943,18 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		sub_positive(&sa->load_avg, r);
-		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
+		sub_positive(&sa->load_sum, r * SCHED_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
 		sub_positive(&sa->util_avg, r);
-		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
+		sub_positive(&sa->util_sum, r * SCHED_AVG_MAX);
 		removed_util = 1;
 	}
 
-	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+	decayed = __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
 		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
@@ -2971,7 +2969,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 }
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_sched_avg(struct sched_entity *se, int update_tg)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -2982,23 +2980,23 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	__update_sched_avg(now, cpu, &se->avg,
+			   se->on_rq * scale_load_down(se->load.weight),
+			   cfs_rq->curr == se, NULL);
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
+	if (update_cfs_rq_sched_avg(now, cfs_rq, true) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
 }
 
 /**
- * attach_entity_load_avg - attach this entity to its cfs_rq load avg
+ * attach_entity_sched_avg - attach this entity to its cfs_rq load avg
  * @cfs_rq: cfs_rq to attach to
  * @se: sched_entity to attach
  *
- * Must call update_cfs_rq_load_avg() before this, since we rely on
+ * Must call update_cfs_rq_sched_avg() before this, since we rely on
  * cfs_rq->avg.last_update_time being current.
  */
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (!sched_feat(ATTACH_AGE_LOAD))
 		goto skip_aging;
@@ -3007,11 +3005,11 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * If we got migrated (either between CPUs or between cgroups) we'll
 	 * have aged the average right before clearing @last_update_time.
 	 *
-	 * Or we're fresh through post_init_entity_util_avg().
+	 * Or we're fresh through post_init_entity_sched_avg().
 	 */
 	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-				  &se->avg, 0, 0, NULL);
+		__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+				   &se->avg, 0, 0, NULL);
 
 		/*
 		 * XXX: we could have just aged the entire load away if we've been
@@ -3030,18 +3028,18 @@ skip_aging:
 }
 
 /**
- * detach_entity_load_avg - detach this entity from its cfs_rq load avg
+ * detach_entity_sched_avg - detach this entity from its cfs_rq load avg
  * @cfs_rq: cfs_rq to detach from
  * @se: sched_entity to detach
  *
- * Must call update_cfs_rq_load_avg() before this, since we rely on
+ * Must call update_cfs_rq_sched_avg() before this, since we rely on
  * cfs_rq->avg.last_update_time being current.
  */
-static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+			   &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			   cfs_rq->curr == se, NULL);
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
@@ -3053,7 +3051,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 /* Add the load generated by se into cfs_rq's load average */
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -3061,18 +3059,18 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	migrated = !sa->last_update_time;
 	if (!migrated) {
-		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
-			cfs_rq->curr == se, NULL);
+		__update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+				   se->on_rq * scale_load_down(se->load.weight),
+				   cfs_rq->curr == se, NULL);
 	}
 
-	decayed = update_cfs_rq_load_avg(now, cfs_rq, !migrated);
+	decayed = update_cfs_rq_sched_avg(now, cfs_rq, !migrated);
 
 	cfs_rq->runnable_load_avg += sa->load_avg;
 	cfs_rq->runnable_load_sum += sa->load_sum;
 
 	if (migrated)
-		attach_entity_load_avg(cfs_rq, se);
+		attach_entity_sched_avg(cfs_rq, se);
 
 	if (decayed || migrated)
 		update_tg_load_avg(cfs_rq, 0);
@@ -3080,9 +3078,9 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
+	update_sched_avg(se, 1);
 
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
@@ -3115,24 +3113,25 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
  */
-static void remove_entity_load_avg(struct sched_entity *se)
+static void remove_entity_sched_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 last_update_time;
 
 	/*
 	 * tasks cannot exit without having gone through wake_up_new_task() ->
-	 * post_init_entity_util_avg() which will have added things to the
+	 * post_init_entity_sched_avg() which will have added things to the
 	 * cfs_rq, so we can remove unconditionally.
 	 *
 	 * Similarly for groups, they will have passed through
-	 * post_init_entity_util_avg() before unregister_sched_fair_group()
+	 * post_init_entity_sched_avg() before unregister_sched_fair_group()
 	 * calls this.
 	 */
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
 
-	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+	__update_sched_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
+			   &se->avg, 0, 0, NULL);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
 	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
 }
@@ -3152,12 +3151,12 @@ static int idle_balance(struct rq *this_rq);
 #else /* CONFIG_SMP */
 
 static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
+update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	return 0;
 }
 
-static inline void update_load_avg(struct sched_entity *se, int not_used)
+static inline void update_sched_avg(struct sched_entity *se, int not_used)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct rq *rq = rq_of(cfs_rq);
@@ -3166,15 +3165,15 @@ static inline void update_load_avg(struct sched_entity *se, int not_used)
 }
 
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void remove_entity_load_avg(struct sched_entity *se) {}
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void remove_entity_sched_avg(struct sched_entity *se) {}
 
 static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 
 static inline int idle_balance(struct rq *rq)
 {
@@ -3367,7 +3366,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_entity_sched_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -3446,7 +3445,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_entity_sched_avg(cfs_rq, se);
 
 	if (schedstat_enabled())
 		update_stats_dequeue(cfs_rq, se, flags);
@@ -3526,7 +3525,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (schedstat_enabled())
 			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3630,7 +3629,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_sched_avg(prev, 0);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3646,7 +3645,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_sched_avg(curr, 1);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4535,7 +4534,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4594,7 +4593,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -5496,7 +5495,7 @@ static void migrate_task_rq_fair(struct task_struct *p)
 	 * will result in the wakee task is less decayed, but giving the wakee more
 	 * load sounds not bad.
 	 */
-	remove_entity_load_avg(&p->se);
+	remove_entity_sched_avg(&p->se);
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;
@@ -5507,7 +5506,7 @@ static void migrate_task_rq_fair(struct task_struct *p)
 
 static void task_dead_fair(struct task_struct *p)
 {
-	remove_entity_load_avg(&p->se);
+	remove_entity_sched_avg(&p->se);
 }
 #endif /* CONFIG_SMP */
 
@@ -6388,7 +6387,7 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
+		if (update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -6449,7 +6448,7 @@ static inline void update_blocked_averages(int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 	update_rq_clock(rq);
-	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
+	update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
@@ -8453,8 +8452,8 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
-	detach_entity_load_avg(cfs_rq, se);
+	tg_update = update_cfs_rq_sched_avg(now, cfs_rq, false);
+	detach_entity_sched_avg(cfs_rq, se);
 	if (tg_update)
 		update_tg_load_avg(cfs_rq, false);
 }
@@ -8475,8 +8474,8 @@ static void attach_task_cfs_rq(struct task_struct *p)
 #endif
 
 	/* Synchronize task with its cfs_rq */
-	tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
-	attach_entity_load_avg(cfs_rq, se);
+	tg_update = update_cfs_rq_sched_avg(now, cfs_rq, false);
+	attach_entity_sched_avg(cfs_rq, se);
 	if (tg_update)
 		update_tg_load_avg(cfs_rq, false);
 
@@ -8621,7 +8620,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 		init_cfs_rq(cfs_rq);
 		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
-		init_entity_runnable_average(se);
+		init_entity_sched_avg(se);
 	}
 
 	return 1;
@@ -8643,7 +8642,7 @@ void online_fair_sched_group(struct task_group *tg)
 		se = tg->se[i];
 
 		raw_spin_lock_irq(&rq->lock);
-		post_init_entity_util_avg(se);
+		post_init_entity_sched_avg(se);
 		sync_throttle(tg, i);
 		raw_spin_unlock_irq(&rq->lock);
 	}
@@ -8657,7 +8656,7 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 	for_each_possible_cpu(cpu) {
 		if (tg->se[cpu])
-			remove_entity_load_avg(tg->se[cpu]);
+			remove_entity_sched_avg(tg->se[cpu]);
 
 		/*
 		 * Only empty task groups can be destroyed; so we can speculatively
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc51..132a0fa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1325,8 +1325,8 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 
-extern void init_entity_runnable_average(struct sched_entity *se);
-extern void post_init_entity_util_avg(struct sched_entity *se);
+extern void init_entity_sched_avg(struct sched_entity *se);
+extern void post_init_entity_sched_avg(struct sched_entity *se);
 
 #ifdef CONFIG_NO_HZ_FULL
 extern bool sched_can_stop_tick(struct rq *rq);
@@ -1768,7 +1768,7 @@ DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
  * @max: Utilization ceiling.
  *
  * This function is called by the scheduler on every invocation of
- * update_load_avg() on the CPU whose utilization is being updated.
+ * update_sched_avg() on the CPU whose utilization is being updated.
  *
  * It can only be called from RCU-sched read-side critical sections.
  */
-- 
1.7.9.5

* [PATCH v1 05/10] sched/fair: Add __always_inline compiler attribute to __accumulate_sum()
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (3 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 04/10] sched/fair: Rename variable names for sched averages Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 06/10] sched/fair: Optimize __update_sched_avg() Yuyang Du
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

__accumulate_sum()'s caller and sibling functions are all inlined, and
it is called on almost every invocation of its caller. It does not make
sense for it alone to remain uninlined.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06819bb..6b10fb0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2649,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __accumulate_sum(u64 n)
+static __always_inline u32 __accumulate_sum(u64 n)
 {
 	u32 contrib = 0;
 
-- 
1.7.9.5

* [PATCH v1 06/10] sched/fair: Optimize __update_sched_avg()
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (4 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 05/10] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 07/10] sched/fair: Remove useless 64-bit to 32-bit variable conversion Yuyang Du
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

__update_sched_avg() has these steps:

  1. add the remainder of the last incomplete period
  2. decay old sum
  3. accumulate new sum in full periods since last_update_time
  4. add the current incomplete period
  5. update averages

Previously, steps 1, 3, and 4 were computed separately, which made each
of them ugly in code and costly in overhead.

Illustration:

             c1          c3           c4
             ^           ^            ^
             |           |            |
           |<->|<----------------->|<--->|
   ... |---x---|------| ... |------|-----x (now)

c1, c3, and c4 are the accumulated (meanwhile decayed) contributions
in timing aspect of steps 1, 3, and 4 respectively.

With them, the accumulated contribution to load_sum, for example, is:

contrib = c1 * weight * freq_scaled;
contrib += c3 * weight * freq_scaled;
contrib += c4 * weight * freq_scaled;

Obviously, we can optimize the above by:

contrib = c1 + c3 + c4;
contrib *= weight * freq_scaled;

One issue is that c1 must now be decayed separately, as opposed to being
decayed together with the old sum in step 2. However, performance-wise
the two approaches should be on par if you compare __decay_sum() with a
round of contrib accumulation. Overall, we definitely save tens of
instructions, although the gain may not be observable in benchmarks;
code-wise, this approach is obviously much simpler.
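
A worked example with illustrative numbers (not from the patch): assume
period_contrib = 300 at last_update_time and a new delta of 2000 (in
~1us units). Then delta + period_contrib = 2300 crosses periods = 2
boundaries with a 252 remainder, and the three parts are:

  c1 = (1024 - 300) decayed by y^2   (rest of the old incomplete period)
  c3 = 1024*y                        (the one full period in between)
  c4 = 252                           (the new incomplete period)

so the single accumulation (c1 + c3 + c4) * weight * freq_scaled matches
what the old code computed with three separate multiply-and-accumulate
rounds.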

Code size (allyesconfig):

Before: kernel/sched/built-in.o 1031616
After : kernel/sched/built-in.o 1025504 (-6112B)

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  175 ++++++++++++++++++++++++++-------------------------
 1 file changed, 89 insertions(+), 86 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b10fb0..283e2c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2649,24 +2649,84 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static __always_inline u32 __accumulate_sum(u64 n)
+static __always_inline u32
+__accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)
 {
-	u32 contrib = 0;
+	u32 contrib;
+
+	if (!periods)
+		return remainder - period_contrib;
 
-	if (likely(n <= SCHED_AVG_HALFLIFE))
-		return __accumulated_sum_N[n];
-	else if (unlikely(n >= SCHED_AVG_MAX_N))
+	if (unlikely(periods >= SCHED_AVG_MAX_N))
 		return SCHED_AVG_MAX;
 
-	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
-	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
-	n %= SCHED_AVG_HALFLIFE;
-	contrib = __decay_sum(contrib, n);
-	return contrib + __accumulated_sum_N[n];
+	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
+
+	periods -= 1;
+	if (likely(periods <= SCHED_AVG_HALFLIFE))
+		contrib = __accumulated_sum_N[periods];
+	else {
+		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
+		periods %= SCHED_AVG_HALFLIFE;
+		contrib = __decay_sum(contrib, periods);
+		contrib += __accumulated_sum_N[periods];
+	}
+
+	return contrib + remainder;
 }
 
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
+static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
+	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
+{
+	u32 contrib;
+	u64 periods;
+	unsigned long scale_freq, scale_cpu;
+
+	scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+	delta += sa->period_contrib;
+	periods = delta >> 10; /* A period is 1024us (~1ms) */
+
+	/*
+	 * Accumulating *_sum has two steps.
+	 *
+	 * Step 1: decay old *_sum if we crossed period boundaries.
+	 */
+	if (periods) {
+		sa->load_sum = __decay_sum(sa->load_sum, periods);
+		if (cfs_rq) {
+			cfs_rq->runnable_load_sum =
+				__decay_sum(cfs_rq->runnable_load_sum, periods);
+		}
+		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
+	}
+
+	/*
+	 * Step 2: accumulate new *_sum since last_update_time. This at most has
+	 * three parts (at least one part): (1) remainder of incomplete last
+	 * period, (2) full periods since (1), and (3) incomplete current period.
+	 *
+	 * Fortunately, we can (and should) do all these three at once.
+	 */
+	delta %= 1024;
+	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
+	sa->period_contrib = delta;
+
+	contrib = cap_scale(contrib, scale_freq);
+	if (weight) {
+		sa->load_sum += weight * contrib;
+		if (cfs_rq)
+			cfs_rq->runnable_load_sum += weight * contrib;
+	}
+	if (running)
+		sa->util_sum += contrib * scale_cpu;
+
+	return periods;
+}
+
 /*
  * We can represent the historical contribution to sched average as the
  * coefficients of a geometric series.  To do this we divide the history
@@ -2699,10 +2759,7 @@ static __always_inline int
 __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
-	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	u64 delta;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2723,81 +2780,27 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
-	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->period_contrib;
-	if (delta + delta_w >= 1024) {
-		decayed = 1;
-
-		/* how much left for next period will start over, we don't know yet */
-		sa->period_contrib = 0;
-
-		/*
-		 * Now that we know we're crossing a period boundary, figure
-		 * out how much from delta we need to complete the current
-		 * period and accrue it.
-		 */
-		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
-		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
-			if (cfs_rq) {
-				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
-			}
-		}
-		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
-
-		delta -= delta_w;
-
-		/* Figure out how many additional periods this update spans */
-		periods = delta / 1024;
-		delta %= 1024;
-
-		sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_sum =
-				__decay_sum(cfs_rq->runnable_load_sum, periods + 1);
-		}
-		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);
-
-		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
-		contrib = __accumulate_sum(periods);
-		contrib = cap_scale(contrib, scale_freq);
-		if (weight) {
-			sa->load_sum += weight * contrib;
-			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
-		}
-		if (running)
-			sa->util_sum += contrib * scale_cpu;
-	}
-
-	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
-	if (weight) {
-		sa->load_sum += weight * scaled_delta;
-		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
-	}
-	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
-
-	sa->period_contrib += delta;
+	/*
+	 * Now we know we crossed measurement unit boundaries. The *_avg
+	 * accrues by two steps:
+	 *
+	 * Step 1: accumulate *_sum since last_update_time. If we haven't
+	 * crossed period boundaries, finish.
+	 */
+	if (!accumulate_sum(delta, sa, cfs_rq, cpu, weight, running))
+		return 0;
 
-	if (decayed) {
-		sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_avg =
-				div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
-		}
-		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
+	/*
+	 * Step 2: update *_avg.
+	 */
+	sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
+	if (cfs_rq) {
+		cfs_rq->runnable_load_avg =
+			div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
 	}
+	sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
 
-	return decayed;
+	return 1;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
1.7.9.5

* [PATCH v1 07/10] sched/fair: Remove useless 64-bit to 32-bit variable conversion
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (5 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 06/10] sched/fair: Optimize __update_sched_avg() Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-25  7:11   ` Vincent Guittot
  2016-08-10  0:14 ` [PATCH v1 08/10] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

In __decay_sum(), the 64-bit to 32-bit variable conversion serves no
performance or correctness purpose.

Minor cleanup and no functionality change.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 283e2c2..eea3349 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2616,16 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
  */
 static __always_inline u64 __decay_sum(u64 val, u64 n)
 {
-	unsigned int local_n;
-
 	if (!n)
 		return val;
 	else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
 		return 0;
 
-	/* after bounds checking we can collapse to 32-bit */
-	local_n = n;
-
 	/*
 	 * As y^HALFLIFE = 1/2, we can combine
 	 *    y^n = 1/2^(n/HALFLIFE) * y^(n%HALFLIFE)
@@ -2633,12 +2628,12 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
 	 *
	 * To achieve constant time __decay_sum.
 	 */
-	if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
-		val >>= local_n / SCHED_AVG_HALFLIFE;
-		local_n %= SCHED_AVG_HALFLIFE;
+	if (unlikely(n >= SCHED_AVG_HALFLIFE)) {
+		val >>= n / SCHED_AVG_HALFLIFE;
+		n %= SCHED_AVG_HALFLIFE;
 	}
 
-	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
+	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[n], 32);
 	return val;
 }
 
-- 
1.7.9.5

* [PATCH v1 08/10] sched/fair: Remove scale_load_down() for load_avg
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (6 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 07/10] sched/fair: Remove useless 64-bit to 32-bit variable conversion Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 09/10] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
down of the load does not make much sense, because load_avg is primarily
THE load, and on top of that we take the runnable time into account.

We therefore remove scale_load_down() for load_avg. But we need to
carefully consider the overflow risk if the load has the higher fixed
point range (2*SCHED_FIXEDPOINT_SHIFT). The only case in which this
change can introduce an overflow is a 64-bit kernel with the increased
fixed point range. In that case, the 64-bit load_sum can accommodate
4251057 (=2^64/47742/88761/1024) entities with the highest load
(=88761*1024) always runnable on one single cfs_rq, which should be
fine. Even if such an overflow did occur, the load average would not be
useful anyway under those conditions, and if the machine survives, the
load will correct itself very quickly, in no more than ~2 seconds
(=32ms*64).
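
A quick arithmetic check of the two bounds quoted here and in the
updated comment below:

  2^64 / 47742 / 88761        ~= 4353082796  (SCHED_FIXEDPOINT_SHIFT load range)
  2^64 / 47742 / 88761 / 1024 ~= 4251057     (2*SCHED_FIXEDPOINT_SHIFT load range)

where 47742 is SCHED_AVG_MAX and 88761 is the weight of the heaviest
(nice -20) task at the smaller fixed point range.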

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   19 ++++++++++++++-----
 kernel/sched/fair.c   |   11 +++++------
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 912830c..9abfc62 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1233,7 +1233,7 @@ struct load_weight {
  *
  * [load_avg definition]
  *
- *   load_avg = runnable% * scale_load_down(load)
+ *   load_avg = runnable% * load
  *
  * where runnable% is the time ratio that a sched_entity is runnable.
  * For cfs_rq, it is the aggregated load_avg of all runnable and
@@ -1241,7 +1241,7 @@ struct load_weight {
  *
  * load_avg may also take frequency scaling into account:
  *
- *   load_avg = runnable% * scale_load_down(load) * freq%
+ *   load_avg = runnable% * load * freq%
  *
  * where freq% is the CPU frequency normalized to the highest frequency.
  *
@@ -1267,9 +1267,18 @@ struct load_weight {
  *
  * [Overflow issue]
  *
- * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities
- * with the highest load (=88761), always runnable on a single cfs_rq,
- * and should not overflow as the number already hits PID_MAX_LIMIT.
+ * On 64bit kernel:
+ *
+ * When load has small fixed point range (SCHED_FIXEDPOINT_SHIFT), the
+ * 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
+ * the highest load (=88761) always runnable on a cfs_rq; this should
+ * not overflow as that number already exceeds PID_MAX_LIMIT.
+ *
+ * When load has large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
+ * with the highest load (=88761*1024) always runnable on ONE cfs_rq,
+ * we should be fine. Even if the overflow occurs at the end of day,
+ * at the time the load_avg won't be useful anyway in that situation.
  *
  * For all other cases (including 32-bit kernels), struct load_weight's
  * weight will overflow first before we do, because:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eea3349..033cf70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,7 +682,7 @@ void init_entity_sched_avg(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	sa->load_avg = se->load.weight;
 	sa->load_sum = sa->load_avg * SCHED_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
@@ -2953,7 +2953,7 @@ update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	}
 
 	decayed = __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+				     cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
@@ -2978,8 +2978,7 @@ static inline void update_sched_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_sched_avg(now, cpu, &se->avg,
-			   se->on_rq * scale_load_down(se->load.weight),
+	__update_sched_avg(now, cpu, &se->avg, se->on_rq * se->load.weight,
 			   cfs_rq->curr == se, NULL);
 
 	if (update_cfs_rq_sched_avg(now, cfs_rq, true) && update_tg)
@@ -3036,7 +3035,7 @@ skip_aging:
 static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			   &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			   &se->avg, se->on_rq * se->load.weight,
 			   cfs_rq->curr == se, NULL);
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
@@ -3058,7 +3057,7 @@ enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	migrated = !sa->last_update_time;
 	if (!migrated) {
 		__update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-				   se->on_rq * scale_load_down(se->load.weight),
+				   se->on_rq * se->load.weight,
 				   cfs_rq->curr == se, NULL);
 	}
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v1 09/10] sched/fair: Rename scale_load() and scale_load_down()
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (7 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 08/10] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-10  0:14 ` [PATCH v1 10/10] sched/fair: Implement flat hierarchical structure for util_avg Yuyang Du
  2016-08-10  0:23 ` [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
  10 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

Rename scale_load() and scale_load_down() to user_to_kernel_load()
and kernel_to_user_load() respectively. The new names make the
direction of the conversion explicit and avoid confusion.
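
A minimal user-space sketch of the renamed conversions (illustration only,
assuming the 64bit configuration with SCHED_FIXEDPOINT_SHIFT = 10 and the
nice-0 user weight of 1024):

  #include <stdio.h>

  #define SCHED_FIXEDPOINT_SHIFT          10
  #define NICE_0_LOAD                     (1L << (2 * SCHED_FIXEDPOINT_SHIFT))
  #define user_to_kernel_load(w)          ((w) << SCHED_FIXEDPOINT_SHIFT)
  #define kernel_to_user_load(w)          ((w) >> SCHED_FIXEDPOINT_SHIFT)

  int main(void)
  {
          long nice_0_weight = 1024;      /* user-visible weight of nice 0 */

          /* the two invariants the conversions must keep */
          printf("%d\n", user_to_kernel_load(nice_0_weight) == NICE_0_LOAD);
          printf("%ld\n", kernel_to_user_load(NICE_0_LOAD));     /* 1024 */
          return 0;
  }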

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    8 ++++----
 kernel/sched/fair.c  |   13 +++++++++----
 kernel/sched/sched.h |   16 ++++++++--------
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30d429b..91fe97f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -735,12 +735,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (idle_policy(p->policy)) {
-		load->weight = scale_load(WEIGHT_IDLEPRIO);
+		load->weight = user_to_kernel_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
+	load->weight = user_to_kernel_load(sched_prio_to_weight[prio]);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -8290,7 +8290,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	return sched_group_set_shares(css_tg(css), user_to_kernel_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -8298,7 +8298,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (u64) kernel_to_user_load(tg->shares);
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 033cf70..1d4640c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
 	if (likely(lw->inv_weight))
 		return;
 
-	w = scale_load_down(lw->weight);
+	w = kernel_to_user_load(lw->weight);
 
 	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 		lw->inv_weight = 1;
@@ -210,10 +210,14 @@ static void __update_inv_weight(struct load_weight *lw)
  *
  * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
  * weight/lw.weight <= 1, and therefore our shift will also be positive.
+ *
+ * Note load.weight falls back to user load scale (i.e., NICE_0's load is
+ * 1024), instead of possibly increased kernel load scale (i.e., NICE_0's
+ * load is NICE_0_LOAD) due to multiplication and division efficiency.
  */
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 {
-	u64 fact = scale_load_down(weight);
+	u64 fact = kernel_to_user_load(weight);
 	int shift = WMULT_SHIFT;
 
 	__update_inv_weight(lw);
@@ -2509,7 +2513,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
 	 * the shares for small weight interactive tasks.
 	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	load = kernel_to_user_load(cfs_rq->load.weight);
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 
@@ -8714,7 +8718,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, user_to_kernel_load(MIN_SHARES),
+		       user_to_kernel_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 132a0fa..4723d4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -57,22 +57,22 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  */
 #ifdef CONFIG_64BIT
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
+# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		(w)
-# define scale_load_down(w)	(w)
+# define user_to_kernel_load(w)	(w)
+# define kernel_to_user_load(w)	(w)
 #endif
 
 /*
  * Task weight (visible to users) and its load (invisible to users) have
  * independent resolution, but they should be well calibrated. We use
- * scale_load() and scale_load_down(w) to convert between them. The
- * following must be true:
- *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them. The following must be true:
  *
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v1 10/10] sched/fair: Implement flat hierarchical structure for util_avg
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (8 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 09/10] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
@ 2016-08-10  0:14 ` Yuyang Du
  2016-08-10  0:23 ` [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
  10 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:14 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti, Yuyang Du

The util_avg is updated along the task group hierarchy, but the util_avgs
of all group entities and cfs_rqs except the top cfs_rq are never used and
are therefore needless. More importantly, the top cfs_rq's util_avg does
not effectively reflect the migration of a group task, because the task's
util_avg is accounted to its direct cfs_rq and only slowly trickles up to
the top cfs_rq.

So this patch proposes a flat hierarchy for util_avg - all cfs tasks
affiliated with a rq are treated as flat, disregarding their task group
levels, and the rq's util is therefore the sum of the utils of all its
cfs tasks.

Regarding overhead, the rq's util updates may be more costly because they
can be very frequent, but we no longer update any group entity's util, so
the net overhead should even out.
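
As a purely illustrative user-space toy (made-up names, not the kernel
code), the point is that each task's util contributes directly to its rq,
regardless of how deeply the task is nested in task groups, so a migration
moves the whole contribution at once:

  #include <stdio.h>

  #define NR_TASKS 3

  struct toy_task {
          int cpu;                        /* which rq the task is on */
          unsigned long util_avg;
  };

  static unsigned long toy_rq_util(struct toy_task *t, int cpu)
  {
          unsigned long sum = 0;
          int i;

          /* flat: sum the tasks on this rq, ignoring task group levels */
          for (i = 0; i < NR_TASKS; i++)
                  if (t[i].cpu == cpu)
                          sum += t[i].util_avg;
          return sum;
  }

  int main(void)
  {
          struct toy_task tasks[NR_TASKS] = { {0, 256}, {0, 128}, {1, 512} };

          printf("%lu %lu\n", toy_rq_util(tasks, 0), toy_rq_util(tasks, 1));
                                          /* 384 512 */
          tasks[1].cpu = 1;               /* "migrate" a task to CPU1 */
          printf("%lu %lu\n", toy_rq_util(tasks, 0), toy_rq_util(tasks, 1));
                                          /* 256 640 */
          return 0;
  }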

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/debug.c |   13 +++--
 kernel/sched/fair.c  |  140 +++++++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h |    5 +-
 4 files changed, 101 insertions(+), 58 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 91fe97f9..d215d04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7475,6 +7475,7 @@ void __init sched_init(void)
 #ifdef CONFIG_NO_HZ_FULL
 		rq->last_sched_tick = 0;
 #endif
+		atomic_long_set(&rq->removed_util_avg, 0);
 #endif /* CONFIG_SMP */
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2a0a999..14dc121 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -395,7 +395,6 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
-	P(se->avg.util_avg);
 #endif
 #undef PN
 #undef P
@@ -462,6 +461,14 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 		print_task(m, rq, p);
 	}
 	rcu_read_unlock();
+
+#ifdef CONFIG_SMP
+	SEQ_printf(m, "\nutilization: \n");
+	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
+			rq->avg.util_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed_util_avg",
+			atomic_long_read(&rq->removed_util_avg));
+#endif
 }
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
@@ -510,12 +517,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
-	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
-			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed_load_avg",
 			atomic_long_read(&cfs_rq->removed_load_avg));
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed_util_avg",
-			atomic_long_read(&cfs_rq->removed_util_avg));
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d4640c..b514129 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -703,47 +703,30 @@ static void attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *
 
 /*
  * With new tasks being created, their initial util_avgs are extrapolated
- * based on the cfs_rq's current util_avg:
+ * based on the rq's current util_avg. To make the util_avg converge, we
+ * cap the util_avg of successive tasks to only 1/2 of the left utilization
+ * budget:
  *
- *   util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
- *
- * However, in many cases, the above util_avg does not give a desired
- * value. Moreover, the sum of the util_avgs may be divergent, such
- * as when the series is a harmonic series.
- *
- * To solve this problem, we also cap the util_avg of successive tasks to
- * only 1/2 of the left utilization budget:
- *
- *   util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
+ *   util_avg = (1024 - rq->avg.util_avg) / 2^n
  *
  * where n denotes the nth task.
  *
  * For example, a simplest series from the beginning would be like:
  *
- *  task  util_avg: 512, 256, 128,  64,  32,   16,    8, ...
- * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
- *
- * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
- * if util_avg > util_avg_cap.
+ *  task util_avg: 512, 256, 128,  64,  32,   16,    8, ...
+ *    rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
  */
 void post_init_entity_sched_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct rq *rq = rq_of(cfs_rq);
 	struct sched_avg *sa = &se->avg;
-	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
+	long cap = (long)(SCHED_CAPACITY_SCALE - rq->avg.util_avg) / 2;
 	u64 now = cfs_rq_clock_task(cfs_rq);
 	int tg_update;
 
 	if (cap > 0) {
-		if (cfs_rq->avg.util_avg != 0) {
-			sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
-			sa->util_avg /= (cfs_rq->avg.load_avg + 1);
-
-			if (sa->util_avg > cap)
-				sa->util_avg = cap;
-		} else {
-			sa->util_avg = cap;
-		}
+		sa->util_avg = cap;
 		sa->util_sum = sa->util_avg * SCHED_AVG_MAX;
 	}
 
@@ -2676,8 +2659,45 @@ __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)
 
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
-static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
-	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
+static __always_inline void
+__update_rq_util_avg(struct rq *rq, unsigned long scale_freq, unsigned long scale_cpu)
+{
+	u32 contrib;
+	struct sched_avg *sa = &rq->avg;
+	u64 delta, periods, now = rq_clock_task(rq);
+
+	/*
+	 * We have new delta (in ns unit) and periods (in ms unit).
+	 */
+	delta = (now - sa->last_update_time) >> 10;
+	if (!delta)
+		return;
+	sa->last_update_time = now;
+
+	delta += sa->period_contrib;
+	periods = delta >> 10;
+
+	/* Step 1: decay */
+	if (periods)
+		sa->util_sum = __decay_sum(sa->util_sum, periods);
+
+	/* Step 2: accumulate */
+	delta %= 1024;
+	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
+	sa->period_contrib = delta;
+
+	contrib = cap_scale(contrib, scale_freq);
+	if (rq->cfs.curr != NULL) /* new running */
+		sa->util_sum += contrib * scale_cpu;
+
+	/* Step 3: update avg */
+	if (periods)
+		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
+}
+
+static __always_inline u32
+accumulate_sum(u64 delta, struct sched_avg *sa, struct cfs_rq *cfs_rq, int cpu,
+	       unsigned long weight, int running, int update_util)
 {
 	u32 contrib;
 	u64 periods;
@@ -2696,11 +2716,11 @@ static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = __decay_sum(sa->load_sum, periods);
-		if (cfs_rq) {
+		if (cfs_rq)
 			cfs_rq->runnable_load_sum =
 				__decay_sum(cfs_rq->runnable_load_sum, periods);
-		}
-		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
+		else if (update_util)
+			sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
 	}
 
 	/*
@@ -2720,9 +2740,12 @@ static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
 		if (cfs_rq)
 			cfs_rq->runnable_load_sum += weight * contrib;
 	}
-	if (running)
+	if (running && update_util)
 		sa->util_sum += contrib * scale_cpu;
 
+	if (cfs_rq)
+		__update_rq_util_avg(rq_of(cfs_rq), scale_freq, scale_cpu);
+
 	return periods;
 }
 
@@ -2759,6 +2782,7 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta;
+	int update_util = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2780,24 +2804,30 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->last_update_time = now;
 
 	/*
+	 * We update util_sum together with load_sum iff it is a task
+	 */
+	if (!cfs_rq && entity_is_task(container_of(sa, struct sched_entity, avg)))
+		update_util = 1;
+
+	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
 	 * accrues by two steps:
 	 *
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, cfs_rq, cpu, weight, running))
+	if (!accumulate_sum(delta, sa, cfs_rq, cpu, weight, running, update_util))
 		return 0;
 
 	/*
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
-	if (cfs_rq) {
+	if (cfs_rq)
 		cfs_rq->runnable_load_avg =
 			div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
-	}
-	sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
+	else if (update_util)
+		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
 
 	return 1;
 }
@@ -2897,8 +2927,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 		 *
 		 * See cpu_util().
 		 */
-		cpufreq_update_util(rq_clock(rq),
-				    min(cfs_rq->avg.util_avg, max), max);
+		cpufreq_update_util(rq_clock(rq), min(rq->avg.util_avg, max), max);
 	}
 }
 
@@ -2941,18 +2970,21 @@ update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed, removed_load = 0, removed_util = 0;
+	struct rq *rq = rq_of(cfs_rq);
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
+
 		sub_positive(&sa->load_avg, r);
 		sub_positive(&sa->load_sum, r * SCHED_AVG_MAX);
 		removed_load = 1;
 	}
 
-	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
-		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sub_positive(&sa->util_avg, r);
-		sub_positive(&sa->util_sum, r * SCHED_AVG_MAX);
+	if (atomic_long_read(&rq->removed_util_avg)) {
+		long r = atomic_long_xchg(&rq->removed_util_avg, 0);
+
+		sub_positive(&rq->avg.util_avg, r);
+		sub_positive(&rq->avg.util_sum, r * SCHED_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2999,6 +3031,8 @@ static inline void update_sched_avg(struct sched_entity *se, int update_tg)
  */
 static void attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	struct rq *rq = rq_of(cfs_rq);
+
 	if (!sched_feat(ATTACH_AGE_LOAD))
 		goto skip_aging;
 
@@ -3022,8 +3056,8 @@ skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;
-	cfs_rq->avg.util_avg += se->avg.util_avg;
-	cfs_rq->avg.util_sum += se->avg.util_sum;
+	rq->avg.util_avg += se->avg.util_avg;
+	rq->avg.util_sum += se->avg.util_sum;
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3038,14 +3072,16 @@ skip_aging:
  */
 static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	struct rq *rq = rq_of(cfs_rq);
+
 	__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
 			   &se->avg, se->on_rq * se->load.weight,
 			   cfs_rq->curr == se, NULL);
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
-	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
-	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	sub_positive(&rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3117,6 +3153,7 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
 static void remove_entity_sched_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct rq *rq = rq_of(cfs_rq);
 	u64 last_update_time;
 
 	/*
@@ -3134,7 +3171,7 @@ static void remove_entity_sched_avg(struct sched_entity *se)
 	__update_sched_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
 			   &se->avg, 0, 0, NULL);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
-	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
+	atomic_long_add(se->avg.util_avg, &rq->removed_util_avg);
 }
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
@@ -5332,7 +5369,7 @@ done:
  * compare the utilization with the capacity of the CPU that is available for
  * CFS task (ie cpu_capacity).
  *
- * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * rq->avg.util_avg is the sum of running time of runnable tasks plus the
  * recent utilization of currently non-runnable tasks on a CPU. It represents
  * the amount of utilization of a CPU in the range [0..capacity_orig] where
  * capacity_orig is the cpu_capacity available at the highest frequency
@@ -5341,9 +5378,9 @@ done:
  * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
  * the running time on this CPU scaled by capacity_curr.
  *
- * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * Nevertheless, rq->avg.util_avg can be higher than capacity_curr or even
  * higher than capacity_orig because of unfortunate rounding in
- * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * rq->avg.util_avg or just after migrating tasks and new task wakeups until
  * the average stabilizes with the new running time. We need to check that the
  * utilization stays within the range of [0..capacity_orig] and cap it if
  * necessary. Without utilization capping, a group could be seen as overloaded
@@ -5354,7 +5391,7 @@ done:
  */
 static int cpu_util(int cpu)
 {
-	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+	unsigned long util = cpu_rq(cpu)->avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
 
 	return (util >= capacity) ? capacity : util;
@@ -8533,7 +8570,6 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #endif
 #ifdef CONFIG_SMP
 	atomic_long_set(&cfs_rq->removed_load_avg, 0);
-	atomic_long_set(&cfs_rq->removed_util_avg, 0);
 #endif
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4723d4a..f6785c1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -398,7 +398,7 @@ struct cfs_rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
 #endif
-	atomic_long_t removed_load_avg, removed_util_avg;
+	atomic_long_t removed_load_avg;
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif
@@ -662,6 +662,9 @@ struct rq {
 
 	/* This is used to determine avg_idle's max value */
 	u64 max_idle_balance_cost;
+
+	struct sched_avg avg;
+	atomic_long_t removed_util_avg;
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread
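
The post_init_entity_sched_avg() cap series quoted in the patch above can
be reproduced with a few lines of user-space C (illustration only):

  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE    1024

  int main(void)
  {
          long rq_util = 0;
          int i;

          /* each new task gets half of the remaining utilization budget */
          for (i = 1; i <= 7; i++) {
                  long cap = (SCHED_CAPACITY_SCALE - rq_util) / 2;

                  rq_util += cap;
                  printf("task %d: util_avg=%ld rq util_avg=%ld\n",
                         i, cap, rq_util);
          }
          /* prints 512/256/128/... and 512/768/896/..., as in the comment */
          return 0;
  }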

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
                   ` (9 preceding siblings ...)
  2016-08-10  0:14 ` [PATCH v1 10/10] sched/fair: Implement flat hierarchical structure for util_avg Yuyang Du
@ 2016-08-10  0:23 ` Yuyang Du
  2016-08-22 23:26   ` Yuyang Du
  10 siblings, 1 reply; 30+ messages in thread
From: Yuyang Du @ 2016-08-10  0:23 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti

[-- Attachment #1: Type: text/plain, Size: 1143 bytes --]

On Wed, Aug 10, 2016 at 08:14:45AM +0800, Yuyang Du wrote:
> Hi Peter,
> 
> I should have sent out my flat util hierarchy implementation long time ago,
> actually code was there but not rebased. I finally have time to do this,
> so here it is. There are also other proposals to solve migrated tasks' util
> mobility problem, such as the ones from Dietmar and Vincent.
> 
> The sched avgs computation optimization was initiated for the flat util thing,
> so I send them out together.
> 
> According to Morten and Ben's feedback, I removed 32-bit as a period's upper
> bound limit. So, thanks a lot to them.
> 

To compare the effectiveness of the flat util hierarchy, a simple experiment
was done: rt-app to generate a 50% duty-cycling workload (100us/200us), and
in the meantime a script to set the CPU affinity of the task, alternating
to taskset the task to run on CPU x and CPU y every 0.3 sec, so forcing the
task to ping-pong migrate. By auto-group and ssh itself, the task under test
is at the third level task group.

So compare the top cfs_rq's util_avg (group_hierarchy.jpg) vs. the rq's util_avg
(flat_hierarchy.jpg)

Thanks,
Yuyang

[-- Attachment #2: group_hierarchy.jpg --]
[-- Type: image/jpeg, Size: 48406 bytes --]

[-- Attachment #3: flat_hierarchy.jpg --]
[-- Type: image/jpeg, Size: 32646 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 02/10] documentation: Add scheduler/sched-avg.txt
  2016-08-10  0:14 ` [PATCH v1 02/10] documentation: Add scheduler/sched-avg.txt Yuyang Du
@ 2016-08-22 16:48   ` Randy Dunlap
  0 siblings, 0 replies; 30+ messages in thread
From: Randy Dunlap @ 2016-08-22 16:48 UTC (permalink / raw)
  To: Yuyang Du, peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti

On 08/09/16 17:14, Yuyang Du wrote:
> This doc file has the program to generate the constants to compute
> sched averages.
> 
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  Documentation/scheduler/sched-avg.txt |   94 +++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-avg.txt

I'll confirm that it builds and generates a buildable source file
as long as "u32" is understood (defined/typedef'd).

I would rather see it in tools/ as a C source file with comments as
follows:

// The following program is used to generate the constants for
// computing sched averages.

//	C program (compile with -lm)

> +#include <math.h>
> +#include <stdio.h>

Maybe even add
#include <stdint.h>

and
typedef uint32_t	u32;


thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-10  0:23 ` [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
@ 2016-08-22 23:26   ` Yuyang Du
  2016-08-23 13:28     ` Vincent Guittot
  0 siblings, 1 reply; 30+ messages in thread
From: Yuyang Du @ 2016-08-22 23:26 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, umgwanakikbuti

Hi Peter and others,

Could you give this patchset a look?

Thanks,
Yuyang

On Wed, Aug 10, 2016 at 08:23:52AM +0800, Yuyang Du wrote:
> On Wed, Aug 10, 2016 at 08:14:45AM +0800, Yuyang Du wrote:
> > Hi Peter,
> > 
> > I should have sent out my flat util hierarchy implementation long time ago,
> > actually code was there but not rebased. I finally have time to do this,
> > so here it is. There are also other proposals to solve migrated tasks' util
> > mobility problem, such as the ones from Dietmar and Vincent.
> > 
> > The sched avgs computation optimization was initiated for the flat util thing,
> > so I send them out together.
> > 
> > According to Morten and Ben's feedback, I removed 32-bit as a period's upper
> > bound limit. So, thanks a lot to them.
> > 
> 
> To compare the effectiveness of the flat util hierarchy, a simple experiment
> was done: rt-app to generate a 50% duty-cycling workload (100us/200us), and
> in the meantime a script to set the CPU affinity of the task, alternating
> to taskset the task to run on CPU x and CPU y every 0.3 sec, so forcing the
> task to ping-pong migrate. By auto-group and ssh itself, the task under test
> is at the third level task group.
> 
> So compare the top cfs_rq's util_avg (group_hierarchy.jpg) vs. the rq's util_avg
> (flat_hierarchy.jpg)
> 
> Thanks,
> Yuyang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-22 23:26   ` Yuyang Du
@ 2016-08-23 13:28     ` Vincent Guittot
  2016-08-23 14:13       ` Peter Zijlstra
  2016-08-23 19:01       ` Yuyang Du
  0 siblings, 2 replies; 30+ messages in thread
From: Vincent Guittot @ 2016-08-23 13:28 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On 23 August 2016 at 01:26, Yuyang Du <yuyang.du@intel.com> wrote:
>
> Hi Peter and others,
>
> Could you give this patchset a look?

Hi Yuyang,

I still wonder if using a flat util hierarchy is the right solution to
solve this problem with utilization and task group. I have noticed
exact same issues with load that generates weird task placement
decision and i think that we should probably try to solve both wrong
behavior with same mechanism. but this is not possible with flat
hierarchy for load

Let me take an example.
TA is a always running task on CPU1 in group /root/level1/
TB wakes up on CPU0 and moves TA into group /root/level2/
Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.
Then, TB forks a new task TC. TC will probably be schedule on CPU1
because its root cfs_rq's runnable_load_avg is null and CPU1 is the
next CPU after CPU0

Similar behavior can happen when TA migrates

Beside flat utilization consideration, i'm going to have a look at the
optimization part

Regards,
Vincent

>
>
> Thanks,
> Yuyang
>
> On Wed, Aug 10, 2016 at 08:23:52AM +0800, Yuyang Du wrote:
> > On Wed, Aug 10, 2016 at 08:14:45AM +0800, Yuyang Du wrote:
> > > Hi Peter,
> > >
> > > I should have sent out my flat util hierarchy implementation long time ago,
> > > actually code was there but not rebased. I finally have time to do this,
> > > so here it is. There are also other proposals to solve migrated tasks' util
> > > mobility problem, such as the ones from Dietmar and Vincent.
> > >
> > > The sched avgs computation optimization was initiated for the flat util thing,
> > > so I send them out together.
> > >
> > > According to Morten and Ben's feedback, I removed 32-bit as a period's upper
> > > bound limit. So, thanks a lot to them.
> > >
> >
> > To compare the effectiveness of the flat util hierarchy, a simple experiment
> > was done: rt-app to generate a 50% duty-cycling workload (100us/200us), and
> > in the meantime a script to set the CPU affinity of the task, alternating
> > to taskset the task to run on CPU x and CPU y every 0.3 sec, so forcing the
> > task to ping-pong migrate. By auto-group and ssh itself, the task under test
> > is at the third level task group.
> >
> > So compare the top cfs_rq's util_avg (group_hierarchy.jpg) vs. the rq's util_avg
> > (flat_hierarchy.jpg)
> >
> > Thanks,
> > Yuyang
>
>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-23 13:28     ` Vincent Guittot
@ 2016-08-23 14:13       ` Peter Zijlstra
  2016-08-23 14:45         ` Vincent Guittot
  2016-08-23 19:09         ` Yuyang Du
  2016-08-23 19:01       ` Yuyang Du
  1 sibling, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2016-08-23 14:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
> I still wonder if using a flat util hierarchy is the right solution to
> solve this problem with utilization and task group. I have noticed
> exact same issues with load that generates weird task placement
> decision and i think that we should probably try to solve both wrong
> behavior with same mechanism. but this is not possible with flat
> hierarchy for load
> 
> Let me take an example.
> TA is a always running task on CPU1 in group /root/level1/
> TB wakes up on CPU0 and moves TA into group /root/level2/
> Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.

Because while we migrate the load_avg on /root/level2, we do not
propagate the load_avg up the hierarchy?

And always propagating everyrthing up will indeed also fix the
utilization issue.

Of course, doing that propagation has its costs..

Didn't you post a patch doing just this a while ago?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-23 14:13       ` Peter Zijlstra
@ 2016-08-23 14:45         ` Vincent Guittot
  2016-08-23 15:39           ` Dietmar Eggemann
  2016-08-24  8:54           ` Morten Rasmussen
  2016-08-23 19:09         ` Yuyang Du
  1 sibling, 2 replies; 30+ messages in thread
From: Vincent Guittot @ 2016-08-23 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On 23 August 2016 at 16:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
>> I still wonder if using a flat util hierarchy is the right solution to
>> solve this problem with utilization and task group. I have noticed
>> exact same issues with load that generates weird task placement
>> decision and i think that we should probably try to solve both wrong
>> behavior with same mechanism. but this is not possible with flat
>> hierarchy for load
>>
>> Let me take an example.
>> TA is a always running task on CPU1 in group /root/level1/
>> TB wakes up on CPU0 and moves TA into group /root/level2/
>> Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.
>
> Because while we migrate the load_avg on /root/level2, we do not
> propagate the load_avg up the hierarchy?

yes. At now, the load of a cfs_rq and the load of its sched_entity
that represents it at parent level are disconnected

>
> And always propagating everyrthing up will indeed also fix the
> utilization issue.
>
> Of course, doing that propagation has its costs..

yes, that's the counterpart

>
> Didn't you post a patch doing just this a while ago?

My patch was doing that but only for utilization and i have start to
work on adding the propagation of load as well

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-23 14:45         ` Vincent Guittot
@ 2016-08-23 15:39           ` Dietmar Eggemann
  2016-08-29  1:37             ` Yuyang Du
  2016-08-24  8:54           ` Morten Rasmussen
  1 sibling, 1 reply; 30+ messages in thread
From: Dietmar Eggemann @ 2016-08-23 15:39 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Matt Fleming, Mike Galbraith

On 23/08/16 15:45, Vincent Guittot wrote:
> On 23 August 2016 at 16:13, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
>>> I still wonder if using a flat util hierarchy is the right solution to
>>> solve this problem with utilization and task group. I have noticed
>>> exact same issues with load that generates weird task placement
>>> decision and i think that we should probably try to solve both wrong
>>> behavior with same mechanism. but this is not possible with flat
>>> hierarchy for load
>>>
>>> Let me take an example.
>>> TA is a always running task on CPU1 in group /root/level1/
>>> TB wakes up on CPU0 and moves TA into group /root/level2/
>>> Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.
>>
>> Because while we migrate the load_avg on /root/level2, we do not
>> propagate the load_avg up the hierarchy?
> 
> yes. At now, the load of a cfs_rq and the load of its sched_entity
> that represents it at parent level are disconnected

I guess you say 'disconnected' because cfs_rq and se (w/ cfs_rq eq.
se->my_q) are now independent pelt signals whereas before the rewrite
they were 'connected' for load via __update_tg_runnable_avg(),
__update_group_entity_contrib() in __update_entity_load_avg_contrib()
and for utilization via 'se->avg.utilization_avg_contrib =
group_cfs_rq(se)->utilization_load_avg' in
__update_entity_utilization_avg_contrib().

IMHO, there was also this 'connection' between se and cfs_rq (w/
se->cfs_rq eq. cfs_rq) in update_entity_load_avg(, update_cfs_rq = 1)
which guaranteed that the change in the se was immediately visible on
the cfs_rq representing the parent task group.

 if (se->on_rq) {
   cfs_rq->runnable_load_avg += contrib_delta;
   cfs_rq->utilization_load_avg += utilization_delta;
 }

I guess, these two things somehow belonged together to achieve this
load/util propagation.

>> And always propagating everyrthing up will indeed also fix the
>> utilization issue.
>>
>> Of course, doing that propagation has its costs..
> 
> yes, that's the counterpart
> 
>>
>> Didn't you post a patch doing just this a while ago?
> 
> My patch was doing that but only for utilization and i have start to
> work on adding the propagation of load as well
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-23 13:28     ` Vincent Guittot
  2016-08-23 14:13       ` Peter Zijlstra
@ 2016-08-23 19:01       ` Yuyang Du
  1 sibling, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-23 19:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

Hi Vincent,

On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
> I still wonder if using a flat util hierarchy is the right solution to
> solve this problem with utilization and task group. I have noticed
> exact same issues with load that generates weird task placement
> decision and i think that we should probably try to solve both wrong
> behavior with same mechanism. but this is not possible with flat
> hierarchy for load

I agree both util and load have the same hierarchical propagation
problem.

But util and load are different with respect to how a task group is
distributed among CPUs and along the hierarchy. Util is "fundamentally"
flat (a CPU's util is just the sum of its tasks' utils), so it's pretty
natural as well as simple to implement a flat util hierarchy. And because
of that, propagating util up the hierarchy seems unnecessary to me.

It might be better to have one converged mechanism to solve both, but
it shouldn't be necessary. Right?

> Let me take an example.
> TA is a always running task on CPU1 in group /root/level1/
> TB wakes up on CPU0 and moves TA into group /root/level2/
> Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.
> Then, TB forks a new task TC. TC will probably be schedule on CPU1
> because its root cfs_rq's runnable_load_avg is null and CPU1 is the
> next CPU after CPU0
> 
> Similar behavior can happen when TA migrates
> 
> Beside flat utilization consideration, i'm going to have a look at the

Many thanks.

Yuyang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-23 14:13       ` Peter Zijlstra
  2016-08-23 14:45         ` Vincent Guittot
@ 2016-08-23 19:09         ` Yuyang Du
  1 sibling, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-23 19:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On Tue, Aug 23, 2016 at 04:13:41PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
> > I still wonder if using a flat util hierarchy is the right solution to
> > solve this problem with utilization and task group. I have noticed
> > exact same issues with load that generates weird task placement
> > decision and i think that we should probably try to solve both wrong
> > behavior with same mechanism. but this is not possible with flat
> > hierarchy for load
> > 
> > Let me take an example.
> > TA is a always running task on CPU1 in group /root/level1/
> > TB wakes up on CPU0 and moves TA into group /root/level2/
> > Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.
> 
> Because while we migrate the load_avg on /root/level2, we do not
> propagate the load_avg up the hierarchy?
> 
> And always propagating everyrthing up will indeed also fix the
> utilization issue.

Yes, but for util it's actually irrespective of the number of
hierarchical levels; just propagating directly to the top cfs_rq,
or simply the rq, will do. In other words, it's flat :)
 
> Of course, doing that propagation has its costs..
> 
> Didn't you post a patch doing just this a while ago?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-23 14:45         ` Vincent Guittot
  2016-08-23 15:39           ` Dietmar Eggemann
@ 2016-08-24  8:54           ` Morten Rasmussen
  2016-08-24  9:48             ` Vincent Guittot
  2016-08-29 19:00             ` Yuyang Du
  1 sibling, 2 replies; 30+ messages in thread
From: Morten Rasmussen @ 2016-08-24  8:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Yuyang Du, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On Tue, Aug 23, 2016 at 04:45:57PM +0200, Vincent Guittot wrote:
> On 23 August 2016 at 16:13, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
> >> I still wonder if using a flat util hierarchy is the right solution to
> >> solve this problem with utilization and task group. I have noticed
> >> exact same issues with load that generates weird task placement
> >> decision and i think that we should probably try to solve both wrong
> >> behavior with same mechanism. but this is not possible with flat
> >> hierarchy for load
> >>
> >> Let me take an example.
> >> TA is a always running task on CPU1 in group /root/level1/
> >> TB wakes up on CPU0 and moves TA into group /root/level2/
> >> Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.
> >
> > Because while we migrate the load_avg on /root/level2, we do not
> > propagate the load_avg up the hierarchy?
> 
> yes. At now, the load of a cfs_rq and the load of its sched_entity
> that represents it at parent level are disconnected
> 
> >
> > And always propagating everyrthing up will indeed also fix the
> > utilization issue.
> >
> > Of course, doing that propagation has its costs..
> 
> yes, that's the counterpart
> 
> >
> > Didn't you post a patch doing just this a while ago?
> 
> My patch was doing that but only for utilization and i have start to
> work on adding the propagation of load as well

As Dietmar mentioned already, the 'disconnect' is a feature of the PELT
rewrite. Paul and Ben's original implementation had full propagation up
and down the hierarchy. IIRC, one of the key points of the rewrite was
more 'stable' signals, which we would lose by re-introducing immediate
updates throughout hierarchy.

It is a significant change to group scheduling, so I'm a bit surprised
that nobody has observed any problems post the rewrite. But maybe most
users don't care about the load-balance being slightly off when tasks
have migrated or new tasks are added to a group.

If we want to re-introduce propagation of both load and utilization I
would suggest that we just look at the original implementation. It
seemed to work.

Handling utilization and load differently will inevitably result in more
code. The 'flat hierarchy' approach seems slightly less complicated, but
it prevents us from using group utilization later should we wish to do
so. It might for example become useful for the schedutil cpufreq
governor should it ever consider selecting frequencies differently based
on whether the current task is in a (specific) group or not.

Morten

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-24  8:54           ` Morten Rasmussen
@ 2016-08-24  9:48             ` Vincent Guittot
  2016-08-29 19:00             ` Yuyang Du
  1 sibling, 0 replies; 30+ messages in thread
From: Vincent Guittot @ 2016-08-24  9:48 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Yuyang Du, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On 24 August 2016 at 10:54, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Aug 23, 2016 at 04:45:57PM +0200, Vincent Guittot wrote:
>> On 23 August 2016 at 16:13, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
>> >> I still wonder if using a flat util hierarchy is the right solution to
>> >> solve this problem with utilization and task group. I have noticed
>> >> exact same issues with load that generates weird task placement
>> >> decision and i think that we should probably try to solve both wrong
>> >> behavior with same mechanism. but this is not possible with flat
>> >> hierarchy for load
>> >>
>> >> Let me take an example.
>> >> TA is a always running task on CPU1 in group /root/level1/
>> >> TB wakes up on CPU0 and moves TA into group /root/level2/
>> >> Even if TA stays on CPU1, runnable_load_avg of CPU1 root cfs rq will become 0.
>> >
>> > Because while we migrate the load_avg on /root/level2, we do not
>> > propagate the load_avg up the hierarchy?
>>
>> yes. At now, the load of a cfs_rq and the load of its sched_entity
>> that represents it at parent level are disconnected
>>
>> >
>> > And always propagating everyrthing up will indeed also fix the
>> > utilization issue.
>> >
>> > Of course, doing that propagation has its costs..
>>
>> yes, that's the counterpart
>>
>> >
>> > Didn't you post a patch doing just this a while ago?
>>
>> My patch was doing that but only for utilization and i have start to
>> work on adding the propagation of load as well
>
> As Dietmar mentioned already, the 'disconnect' is a feature of the PELT
> rewrite. Paul and Ben's original implementation had full propagation up
> and down the hierarchy. IIRC, one of the key points of the rewrite was
> more 'stable' signals, which we would lose by re-introducing immediate
> updates throughout hierarchy.
>
> It is a significant change to group scheduling, so I'm a bit surprised
> that nobody has observed any problems post the rewrite. But maybe most
> users don't care about the load-balance being slightly off when tasks
> have migrated or new tasks are added to a group.
>
> If we want to re-introduce propagation of both load and utilization I
> would suggest that we just look at the original implementation. It
> seemed to work.

The previous implementation was propagating the changes of load and
utilization of an entity for most of its update whereas we only need
to sync when entity is attached/detached or migrated

>
> Handling utilization and load differently will inevitably result in more
> code. The 'flat hierarchy' approach seems slightly less complicated, but
> it prevents us from using group utilization later should we wish to do
> so. It might for example become useful for the schedutil cpufreq
> governor should it ever consider selecting frequencies differently based
> on whether the current task is in a (specific) group or not.
>
> Morten

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 01/10] sched/fair: Chance LOAD_AVG_MAX_N from 345 to 347
  2016-08-10  0:14 ` [PATCH v1 01/10] sched/fair: Chance LOAD_AVG_MAX_N from 345 to 347 Yuyang Du
@ 2016-08-24 16:01   ` Vincent Guittot
  2016-08-29  0:07     ` Yuyang Du
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Guittot @ 2016-08-24 16:01 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On 10 August 2016 at 02:14, Yuyang Du <yuyang.du@intel.com> wrote:
> In commit 5b51f2f80b3b906ce59bd4dce6eca3c7f34cb1b9
> Author: Paul Turner <pjt@google.com>
> Date:   Thu Oct 4 13:18:32 2012 +0200
>
>     sched: Make __update_entity_runnable_avg() fast
>
> Paul has a program to compute LOAD_AVG_MAX_N, which basically means
> how many periods (at least) are needed for LOAD_AVG_MAX, and the result
> of calc_conv(1024) is 345:
>
>   long mult_inv(long c, int n) {
>         return (c * runnable_avg_yN_inv[n]) >>  WMULT_SHIFT;
>   }
>
>   void calc_conv(long n) {
>         long old_n;
>         int i = -1;
>
>         printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n");
>         do {
>                 old_n = n;
>                 n = mult_inv(n, 1) + 1024;
>                 i++;
>         } while (n != old_n);
>         printf("%d> %ld\n", i - 1, n);
>         printf("\n");
>   }
>
> The initial value of i is -1, which should be 1 as far as I can tell.
> Accordingly, the final result of LOAD_AVG_MAX_N should be changed
> from 345 to 347.
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  kernel/sched/fair.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4088eed..997297b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -666,7 +666,7 @@ static unsigned long task_h_load(struct task_struct *p);
>   */
>  #define LOAD_AVG_PERIOD 32
>  #define LOAD_AVG_MAX 47742 /* maximum possible load avg */
> -#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
> +#define LOAD_AVG_MAX_N 347 /* number of full periods to produce LOAD_AVG_MAX */

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

>
>  /* Give new sched_entity start runnable values to heavy its load in infant time */
>  void init_entity_runnable_average(struct sched_entity *se)
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 07/10] sched/fair: Remove useless 64-bit to 32-bit variable conversion
  2016-08-10  0:14 ` [PATCH v1 07/10] sched/fair: Remove useless 64-bit to 32-bit variable conversion Yuyang Du
@ 2016-08-25  7:11   ` Vincent Guittot
  2016-08-29  1:00     ` Yuyang Du
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Guittot @ 2016-08-25  7:11 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On 10 August 2016 at 02:14, Yuyang Du <yuyang.du@intel.com> wrote:
> In __decay_sum(), the 64-bit to 32-bit variable conversion makes no
> performance nor correctness use.

Are you sure that there is no impact on 32bits system ?
Even this will be mainly shift manipulation


>
> Minor cleanup and no functionality change.
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  kernel/sched/fair.c |   13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 283e2c2..eea3349 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2616,16 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
>   */
>  static __always_inline u64 __decay_sum(u64 val, u64 n)
>  {
> -       unsigned int local_n;
> -
>         if (!n)
>                 return val;
>         else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
>                 return 0;
>
> -       /* after bounds checking we can collapse to 32-bit */
> -       local_n = n;
> -
>         /*
>          * As y^HALFLIFE = 1/2, we can combine
>          *    y^n = 1/2^(n/HALFLIFE) * y^(n%HALFLIFE)
> @@ -2633,12 +2628,12 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
>          *
>          * To achieve constant time __decay_load.
>          */
> -       if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
> -               val >>= local_n / SCHED_AVG_HALFLIFE;
> -               local_n %= SCHED_AVG_HALFLIFE;
> +       if (unlikely(n >= SCHED_AVG_HALFLIFE)) {
> +               val >>= n / SCHED_AVG_HALFLIFE;
> +               n %= SCHED_AVG_HALFLIFE;
>         }
>
> -       val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
> +       val = mul_u64_u32_shr(val, __decay_inv_multiply_N[n], 32);
>         return val;
>  }
>
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 30+ messages in thread
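
The constant-time decay in the quoted hunk relies on the identity
y^n = (1/2)^(n/HALFLIFE) * y^(n%HALFLIFE), with y^HALFLIFE = 1/2 and
HALFLIFE = 32. A quick floating-point sanity check of that identity
(illustration only, using doubles rather than the kernel's fixed-point
table; compile with -lm):

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
          double y = pow(0.5, 1.0 / 32);  /* so that y^32 == 1/2 */
          unsigned int n = 100;           /* arbitrary number of periods */

          /* direct evaluation vs. the split form used for constant time */
          printf("%.10f\n", pow(y, n));
          printf("%.10f\n", pow(0.5, n / 32) * pow(y, n % 32));
          return 0;
  }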

* Re: [PATCH v1 01/10] sched/fair: Chance LOAD_AVG_MAX_N from 345 to 347
  2016-08-24 16:01   ` Vincent Guittot
@ 2016-08-29  0:07     ` Yuyang Du
  0 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-29  0:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On Wed, Aug 24, 2016 at 06:01:32PM +0200, Vincent Guittot wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4088eed..997297b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -666,7 +666,7 @@ static unsigned long task_h_load(struct task_struct *p);
> >   */
> >  #define LOAD_AVG_PERIOD 32
> >  #define LOAD_AVG_MAX 47742 /* maximum possible load avg */
> > -#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
> > +#define LOAD_AVG_MAX_N 347 /* number of full periods to produce LOAD_AVG_MAX */
> 
> Acked-by: Vincent Guittot <vincent.guitto@linar.org>

Thanks, Vincent.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 07/10] sched/fair: Remove useless 64-bit to 32-bit variable conversion
  2016-08-25  7:11   ` Vincent Guittot
@ 2016-08-29  1:00     ` Yuyang Du
  0 siblings, 0 replies; 30+ messages in thread
From: Yuyang Du @ 2016-08-29  1:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On Thu, Aug 25, 2016 at 09:11:56AM +0200, Vincent Guittot wrote:
> On 10 August 2016 at 02:14, Yuyang Du <yuyang.du@intel.com> wrote:
> > In __decay_sum(), the 64-bit to 32-bit variable conversion makes no
> > performance nor correctness use.
> 
> Are you sure that there is no impact on 32bits system ?
> Even this will be mainly shift manipulation
> 
 
Honestly, I didn't consider 32-bit systems at all, :(

And after all, it's a 64-bit value shifted by a 32-bit count vs. a 64-bit
value shifted by a 64-bit count. I'm not an expert, but I tend to believe
they cost about the same, because the arithmetic overhead seems to lie in
the left operand of the shift, which is 64-bit in both cases (correct me
if I'm wrong).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-23 15:39           ` Dietmar Eggemann
@ 2016-08-29  1:37             ` Yuyang Du
  2016-09-01 18:32               ` Dietmar Eggemann
  0 siblings, 1 reply; 30+ messages in thread
From: Yuyang Du @ 2016-08-29  1:37 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Matt Fleming,
	Mike Galbraith

On Tue, Aug 23, 2016 at 04:39:51PM +0100, Dietmar Eggemann wrote:
> On 23/08/16 15:45, Vincent Guittot wrote:
> > On 23 August 2016 at 16:13, Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
> >>> I still wonder if using a flat util hierarchy is the right solution to
> >>> solve this problem with utilization and task groups. I have noticed
> >>> exactly the same issue with load, which generates weird task placement
> >>> decisions, and I think that we should probably try to solve both wrong
> >>> behaviors with the same mechanism. But this is not possible with a flat
> >>> hierarchy for load.
> >>>
> >>> Let me take an example.
> >>> TA is an always-running task on CPU1 in group /root/level1/
> >>> TB wakes up on CPU0 and moves TA into group /root/level2/
> >>> Even if TA stays on CPU1, the runnable_load_avg of CPU1's root cfs_rq will become 0.
> >>
> >> Because while we migrate the load_avg on /root/level2, we do not
> >> propagate the load_avg up the hierarchy?
> > 
> > Yes. As of now, the load of a cfs_rq and the load of the sched_entity
> > that represents it at the parent level are disconnected.
> 
> I guess you say 'disconnected' because cfs_rq and se (w/ cfs_rq eq.
> se->my_q) are now independent PELT signals, whereas before the rewrite
> they were 'connected' for load via __update_tg_runnable_avg() and
> __update_group_entity_contrib() in __update_entity_load_avg_contrib(),
> and for utilization via 'se->avg.utilization_avg_contrib =
> group_cfs_rq(se)->utilization_load_avg' in
> __update_entity_utilization_avg_contrib().
 
I don't understand exactly what "disconnected" means, but with respect to
the group entity's load_avg, essentially nothing has changed:

group_entity_load_avg = my_cfs_rq_load_avg / tg_load_avg * tg_shares
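
Or, as a sketch (a made-up helper just for illustration, not the kernel's
actual code; truncation, clamping and overflow handling ignored):

  static unsigned long
  calc_group_entity_load_avg(unsigned long my_cfs_rq_load_avg,
                             unsigned long tg_load_avg,
                             unsigned long tg_shares)
  {
          if (!tg_load_avg)
                  return tg_shares;   /* this cfs_rq is the only contributor */

          return tg_shares * my_cfs_rq_load_avg / tg_load_avg;
  }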

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-24  8:54           ` Morten Rasmussen
  2016-08-24  9:48             ` Vincent Guittot
@ 2016-08-29 19:00             ` Yuyang Du
  2016-09-01 14:22               ` Morten Rasmussen
  1 sibling, 1 reply; 30+ messages in thread
From: Yuyang Du @ 2016-08-29 19:00 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On Wed, Aug 24, 2016 at 09:54:35AM +0100, Morten Rasmussen wrote:
> As Dietmar mentioned already, the 'disconnect' is a feature of the PELT
> rewrite. Paul and Ben's original implementation had full propagation up
> and down the hierarchy. IIRC, one of the key points of the rewrite was
> more 'stable' signals, which we would lose by re-introducing immediate
> updates throughout the hierarchy.
 
As I mentioned earlier, there is no essential change! A feature, perhaps,
is that the rewrite takes the runnable ratio into account.

E.g., let there be a group with one task and shares of 1024; the task
sticks to one CPU and is runnable 50% of the time.

With the old implementation, the group_entity_load_avg is 1024; but with
the rewritten implementation, the group_entity_load_avg is 512. Isn't this
good?
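
(To spell the numbers out: with tg_shares = 1024 and a runnable ratio of
50%, i.e. 512/1024, the rewrite yields 1024 * 512 / 1024 = 512.)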

If the task migrates, the old implementation will still report 1024 on the
new CPU, but the rewritten implementation will transition to 512, albeit
taking 0.1+ seconds, which we are now addressing. Isn't this good?

> It is a significant change to group scheduling, so I'm a bit surprised
> that nobody has observed any problems post the rewrite. But maybe most
> users don't care about the load-balance being slightly off when tasks
> have migrated or new tasks are added to a group.
 
I don't understand what you are saying.

> If we want to re-introduce propagation of both load and utilization I
> would suggest that we just look at the original implementation. It
> seemed to work.
>
> Handling utilization and load differently will inevitably result in more
> code. The 'flat hierarchy' approach seems slightly less complicated, but
> it prevents us from using group utilization later should we wish to do
> so. It might for example become useful for the schedutil cpufreq
> governor should it ever consider selecting frequencies differently based
> on whether the current task is in a (specific) group or not.

I understand group util may have some use should you attempt to do so, but
I'm not sure how realistic that is.

Nothing prevents you from knowing which (specific) group the current task
belongs to.
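
For example, a minimal sketch (assuming the usual task_group()/->parent
helpers; this is not code from this series) of testing whether a task sits
in a given group or below it:

  static bool task_in_group(struct task_struct *p, struct task_group *tg)
  {
          struct task_group *iter;

          /* Walk from the task's own group towards the root group. */
          for (iter = task_group(p); iter; iter = iter->parent)
                  if (iter == tg)
                          return true;

          return false;
  }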

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-29 19:00             ` Yuyang Du
@ 2016-09-01 14:22               ` Morten Rasmussen
  0 siblings, 0 replies; 30+ messages in thread
From: Morten Rasmussen @ 2016-09-01 14:22 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Dietmar Eggemann, Matt Fleming,
	Mike Galbraith

On Tue, Aug 30, 2016 at 03:00:58AM +0800, Yuyang Du wrote:
> On Wed, Aug 24, 2016 at 09:54:35AM +0100, Morten Rasmussen wrote:
> > As Dietmar mentioned already, the 'disconnect' is a feature of the PELT
> > rewrite. Paul and Ben's original implementation had full propagation up
> > and down the hierarchy. IIRC, one of the key points of the rewrite was
> > more 'stable' signals, which we would lose by re-introducing immediate
> > updates throughout the hierarchy.
>  
> As I mentioned earlier, there is no essential change!

I don't quite agree with that; it is a very significant change, and you
describe the problem yourself further down.

> A feature, perhaps,
> is that the rewrite takes the runnable ratio into account.
> 
> E.g., let there be a group with one task and shares of 1024; the task
> sticks to one CPU and is runnable 50% of the time.
> 
> With the old implementation, the group_entity_load_avg is 1024; but with
> the rewritten implementation, the group_entity_load_avg is 512. Isn't this
> good?
> 
> If the task migrates, the old implementation will still report 1024 on the
> new CPU, but the rewritten implementation will transition to 512, albeit
> taking 0.1+ seconds, which we are now addressing. Isn't this good?

No, this is exactly the problem. When you migrate a task, you don't see
the effect immediately after the rewrite. You may have multiple cpus
load-balancing in the meantime before the group load/utilization have
settled after the migration. They will see a wrong picture of the
load/utilization. The cpu the task was migrated from appears to
still have load/utilization, and the cpu it has migrated to appears
nearly idle while in fact it is busy running the migrated task. That may
lead other cpus to put even more tasks on the cpu where the task was
migrated to.

We need the cpu load/utilization to be updated immediately to make
meaningful load-balancing decisions. Before the rewrite, that was ensured
by updating the entire group hierarchy when a task was added/removed: the
task entity's contribution rippled through to the root cfs_rq
load/utilization immediately, as every group cfs_rq's load/utilization
directly impacted its group entity's load/utilization. They were sort of
'connected'.
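
Roughly, in a simplified sketch (made-up types, not the pre-rewrite code
itself), the 'connected' scheme amounted to:

  struct cfs_rq_avg {
          long load_avg;
  };

  struct sched_ent {
          struct sched_ent *parent;   /* NULL at the root level      */
          struct cfs_rq_avg *cfs_rq;  /* the runqueue this se is on  */
          long contrib;               /* contribution to that cfs_rq */
  };

  /* Fold a contribution change into every level up to the root. */
  static void propagate_contrib_delta(struct sched_ent *se, long delta)
  {
          for (; se; se = se->parent) {
                  se->contrib += delta;
                  se->cfs_rq->load_avg += delta;
          }
  }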

That is no longer the case, and that is why we have problems with
getting an accurate estimate of cpu utilization when we have task
groups.

> 
> > It is a significant change to group scheduling, so I'm a bit surprised
> > that nobody has observed any problems post the rewrite. But maybe most
> > users don't care about the load-balance being slightly off when tasks
> > have migrated or new tasks are added to a group.
>  
> I don't understand what you are saying.

See above. Maybe nobody has noticed the difference if they primarily
have use-cases with many long-running tasks.

> 
> > If we want to re-introduce propagation of both load and utilization I
> > would suggest that we just look at the original implementation. It
> > seemed to work.
> >
> > Handling utilization and load differently will inevitably result in more
> > code. The 'flat hierarchy' approach seems slightly less complicated, but
> > it prevents us from using group utilization later should we wish to do
> > so. It might for example become useful for the schedutil cpufreq
> > governor should it ever consider selecting frequencies differently based
> > on whether the current task is in a (specific) group or not.
> 
> I understand group util may have some use should you attempt to do so, but
> I'm not sure how realistic that is.

I'm not sure if it will be useful either; so far it is just an idea.

> Nothing prevents you from knowing which (specific) group the current task
> belongs to.

True, but it might be useful to know the utilization of the entire
group. However, in that case I guess the system-wide group utilization
might be more useful than the group utilization on the particular cpu.
But I haven't thought it through.

Morten

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy
  2016-08-29  1:37             ` Yuyang Du
@ 2016-09-01 18:32               ` Dietmar Eggemann
  0 siblings, 0 replies; 30+ messages in thread
From: Dietmar Eggemann @ 2016-09-01 18:32 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Matt Fleming,
	Mike Galbraith

On 29/08/16 02:37, Yuyang Du wrote:
> On Tue, Aug 23, 2016 at 04:39:51PM +0100, Dietmar Eggemann wrote:
>> On 23/08/16 15:45, Vincent Guittot wrote:
>>> On 23 August 2016 at 16:13, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Tue, Aug 23, 2016 at 03:28:19PM +0200, Vincent Guittot wrote:
>>>>> I still wonder if using a flat util hierarchy is the right solution to
>>>>> solve this problem with utilization and task groups. I have noticed
>>>>> exactly the same issue with load, which generates weird task placement
>>>>> decisions, and I think that we should probably try to solve both wrong
>>>>> behaviors with the same mechanism. But this is not possible with a flat
>>>>> hierarchy for load.
>>>>>
>>>>> Let me take an example.
>>>>> TA is an always-running task on CPU1 in group /root/level1/
>>>>> TB wakes up on CPU0 and moves TA into group /root/level2/
>>>>> Even if TA stays on CPU1, the runnable_load_avg of CPU1's root cfs_rq will become 0.
>>>>
>>>> Because while we migrate the load_avg on /root/level2, we do not
>>>> propagate the load_avg up the hierarchy?
>>>
>>> Yes. As of now, the load of a cfs_rq and the load of the sched_entity
>>> that represents it at the parent level are disconnected.
>>
>> I guess you say 'disconnected' because cfs_rq and se (w/ cfs_rq eq.
>> se->my_q) are now independent PELT signals, whereas before the rewrite
>> they were 'connected' for load via __update_tg_runnable_avg() and
>> __update_group_entity_contrib() in __update_entity_load_avg_contrib(),
>> and for utilization via 'se->avg.utilization_avg_contrib =
>> group_cfs_rq(se)->utilization_load_avg' in
>> __update_entity_utilization_avg_contrib().
>  
> I don't understand exactly what "disconnected" means, but with respect to
> the group entity's load_avg, essentially nothing has changed:
> 

True, but this is the update_cfs_shares() side of things.

> group_entity_load_avg = my_cfs_rq_load_avg / tg_load_avg * tg_shares
> 

'Connected' for me in the old implementation stands for the fact that
for every call to update_entity_load_avg(se, 1) (with
!entity_is_task(se)), the group cfs_rq's (se->my_q) contribution towards
the se is updated in __update_entity_[load|utilization]_avg_contrib() and
the returned delta is added to the appropriate cfs_rq (se->cfs_rq)
values immediately. Doing this in for_each_sched_entity(se) gives this
nice propagation effect towards the root cfs_rq.
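
A heavily simplified sketch of that per-level step (made-up names, not the
old code itself):

  struct grp_rq {
          long load_avg;
  };

  struct grp_se {
          struct grp_rq *my_q;    /* the cfs_rq this group se owns   */
          struct grp_rq *cfs_rq;  /* the cfs_rq this group se is on  */
          long contrib;           /* contribution towards se->cfs_rq */
  };

  static long update_grp_se_contrib(struct grp_se *se)
  {
          long delta = se->my_q->load_avg - se->contrib;

          se->contrib = se->my_q->load_avg;
          se->cfs_rq->load_avg += delta;  /* applied to the parent immediately */

          return delta;
  }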

In the new implementation, se, se->my_q and se->cfs_rq have independent
PELT signals, hence the 'disconnected'.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread

Thread overview: 30+ messages
2016-08-10  0:14 [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
2016-08-10  0:14 ` [PATCH v1 01/10] sched/fair: Chance LOAD_AVG_MAX_N from 345 to 347 Yuyang Du
2016-08-24 16:01   ` Vincent Guittot
2016-08-29  0:07     ` Yuyang Du
2016-08-10  0:14 ` [PATCH v1 02/10] documentation: Add scheduler/sched-avg.txt Yuyang Du
2016-08-22 16:48   ` Randy Dunlap
2016-08-10  0:14 ` [PATCH v1 03/10] sched/fair: Add static to remove_entity_load_avg() Yuyang Du
2016-08-10  0:14 ` [PATCH v1 04/10] sched/fair: Rename variable names for sched averages Yuyang Du
2016-08-10  0:14 ` [PATCH v1 05/10] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
2016-08-10  0:14 ` [PATCH v1 06/10] sched/fair: Optimize __update_sched_avg() Yuyang Du
2016-08-10  0:14 ` [PATCH v1 07/10] sched/fair: Remove useless 64-bit to 32-bit variable conversion Yuyang Du
2016-08-25  7:11   ` Vincent Guittot
2016-08-29  1:00     ` Yuyang Du
2016-08-10  0:14 ` [PATCH v1 08/10] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
2016-08-10  0:14 ` [PATCH v1 09/10] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
2016-08-10  0:14 ` [PATCH v1 10/10] sched/fair: Implement flat hierarchical structure for util_avg Yuyang Du
2016-08-10  0:23 ` [PATCH v1 00/10] Optimize sched avgs computation and implement flat util hierarchy Yuyang Du
2016-08-22 23:26   ` Yuyang Du
2016-08-23 13:28     ` Vincent Guittot
2016-08-23 14:13       ` Peter Zijlstra
2016-08-23 14:45         ` Vincent Guittot
2016-08-23 15:39           ` Dietmar Eggemann
2016-08-29  1:37             ` Yuyang Du
2016-09-01 18:32               ` Dietmar Eggemann
2016-08-24  8:54           ` Morten Rasmussen
2016-08-24  9:48             ` Vincent Guittot
2016-08-29 19:00             ` Yuyang Du
2016-09-01 14:22               ` Morten Rasmussen
2016-08-23 19:09         ` Yuyang Du
2016-08-23 19:01       ` Yuyang Du
