linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages
@ 2016-05-03 20:02 Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
                   ` (11 more replies)
  0 siblings, 12 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Hi Peter,

This patch series combines the previous cleanup and optimization
series. As you and Ingo suggested, the increased kernel load scale is
reinstated when both 64BIT and FAIR_GROUP_SCHED are enabled.

This patch series should cause no perceivable change to load and
util, except that load's range is increased by a factor of 1024.

My initial tests support this; see the previous post for the figures:
http://article.gmane.org/gmane.linux.kernel/2213506.
The workloads run 100us out of every 200us, and 2000us out of every
8000us; again with fixed workload, fixed CPU, and fixed freq.

I believe the code is cleaner and more efficient after these patches.

The changes in this version are changelog and code comment rewording
based on Peter's comments.

Thanks,
Yuyang

Yuyang Du (12):
  sched/fair: Optimize sum computation with a lookup table
  sched/fair: Rename variable names for sched averages
  sched/fair: Change the variable to hold the number of periods to
    32bit
  sched/fair: Add __always_inline compiler attribute to
    __accumulate_sum()
  sched/fair: Optimize __update_sched_avg()
  documentation: Add scheduler/sched-avg.txt
  sched/fair: Generalize the load/util averages resolution definition
  sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  sched/fair: Add introduction to the sched average metrics
  sched/fair: Remove scale_load_down() for load_avg
  sched/fair: Rename scale_load() and scale_load_down()
  sched/fair: Enable increased scale for kernel load

 Documentation/scheduler/sched-avg.txt |   94 ++++++++
 include/linux/sched.h                 |   81 ++++++-
 kernel/sched/core.c                   |    8 +-
 kernel/sched/fair.c                   |  400 +++++++++++++++++----------------
 kernel/sched/sched.h                  |   48 ++--
 5 files changed, 398 insertions(+), 233 deletions(-)
 create mode 100644 Documentation/scheduler/sched-avg.txt

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 01/12] sched/fair: Optimize sum computation with a lookup table
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 02/12] sched/fair: Rename variable names for sched averages Yuyang Du
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

__compute_runnable_contrib() uses a loop to compute the sum, whereas
a table lookup can do it faster, in constant time.
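
As a worked example (an illustration of the lookup, not additional
code in the patch): for n = 100, the new path reads
__accumulated_sum_N32[100/32] = __accumulated_sum_N32[3] = 40899, the
precomputed sum over the first 96 periods, decays it by the remaining
100%32 = 4 periods via decay_load(), and adds runnable_avg_yN_sum[4] =
3880 for the 4 most recent periods. The old loop needed three
halve-and-add iterations just to build that 96-period sum.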

The program to generate the constants is located at:
Documentation/scheduler/sched-avg.txt

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8a33ab..e803f11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
 };
 
 /*
+ * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
+ * lower integers.
+ */
+static const u32 __accumulated_sum_N32[] = {
+	    0, 23371, 35056, 40899, 43820, 45281,
+	46011, 46376, 46559, 46650, 46696, 46719,
+};
+
+/*
  * Approximate:
  *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
  */
@@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
 	else if (unlikely(n >= LOAD_AVG_MAX_N))
 		return LOAD_AVG_MAX;
 
-	/* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
-	do {
-		contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
-		contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
-
-		n -= LOAD_AVG_PERIOD;
-	} while (n > LOAD_AVG_PERIOD);
-
+	/* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
+	contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
+	n %= LOAD_AVG_PERIOD;
 	contrib = decay_load(contrib, n);
 	return contrib + runnable_avg_yN_sum[n];
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 02/12] sched/fair: Rename variable names for sched averages
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit Yuyang Du
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

The names of the sched averages (including load_avg and util_avg)
have been changed and added to over the past couple of years; some of
them are a bit confusing, especially to people reading them for the
first time. This patch attempts to make the names more
self-explanatory, and updates some comments as well.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  209 ++++++++++++++++++++++++++-------------------------
 1 file changed, 107 insertions(+), 102 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e803f11..2b83b4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -660,13 +660,15 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
- * We choose a half-life close to 1 scheduling period.
- * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
- * dependent on this value.
+ * Note: everything in sched average calculation, including
+ * __decay_inv_multiply_N, __accumulated_sum_N, __accumulated_sum_N32,
+ * SCHED_AVG_MAX, and SCHED_AVG_MAX_N, is dependent on and only on
+ * (1) exponential decay, (2) a period of 1024*1024ns (~1ms), and (3)
+ * a half-life of 32 periods.
  */
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+#define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
+#define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
+#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -681,7 +683,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 */
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
-	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+	sa->load_sum = sa->load_avg * SCHED_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
@@ -731,7 +733,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		} else {
 			sa->util_avg = cap;
 		}
-		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+		sa->util_sum = sa->util_avg * SCHED_AVG_MAX;
 	}
 }
 
@@ -1834,7 +1836,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.load_sum / p->se.load.weight;
-		*period = LOAD_AVG_MAX;
+		*period = SCHED_AVG_MAX;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2583,7 +2585,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 #ifdef CONFIG_SMP
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
-static const u32 runnable_avg_yN_inv[] = {
+static const u32 __decay_inv_multiply_N[] = {
 	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
 	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
 	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
@@ -2596,7 +2598,7 @@ static const u32 runnable_avg_yN_inv[] = {
  * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
  * over-estimates when re-combining.
  */
-static const u32 runnable_avg_yN_sum[] = {
+static const u32 __accumulated_sum_N[] = {
 	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
 	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
 	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
@@ -2612,16 +2614,18 @@ static const u32 __accumulated_sum_N32[] = {
 };
 
 /*
- * Approximate:
- *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ * val * y^n, where y^m ~= 0.5
+ *
+ * n is the number of periods past; a period is ~1ms
+ * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
  */
-static __always_inline u64 decay_load(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u64 n)
 {
 	unsigned int local_n;
 
 	if (!n)
 		return val;
-	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+	else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
 		return 0;
 
 	/* after bounds checking we can collapse to 32-bit */
@@ -2634,36 +2638,36 @@ static __always_inline u64 decay_load(u64 val, u64 n)
 	 *
 	 * To achieve constant time decay_load.
 	 */
-	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
-		val >>= local_n / LOAD_AVG_PERIOD;
-		local_n %= LOAD_AVG_PERIOD;
+	if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
+		val >>= local_n / SCHED_AVG_HALFLIFE;
+		local_n %= SCHED_AVG_HALFLIFE;
 	}
 
-	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
 	return val;
 }
 
 /*
- * For updates fully spanning n periods, the contribution to runnable
- * average will be: \Sum 1024*y^n
+ * For updates fully spanning n periods, the accumulated contribution
+ * will be: \Sum 1024*y^n.
  *
- * We can compute this reasonably efficiently by combining:
- *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
+ * We can compute this efficiently by combining:
+ * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __compute_runnable_contrib(u64 n)
+static u32 __accumulate_sum(u64 n)
 {
 	u32 contrib = 0;
 
-	if (likely(n <= LOAD_AVG_PERIOD))
-		return runnable_avg_yN_sum[n];
-	else if (unlikely(n >= LOAD_AVG_MAX_N))
-		return LOAD_AVG_MAX;
+	if (likely(n <= SCHED_AVG_HALFLIFE))
+		return __accumulated_sum_N[n];
+	else if (unlikely(n >= SCHED_AVG_MAX_N))
+		return SCHED_AVG_MAX;
 
-	/* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
-	contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
-	n %= LOAD_AVG_PERIOD;
-	contrib = decay_load(contrib, n);
-	return contrib + runnable_avg_yN_sum[n];
+	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
+	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
+	n %= SCHED_AVG_HALFLIFE;
+	contrib = __decay_sum(contrib, n);
+	return contrib + __accumulated_sum_N[n];
 }
 
 #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
@@ -2673,35 +2677,35 @@ static u32 __compute_runnable_contrib(u64 n)
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series.  To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ * We can represent the historical contribution to sched average as the
+ * coefficients of a geometric series.  To do this we divide the history
+ * into segments of approximately 1ms (1024*1024ns); label the segment that
+ * occurred N-1024us ago p_N, with p_0 corresponding to the current period, e.g.
  *
  * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
  *      p0            p1           p2
  *     (now)       (~1ms ago)  (~2ms ago)
  *
- * Let u_i denote the fraction of p_i that the entity was runnable.
+ * Let u_i denote the fraction of p_i whose state (runnable/running) we count.
  *
  * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
+ * following representation of a sched metric:
  *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
  *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- *   y^32 = 0.5
+ * We choose y based on a half-life of 32 periods (which is ~32ms):
+ *   y^32 = 0.5 => y = (0.5)^(1/32)
  *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
+ * where 32 is the number of periods that a past period's contribution is
+ * halved. This means that the impact of a period every ~32ms ago will be
+ * as much as 50% of the previous value.
  *
  * When a period "rolls over" and we have new u_0`, multiplying the previous
  * sum again by y is sufficient to update:
- *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ *   avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *       = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
+__update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta, scaled_delta, periods;
@@ -2762,15 +2766,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		periods = delta / 1024;
 		delta %= 1024;
 
-		sa->load_sum = decay_load(sa->load_sum, periods + 1);
+		sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
 		if (cfs_rq) {
 			cfs_rq->runnable_load_sum =
-				decay_load(cfs_rq->runnable_load_sum, periods + 1);
+				__decay_sum(cfs_rq->runnable_load_sum, periods + 1);
 		}
-		sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
+		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
-		contrib = __compute_runnable_contrib(periods);
+		contrib = __accumulate_sum(periods);
 		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
@@ -2794,12 +2798,12 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->period_contrib += delta;
 
 	if (decayed) {
-		sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+		sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
 		if (cfs_rq) {
 			cfs_rq->runnable_load_avg =
-				div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+				div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
 		}
-		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
 	}
 
 	return decayed;
@@ -2867,8 +2871,8 @@ void set_task_rq_fair(struct sched_entity *se,
 		p_last_update_time = prev->avg.last_update_time;
 		n_last_update_time = next->avg.last_update_time;
 #endif
-		__update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
-				  &se->avg, 0, 0, NULL);
+		__update_sched_avg(p_last_update_time, cpu_of(rq_of(prev)),
+				   &se->avg, 0, 0, NULL);
 		se->avg.last_update_time = n_last_update_time;
 	}
 }
@@ -2909,7 +2913,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
+update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed, removed_load = 0, removed_util = 0;
@@ -2917,18 +2921,18 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sa->load_sum = max_t(s64, sa->load_sum - r * SCHED_AVG_MAX, 0);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
 		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sa->util_sum = max_t(s32, sa->util_sum - r * SCHED_AVG_MAX, 0);
 		removed_util = 1;
 	}
 
-	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+	decayed = __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
 		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
@@ -2943,7 +2947,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 }
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_sched_avg(struct sched_entity *se, int update_tg)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -2954,15 +2958,15 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	__update_sched_avg(now, cpu, &se->avg,
+			   se->on_rq * scale_load_down(se->load.weight),
+			   cfs_rq->curr == se, NULL);
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
+	if (update_cfs_rq_sched_avg(now, cfs_rq, true) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
 }
 
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (!sched_feat(ATTACH_AGE_LOAD))
 		goto skip_aging;
@@ -2972,8 +2976,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * have aged the average right before clearing @last_update_time.
 	 */
 	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-				  &se->avg, 0, 0, NULL);
+		__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+				   &se->avg, 0, 0, NULL);
 
 		/*
 		 * XXX: we could have just aged the entire load away if we've been
@@ -2991,11 +2995,11 @@ skip_aging:
 	cfs_rq_util_change(cfs_rq);
 }
 
-static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+			   &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			   cfs_rq->curr == se, NULL);
 
 	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
 	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
@@ -3007,7 +3011,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 /* Add the load generated by se into cfs_rq's load average */
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -3015,18 +3019,18 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	migrated = !sa->last_update_time;
 	if (!migrated) {
-		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
-			cfs_rq->curr == se, NULL);
+		__update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+				   se->on_rq * scale_load_down(se->load.weight),
+				   cfs_rq->curr == se, NULL);
 	}
 
-	decayed = update_cfs_rq_load_avg(now, cfs_rq, !migrated);
+	decayed = update_cfs_rq_sched_avg(now, cfs_rq, !migrated);
 
 	cfs_rq->runnable_load_avg += sa->load_avg;
 	cfs_rq->runnable_load_sum += sa->load_sum;
 
 	if (migrated)
-		attach_entity_load_avg(cfs_rq, se);
+		attach_entity_sched_avg(cfs_rq, se);
 
 	if (decayed || migrated)
 		update_tg_load_avg(cfs_rq, 0);
@@ -3034,9 +3038,9 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
+	update_sched_avg(se, 1);
 
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
@@ -3069,7 +3073,7 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
  */
-void remove_entity_load_avg(struct sched_entity *se)
+void remove_entity_sched_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 last_update_time;
@@ -3083,7 +3087,8 @@ void remove_entity_load_avg(struct sched_entity *se)
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
 
-	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+	__update_sched_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
+			   &se->avg, 0, 0, NULL);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
 	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
 }
@@ -3102,17 +3107,17 @@ static int idle_balance(struct rq *this_rq);
 
 #else /* CONFIG_SMP */
 
-static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void update_sched_avg(struct sched_entity *se, int update_tg) {}
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void remove_entity_load_avg(struct sched_entity *se) {}
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void remove_entity_sched_avg(struct sched_entity *se) {}
 
 static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 
 static inline int idle_balance(struct rq *rq)
 {
@@ -3272,7 +3277,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_entity_sched_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -3351,7 +3356,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_entity_sched_avg(cfs_rq, se);
 
 	if (schedstat_enabled())
 		update_stats_dequeue(cfs_rq, se, flags);
@@ -3431,7 +3436,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (schedstat_enabled())
 			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3535,7 +3540,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_sched_avg(prev, 0);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3551,7 +3556,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_sched_avg(curr, 1);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4424,7 +4429,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4484,7 +4489,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -5370,7 +5375,7 @@ static void migrate_task_rq_fair(struct task_struct *p)
 	 * will result in the wakee task is less decayed, but giving the wakee more
 	 * load sounds not bad.
 	 */
-	remove_entity_load_avg(&p->se);
+	remove_entity_sched_avg(&p->se);
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;
@@ -5381,7 +5386,7 @@ static void migrate_task_rq_fair(struct task_struct *p)
 
 static void task_dead_fair(struct task_struct *p)
 {
-	remove_entity_load_avg(&p->se);
+	remove_entity_sched_avg(&p->se);
 }
 #endif /* CONFIG_SMP */
 
@@ -6262,7 +6267,7 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
+		if (update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -6323,7 +6328,7 @@ static inline void update_blocked_averages(int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 	update_rq_clock(rq);
-	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
+	update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
@@ -8357,7 +8362,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	detach_entity_load_avg(cfs_rq, se);
+	detach_entity_sched_avg(cfs_rq, se);
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -8374,7 +8379,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 #endif
 
 	/* Synchronize task with its cfs_rq */
-	attach_entity_load_avg(cfs_rq, se);
+	attach_entity_sched_avg(cfs_rq, se);
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
@@ -8513,7 +8518,7 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 	for_each_possible_cpu(cpu) {
 		if (tg->se[cpu])
-			remove_entity_load_avg(tg->se[cpu]);
+			remove_entity_sched_avg(tg->se[cpu]);
 
 		/*
 		 * Only empty task groups can be destroyed; so we can speculatively
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 02/12] sched/fair: Rename variable names for sched averages Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-05 11:13   ` Morten Rasmussen
  2016-05-03 20:02 ` [PATCH v3 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

In the sched average update, a period is about 1ms, so a 32-bit
unsigned integer can hold a maximum of approximately 49
(=2^32/1000/3600/24) days' worth of periods.

For usual cases, 32 bits are big enough and 64 bits are needless. But
if a task sleeps longer than that, there can be two outcomes:

Consider a task that sleeps for m milliseconds, and let n = (u32)m.

1. If n >= 32*64, then the task's sched averages will surely be
   decayed to 0. In this case, it really doesn't matter that 32 bits
   cannot hold m. In other words, a task that slept 2 seconds and one
   that slept 50 days look the same from the sched average point of
   view.

2. If n < 32*64, the chance of getting here is very low, but if it
   happens, the task's sched averages MAY NOT be decayed to 0,
   depending on how big its sums are; the chance of reaching 0 is
   still good, as load_sum is far less than ~0ULL and util_sum far
   less than ~0U.

Nevertheless, what really matters is the worst-case scenario: what
happens when (u32)m == 0? In that case, after so long a sleep, we
treat the task as if it never slept and give it the same sched
averages it had before it slept. That is actually not bad, and
nothing to worry about: think of it the same way we treat a newborn
task.
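
To make case 1 concrete, here is a standalone sketch (plain C, not
kernel code) of why n >= 32*64 always decays the sums to 0: one
halving per 32-period half-life gives at least 64 halvings, which
clears any 64-bit sum.

#include <stdio.h>
#include <stdint.h>

#define HALFLIFE 32

int main(void)
{
	uint64_t val = ~0ULL;		/* worst case: largest possible 64-bit sum */
	uint32_t n = HALFLIFE * 64;	/* 2048 periods, i.e. roughly 2 seconds */
	uint32_t halvings = n / HALFLIFE;

	/* shifting a 64-bit value by >= 64 bits clears it; the kernel's
	 * decay function returns 0 early in this case */
	val = (halvings >= 64) ? 0 : val >> halvings;

	printf("sum after %u periods: %llu\n", (unsigned)n,
	       (unsigned long long)val);
	return 0;
}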

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b83b4c..34ccaa3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2619,18 +2619,13 @@ static const u32 __accumulated_sum_N32[] = {
  * n is the number of periods past; a period is ~1ms
  * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
  */
-static __always_inline u64 __decay_sum(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u32 n)
 {
-	unsigned int local_n;
-
 	if (!n)
 		return val;
 	else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
 		return 0;
 
-	/* after bounds checking we can collapse to 32-bit */
-	local_n = n;
-
 	/*
 	 * As y^PERIOD = 1/2, we can combine
 	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
@@ -2638,12 +2633,12 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
 	 *
 	 * To achieve constant time decay_load.
 	 */
-	if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
-		val >>= local_n / SCHED_AVG_HALFLIFE;
-		local_n %= SCHED_AVG_HALFLIFE;
+	if (unlikely(n >= SCHED_AVG_HALFLIFE)) {
+		val >>= n / SCHED_AVG_HALFLIFE;
+		n %= SCHED_AVG_HALFLIFE;
 	}
 
-	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
+	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[n], 32);
 	return val;
 }
 
@@ -2654,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __accumulate_sum(u64 n)
+static u32 __accumulate_sum(u32 n)
 {
 	u32 contrib = 0;
 
@@ -2708,8 +2703,8 @@ static __always_inline int
 __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
-	u32 contrib;
+	u64 delta, scaled_delta;
+	u32 contrib, periods;
 	unsigned int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq, scale_cpu;
 
@@ -2762,7 +2757,11 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		delta -= delta_w;
 
-		/* Figure out how many additional periods this update spans */
+		/*
+		 * Figure out how many additional periods this update spans.
+		 * A period is 1024*1024ns or ~1ms, so a 32bit integer can hold
+		 * approximately a maximum of 49 (=2^32/1000/3600/24) days.
+		 */
 		periods = delta / 1024;
 		delta %= 1024;
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum()
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (2 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Everybody else has it. If code size is not a problem,
__accumulate_sum() should have it too.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34ccaa3..a060ef2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2649,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __accumulate_sum(u32 n)
+static __always_inline u32 __accumulate_sum(u32 n)
 {
 	u32 contrib = 0;
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (3 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-10  8:46   ` Morten Rasmussen
  2016-05-03 20:02 ` [PATCH v3 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

__update_sched_avg() has these steps:

  1. add the remainder of the last incomplete period
  2. decay old sum
  3. accumulate new sum in full periods since last_update_time
  4. add the current incomplete period
  5. update averages

Previously, we computed steps 1, 3, and 4 separately, which made each
of them ugly in code and costly in overhead.

For example:

             c1          c3           c4
             ^           ^            ^
             |           |            |
           |<->|<----------------->|<--->|
   ... |---x---|------| ... |------|-----x (now)

c1, c3, and c4 are the accumulated (and meanwhile decayed) time
contributions of steps 1, 3, and 4 respectively.

Then the accumulated contribution to load_sum, for example, is:

contrib = c1 * weight * freq_scaled;
contrib += c3 * weight * freq_scaled;
contrib += c4 * weight * freq_scaled;

Obviously, we can optimize the above as:

contrib = c1 + c3 + c4;
contrib *= weight * freq_scaled;

After combining them, we have much cleaner code and spend fewer CPU
cycles.
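
As a worked numeric example (numbers made up for illustration):
suppose 300us of the current period had already been accounted
(period_contrib = 300) and 2000us have passed since last_update_time.
Then delta + period_contrib = 2300, so two period boundaries are
crossed and 2300 % 1024 = 252us fall into the new current period. The
combined contribution is

  c1 + c3 + c4 = decay(1024 - 300, 2) + __accumulated_sum_N[1] + 252
               = decay(724, 2) + 1002 + 252

and only this single sum is then scaled by weight and freq_scaled.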

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  189 ++++++++++++++++++++++++++-------------------------
 1 file changed, 95 insertions(+), 94 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a060ef2..495e5f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
  */
 #define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
 #define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
-#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
+#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
 
 /*
  * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
- * lower integers.
+ * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
  */
 static const u32 __accumulated_sum_N32[] = {
 	    0, 23371, 35056, 40899, 43820, 45281,
@@ -2616,8 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
 /*
  * val * y^n, where y^m ~= 0.5
  *
- * n is the number of periods past; a period is ~1ms
+ * n is the number of periods past. A period is ~1ms, so a 32bit
+ * integer can hold approximately a maximum of 49 (=2^32/1000/3600/24) days.
+ *
  * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
+ *
  */
 static __always_inline u64 __decay_sum(u64 val, u32 n)
 {
@@ -2649,20 +2652,30 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static __always_inline u32 __accumulate_sum(u32 n)
+static __always_inline u32
+__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
 {
-	u32 contrib = 0;
+	u32 contrib;
+
+	if (!periods)
+		return remainder - period_contrib;
 
-	if (likely(n <= SCHED_AVG_HALFLIFE))
-		return __accumulated_sum_N[n];
-	else if (unlikely(n >= SCHED_AVG_MAX_N))
+	if (unlikely(periods >= SCHED_AVG_MAX_N))
 		return SCHED_AVG_MAX;
 
-	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
-	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
-	n %= SCHED_AVG_HALFLIFE;
-	contrib = __decay_sum(contrib, n);
-	return contrib + __accumulated_sum_N[n];
+	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
+
+	periods -= 1;
+	if (likely(periods <= SCHED_AVG_HALFLIFE))
+		contrib = __accumulated_sum_N[periods];
+	else {
+		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
+		periods %= SCHED_AVG_HALFLIFE;
+		contrib = __decay_sum(contrib, periods);
+		contrib += __accumulated_sum_N[periods];
+	}
+
+	return contrib + remainder;
 }
 
 #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
@@ -2671,6 +2684,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
 
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
+static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
+	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
+{
+	u32 contrib, periods;
+	unsigned long scale_freq, scale_cpu;
+
+	scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+	delta += sa->period_contrib;
+	periods = delta >> 10; /* A period is 1024us (~1ms) */
+
+	/*
+	 * Accumulating *_sum has two steps.
+	 *
+	 * Step 1: decay old *_sum if we crossed period boundaries.
+	 */
+	if (periods) {
+		sa->load_sum = __decay_sum(sa->load_sum, periods);
+		if (cfs_rq) {
+			cfs_rq->runnable_load_sum =
+				__decay_sum(cfs_rq->runnable_load_sum, periods);
+		}
+		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
+	}
+
+	/*
+	 * Step 2: accumulate new *_sum since last_update_time. This at most has
+	 * three parts (at least one part): (1) remainder of incomplete last
+	 * period, (2) full periods since (1), and (3) incomplete current period.
+	 *
+	 * Fortunately, we can (and should) do all these three at once.
+	 */
+	delta %= 1024;
+	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
+	sa->period_contrib = delta;
+
+	contrib = cap_scale(contrib, scale_freq);
+	if (weight) {
+		sa->load_sum += weight * contrib;
+		if (cfs_rq)
+			cfs_rq->runnable_load_sum += weight * contrib;
+	}
+	if (running)
+		sa->util_sum += contrib * scale_cpu;
+
+	return periods;
+}
+
 /*
  * We can represent the historical contribution to sched average as the
  * coefficients of a geometric series.  To do this we divide the history
@@ -2701,12 +2763,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
  */
 static __always_inline int
 __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
+		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta;
-	u32 contrib, periods;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	u64 delta;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2727,85 +2786,27 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
-	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->period_contrib;
-	if (delta + delta_w >= 1024) {
-		decayed = 1;
-
-		/* how much left for next period will start over, we don't know yet */
-		sa->period_contrib = 0;
-
-		/*
-		 * Now that we know we're crossing a period boundary, figure
-		 * out how much from delta we need to complete the current
-		 * period and accrue it.
-		 */
-		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
-		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
-			if (cfs_rq) {
-				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
-			}
-		}
-		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
-
-		delta -= delta_w;
-
-		/*
-		 * Figure out how many additional periods this update spans.
-		 * A period is 1024*1024ns or ~1ms, so a 32bit integer can hold
-		 * approximately a maximum of 49 (=2^32/1000/3600/24) days.
-		 */
-		periods = delta / 1024;
-		delta %= 1024;
-
-		sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_sum =
-				__decay_sum(cfs_rq->runnable_load_sum, periods + 1);
-		}
-		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);
-
-		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
-		contrib = __accumulate_sum(periods);
-		contrib = cap_scale(contrib, scale_freq);
-		if (weight) {
-			sa->load_sum += weight * contrib;
-			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
-		}
-		if (running)
-			sa->util_sum += contrib * scale_cpu;
-	}
-
-	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
-	if (weight) {
-		sa->load_sum += weight * scaled_delta;
-		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
-	}
-	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
-
-	sa->period_contrib += delta;
+	/*
+	 * Now we know we crossed measurement unit boundaries. The *_avg
+	 * accrues by two steps:
+	 *
+	 * Step 1: accumulate *_sum since last_update_time. If we haven't
+	 * crossed period boundaries, finish.
+	 */
+	if (!accumulate_sum(delta, sa, cfs_rq, cpu, weight, running))
+		return 0;
 
-	if (decayed) {
-		sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_avg =
-				div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
-		}
-		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
+	/*
+	 * Step 2: update *_avg.
+	 */
+	sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
+	if (cfs_rq) {
+		cfs_rq->runnable_load_avg =
+			div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
 	}
+	sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
 
-	return decayed;
+	return 1;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 06/12] documentation: Add scheduler/sched-avg.txt
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (4 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 07/12] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

This doc file contains the program used to generate the constants
for computing the sched averages.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 Documentation/scheduler/sched-avg.txt |   94 +++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 Documentation/scheduler/sched-avg.txt

diff --git a/Documentation/scheduler/sched-avg.txt b/Documentation/scheduler/sched-avg.txt
new file mode 100644
index 0000000..45be4bd
--- /dev/null
+++ b/Documentation/scheduler/sched-avg.txt
@@ -0,0 +1,94 @@
+The following program is used to generate the constants for
+computing sched averages.
+
+==============================================================
+		C program (compile with -lm)
+==============================================================
+
+#include <math.h>
+#include <stdio.h>
+
+#define HALFLIFE 32
+#define SHIFT 32
+
+double y;
+
+void calc_decay_inv_multiply() {
+	int i;
+	unsigned int x;
+
+	printf("static const u32 __decay_inv_multiply_N[] = {");
+	for(i = 0; i < HALFLIFE; i++) {
+		x = ((1UL<<32)-1)*pow(y, i);
+
+		if (i % 6 == 0) printf("\n\t");
+		printf("0x%8x, ", x);
+	}
+	printf("\n};\n\n");
+}
+
+int sum = 1024;
+void calc_accumulated_sum() {
+	int i;
+
+	printf("static const u32 __accumulated_sum_N[] = {\n\t    0,");
+	for(i = 1; i <= HALFLIFE; i++) {
+		if (i == 1)
+			sum *= y;
+		else
+			sum = sum*y + 1024*y;
+
+		if (i % 11 == 0) printf("\n\t");
+		printf("%5d,", sum);
+	}
+	printf("\n};\n\n");
+}
+
+int n = 1;
+/* first period */
+long max = 1024;
+
+void calc_converged_max() {
+	long last = 0, y_inv = ((1UL<<32)-1)*y;
+
+	for (; ; n++) {
+		if (n > 1)
+			max = ((max*y_inv)>>SHIFT) + 1024;
+			/*
+			 * This is the same as:
+			 * max = max*y + 1024;
+			 */
+
+		if (last == max)
+			break;
+
+		last = max;
+	}
+	n--;
+	printf("#define SCHED_AVG_HALFLIFE %d\n", HALFLIFE);
+	printf("#define SCHED_AVG_MAX %ld\n", max);
+	printf("#define SCHED_AVG_MAX_N %d\n\n", n);
+}
+
+void calc_accumulated_sum_32() {
+	int i, x = sum;
+
+	printf("static const u32 __accumulated_sum_N32[] = {\n\t     0,");
+	for(i = 1; i <= n/HALFLIFE+1; i++) {
+		if (i > 1)
+			x = x/2 + sum;
+
+		if (i % 6 == 0) printf("\n\t");
+		printf("%6d,", x);
+	}
+	printf("\n};\n\n");
+}
+
+void main() {
+	y = pow(0.5, 1/(double)HALFLIFE);
+
+	calc_decay_inv_multiply();
+	calc_accumulated_sum();
+	calc_converged_max();
+	calc_accumulated_sum_32();
+}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 07/12] sched/fair: Generalize the load/util averages resolution definition
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (5 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Integer metrics need fixed-point arithmetic. In sched/fair, a few
metrics, including weight, load, load_avg, util_avg, freq, and
capacity, may have different fixed-point ranges.

To avoid errors relating to the fixed-point range of these metrics,
we define a basic fixed-point range, and then express all metrics in
terms of that basic range.

The basic range is 1024. Further, one can recursively apply this
basic range to obtain a larger range.

As pointed out by Ben Segall, weight (visible to user, e.g., NICE-0
has 1024) and load (e.g., NICE_0_LOAD) have independent ranges, but
they must be well calibrated.
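
As a minimal standalone sketch of the basic range (plain C, not part
of the patch; the macro names mirror the ones introduced below): a
ratio in [0, 1] is stored as an integer in [0, 1024], and multiplying
two such quantities needs one shift to stay in the basic range.

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	long util = 512;	/* 50%, i.e. 512/1024 */
	long freq = 768;	/* 75%, i.e. 768/1024 */

	/* 50% * 75% = 37.5%, i.e. 384/1024 */
	long scaled = util * freq >> SCHED_FIXEDPOINT_SHIFT;

	printf("%ld out of %ld\n", scaled, SCHED_FIXEDPOINT_SCALE);
	return 0;
}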

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   16 +++++++++++++---
 kernel/sched/fair.c   |    4 ----
 kernel/sched/sched.h  |   15 ++++++++++-----
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad9454d..33e7929 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -937,9 +937,19 @@ enum cpu_idle_type {
 };
 
 /*
+ * Integer metrics need fixed point arithmetic, e.g., sched/fair
+ * has a few: load, load_avg, util_avg, freq, and capacity.
+ *
+ * We define a basic fixed point arithmetic range, and then formalize
+ * all these metrics based on that basic range.
+ */
+# define SCHED_FIXEDPOINT_SHIFT	10
+# define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)
+
+/*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT	10
+#define SCHED_CAPACITY_SHIFT	SCHED_FIXEDPOINT_SHIFT
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
 /*
@@ -1205,8 +1215,8 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
  * The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 495e5f0..1a61137 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2678,10 +2678,6 @@ __accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
 	return contrib + remainder;
 }
 
-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69da6fc..996a137 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,18 +54,23 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_RESOLUTION	10
-# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION	0
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+/*
+ * NICE_0's weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them, the following must be true:
+ * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ */
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (6 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 07/12] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-05  7:46   ` [PATCH v4] " Ingo Molnar
  2016-05-03 20:02 ` [PATCH v3 09/12] sched/fair: Add introduction to the sched average metrics Yuyang Du
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

After cleaning up the sched metrics, these two ambiguity-causing
definitions are no longer needed. Use NICE_0_LOAD_SHIFT and
NICE_0_LOAD instead (their names clearly say what they are).

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c  |    4 ++--
 kernel/sched/sched.h |   22 +++++++++++-----------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a61137..017a26a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -721,7 +721,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
-	long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -7019,7 +7019,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
+				      scale_load_down(NICE_0_LOAD);
 		if (load_above_capacity > busiest->group_capacity)
 			load_above_capacity -= busiest->group_capacity;
 		else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 996a137..1a3be6f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,25 +54,25 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
-
 /*
- * NICE_0's weight (visible to user) and its load (invisible to user) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them, the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
  */
-#define NICE_0_LOAD		SCHED_LOAD_SCALE
-#define NICE_0_SHIFT		SCHED_LOAD_SHIFT
+#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
@@ -861,7 +861,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 struct sched_group_capacity {
 	atomic_t ref;
 	/*
-	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
 	unsigned int capacity;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 09/12] sched/fair: Add introduction to the sched average metrics
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (7 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 10/12] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

These sched metrics have become complex enough that they deserve a proper
introduction. Document them at their definitions.
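
As a quick sanity check of the [Overflow issue] bound added below (a
standalone user-space sketch, not part of the patch; 47742 is the maximum
accumulated sum per unit weight and 88761 the nice -20 task weight):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t max_avg = 47742;		/* maximum accumulated sum per unit weight */
	const uint64_t max_weight = 88761;	/* nice -20 task weight */

	/* how many always-runnable max-weight entities fit in a 64bit load_sum */
	uint64_t n = UINT64_MAX / max_avg / max_weight;

	/* prints ~4353082796, which already exceeds PID_MAX_LIMIT (4194304) */
	printf("%llu entities\n", (unsigned long long)n);
	return 0;
}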

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   60 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 33e7929..a7cddd6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,18 +1211,56 @@ struct load_weight {
 };
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors frequency scaling into the amount of time that a
- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
- * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
- * For cfs_rq, it is the aggregated such times of all runnable and
+ * The load_avg/util_avg accumulates an infinite geometric series
+ * (see __update_sched_avg() in kernel/sched/fair.c).
+ *
+ * [load_avg definition]
+ *
+ * load_avg = runnable% * scale_load_down(load)
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated such load_avg of all runnable and
  * blocked sched_entities.
- * The 64 bit load_sum can:
- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
- * the highest weight (=88761) always runnable, we should not overflow
- * 2) for entity, support any load.weight always runnable
+ *
+ * load_avg may also take frequency scaling into account:
+ *
+ * load_avg = runnable% * scale_load_down(load) * freq%
+ *
+ * where freq% is the CPU frequency normalized to the highest frequency
+ *
+ * [util_avg definition]
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE
+ *
+ * where running% is the time ratio that a sched_entity is running on
+ * a CPU. For cfs_rq, it is the aggregated such util_avg of all runnable
+ * and blocked sched_entities.
+ *
+ * util_avg may also factor frequency scaling and CPU capacity scaling:
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
+ *
+ * where freq% is the same as above, and capacity% is the CPU capacity
+ * normalized to the greatest capacity (due to uarch differences, etc).
+ *
+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
+ * themselves are in the range of [0, 1]. To do fixed point arithmetic,
+ * we therefore scale them up to as large a range as necessary. This is,
+ * for example, reflected by util_avg's SCHED_CAPACITY_SCALE.
+ *
+ * [Overflow issue]
+ *
+ * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
+ * with the highest load (=88761) always runnable on a single cfs_rq, we
+ * should not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * For all other cases (including 32bit kernel), struct load_weight's
+ * weight will overflow first before we do, because:
+ *
+ *    Max(load_avg) <= Max(load.weight)
+ *
+ * Then, it is the load_weight's responsibility to consider overflow
+ * issues.
  */
 struct sched_avg {
 	u64 last_update_time, load_sum;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 10/12] sched/fair: Remove scale_load_down() for load_avg
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (8 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 09/12] sched/fair: Add introduction to the sched average metrics Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 11/12] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 12/12] sched/fair: Enable increased scale for kernel load Yuyang Du
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
down of load does not make much sense, because load_avg is primarily THE
load, and on top of that we take the runnable time ratio into account.

We therefore remove scale_load_down() for load_avg. But we must carefully
consider the overflow risk if load has the increased fixed point range
(2*SCHED_FIXEDPOINT_SHIFT). The only case in which an overflow may occur
is a 64bit kernel with that increased range. There, the 64bit load_sum can
accommodate 4251057 (=2^64/47742/88761/1024) entities with the highest
load (=88761*1024) always runnable on one single cfs_rq, which should be
fine. Even if an overflow does occur, the load average would not be useful
under such a condition anyway. And afterwards, if the machine survives,
the load will correct itself very quickly, in no more than ~2 seconds
(=32ms*64).
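
Both numbers above can be reproduced with a small user-space sketch (not part
of the patch; the constants are the ones quoted in this changelog):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t max_avg = 47742;		/* maximum accumulated sum per unit weight */
	const uint64_t max_weight = 88761;	/* nice -20 task weight, user scale */

	/* entities that fit in a 64bit load_sum with the x1024 kernel load scale */
	uint64_t n = UINT64_MAX / max_avg / max_weight / 1024;

	/* worst-case decay to zero: one halving per 32ms half-life, 64 halvings */
	unsigned int decay_ms = 32 * 64;

	/* prints ~4251057 entities and 2048 ms (~2 seconds) */
	printf("%llu entities, ~%u ms to decay to zero\n",
	       (unsigned long long)n, decay_ms);
	return 0;
}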

[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   19 ++++++++++++++-----
 kernel/sched/fair.c   |   19 +++++++++----------
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7cddd6..b718cb0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1216,7 +1216,7 @@ struct load_weight {
  *
  * [load_avg definition]
  *
- * load_avg = runnable% * scale_load_down(load)
+ * load_avg = runnable% * load
  *
  * where runnable% is the time ratio that a sched_entity is runnable.
  * For cfs_rq, it is the aggregated such load_avg of all runnable and
@@ -1224,7 +1224,7 @@ struct load_weight {
  *
  * load_avg may also take frequency scaling into account:
  *
- * load_avg = runnable% * scale_load_down(load) * freq%
+ * load_avg = runnable% * load * freq%
  *
  * where freq% is the CPU frequency normalized to the highest frequency
  *
@@ -1250,9 +1250,18 @@ struct load_weight {
  *
  * [Overflow issue]
  *
- * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
- * with the highest load (=88761) always runnable on a single cfs_rq, we
- * should not overflow as the number already hits PID_MAX_LIMIT.
+ * On 64bit kernel:
+ *
+ * When load has small fixed point range (SCHED_FIXEDPOINT_SHIFT), the
+ * 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
+ * the highest load (=88761) always runnable on a cfs_rq, we should
+ * not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * When load has large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
+ * with the highest load (=88761*1024) always runnable on ONE cfs_rq,
+ * we should be fine. Even if the overflow does occur in the end, the
+ * load_avg won't be useful anyway in that situation.
  *
  * For all other cases (including 32bit kernel), struct load_weight's
  * weight will overflow first before we do, because:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 017a26a..4487c2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,7 +682,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	sa->load_avg = se->load.weight;
 	sa->load_sum = sa->load_avg * SCHED_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
@@ -2929,7 +2929,7 @@ update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	}
 
 	decayed = __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+				     cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
@@ -2954,8 +2954,7 @@ static inline void update_sched_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_sched_avg(now, cpu, &se->avg,
-			   se->on_rq * scale_load_down(se->load.weight),
+	__update_sched_avg(now, cpu, &se->avg, se->on_rq * se->load.weight,
 			   cfs_rq->curr == se, NULL);
 
 	if (update_cfs_rq_sched_avg(now, cfs_rq, true) && update_tg)
@@ -2994,7 +2993,7 @@ skip_aging:
 static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			   &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			   &se->avg, se->on_rq * se->load.weight,
 			   cfs_rq->curr == se, NULL);
 
 	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
@@ -3016,7 +3015,7 @@ enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	migrated = !sa->last_update_time;
 	if (!migrated) {
 		__update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-				   se->on_rq * scale_load_down(se->load.weight),
+				   se->on_rq * se->load.weight,
 				   cfs_rq->curr == se, NULL);
 	}
 
@@ -7018,10 +7017,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-				      scale_load_down(NICE_0_LOAD);
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
+		if (load_above_capacity > scale_load(busiest->group_capacity))
+			load_above_capacity -=
+				scale_load(busiest->group_capacity);
 		else
 			load_above_capacity = ~0UL;
 	}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 11/12] sched/fair: Rename scale_load() and scale_load_down()
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (9 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 10/12] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  2016-05-03 20:02 ` [PATCH v3 12/12] sched/fair: Enable increased scale for kernel load Yuyang Du
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Rename scale_load() and scale_load_down() to user_to_kernel_load()
and kernel_to_user_load() respectively. The new names state clearly
which direction each conversion goes, and thus avoid confusion.
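
A minimal user-space sketch of the calibration the new names are meant to make
obvious (the macro bodies mirror the increased-range branch in
kernel/sched/sched.h; nice_0_weight is sched_prio_to_weight[20]):

#include <assert.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)

/* mirrors the increased-range definitions */
#define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
#define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	long nice_0_weight = 1024;	/* sched_prio_to_weight[20] */

	/* the round-trip calibration required by the comment in sched.h */
	assert(user_to_kernel_load(nice_0_weight) == NICE_0_LOAD);
	assert(kernel_to_user_load(NICE_0_LOAD) == nice_0_weight);
	return 0;
}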

[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    8 ++++----
 kernel/sched/fair.c  |   18 ++++++++++++------
 kernel/sched/sched.h |   16 ++++++++--------
 3 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c82ca6e..349d776 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -698,12 +698,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (idle_policy(p->policy)) {
-		load->weight = scale_load(WEIGHT_IDLEPRIO);
+		load->weight = user_to_kernel_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
+	load->weight = user_to_kernel_load(sched_prio_to_weight[prio]);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -8184,7 +8184,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	return sched_group_set_shares(css_tg(css), user_to_kernel_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -8192,7 +8192,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (u64) kernel_to_user_load(tg->shares);
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4487c2a..200f752 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
 	if (likely(lw->inv_weight))
 		return;
 
-	w = scale_load_down(lw->weight);
+	w = kernel_to_user_load(lw->weight);
 
 	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 		lw->inv_weight = 1;
@@ -210,10 +210,14 @@ static void __update_inv_weight(struct load_weight *lw)
  *
  * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
  * weight/lw.weight <= 1, and therefore our shift will also be positive.
+ *
+ * Note: load.weight is converted back to the user load scale (i.e., NICE_0's
+ * load is 1024) rather than the possibly increased kernel load scale (i.e.,
+ * NICE_0's load is NICE_0_LOAD), for multiplication and division efficiency.
  */
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 {
-	u64 fact = scale_load_down(weight);
+	u64 fact = kernel_to_user_load(weight);
 	int shift = WMULT_SHIFT;
 
 	__update_inv_weight(lw);
@@ -7017,10 +7021,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
+		unsigned long min_cpu_load =
+			busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE;
 		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
-		if (load_above_capacity > scale_load(busiest->group_capacity))
-			load_above_capacity -=
-				scale_load(busiest->group_capacity);
+		if (load_above_capacity > min_cpu_load)
+			load_above_capacity -= min_cpu_load;
 		else
 			load_above_capacity = ~0UL;
 	}
@@ -8574,7 +8579,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, user_to_kernel_load(MIN_SHARES),
+		       user_to_kernel_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a3be6f..871da67 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,22 +55,22 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
+# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		(w)
-# define scale_load_down(w)	(w)
+# define user_to_kernel_load(w)	(w)
+# define kernel_to_user_load(w)	(w)
 #endif
 
 /*
  * Task weight (visible to user) and its load (invisible to user) have
  * independent resolution, but they should be well calibrated. We use
- * scale_load() and scale_load_down(w) to convert between them. The
- * following must be true:
- *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them. The following must be true:
  *
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 12/12] sched/fair: Enable increased scale for kernel load
  2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (10 preceding siblings ...)
  2016-05-03 20:02 ` [PATCH v3 11/12] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
@ 2016-05-03 20:02 ` Yuyang Du
  11 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-03 20:02 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

The increased scale, i.e., increased precision, for kernel load has been
disabled since commit e4c2fb0d5776 ("sched: Disable (revert) SCHED_LOAD_SCALE
increase"). But we do need it when we have task groups, especially on
bigger machines; otherwise we are likely to run out of precision when
distributing load.

So we reinstate it, and resolve to fix whatever power regression may show
up.
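
A toy illustration of the precision problem (this is a simplification, not how
the kernel actually distributes group shares; 15 is the nice +19 weight, e.g.
an autogroup reniced to +19):

#include <stdio.h>

int main(void)
{
	unsigned long shares = 15;	/* low-weight group, user scale */
	unsigned int ncpus = 8;

	/* per-CPU portion of the group weight, old vs. increased resolution */
	unsigned long lo_res = shares / ncpus;		/* 1: ratios mostly lost */
	unsigned long hi_res = (shares << 10) / ncpus;	/* 1920: ratios preserved */

	printf("10-bit: %lu, 20-bit: %lu\n", lo_res, hi_res);
	return 0;
}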

Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/sched.h |   51 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 871da67..5f66a2c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,37 +42,36 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
 #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
 /*
- * Increase resolution of nice-level calculations for 64-bit architectures.
- * The extra resolution improves shares distribution and load balancing of
- * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
- * hierarchies, especially on larger systems. This is not a user-visible change
- * and does not change the user-interface for setting shares/weights.
+ * Task weight (visible and set by user) and its load (invisible to user)
+ * can have independent ranges. We increase the scale of load for 64-bit
+ * architectures. The extra precision improves share distribution and
+ * load balancing of low-weight task groups (e.g., nice +19 on an autogroup),
+ * deeper taskgroup hierarchies, especially on larger systems. This is not
+ * a user-visible change and does not change the user-interface for setting
+ * shares/weights. We increase resolution only if we have enough bits to allow
+ * this increased precision (i.e., BITS_PER_LONG > 32). The costs for increasing
+ * resolution when BITS_PER_LONG <= 32 are pretty high and the returns do not
+ * justify the increased costs.
  *
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
- * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
- * increased costs.
- */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
-# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
-#else
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	(w)
-# define kernel_to_user_load(w)	(w)
-#endif
-
-/*
- * Task weight (visible to user) and its load (invisible to user) have
- * independent resolution, but they should be well calibrated. We use
- * user_to_kernel_load() and kernel_to_user_load(w) to convert between
- * them. The following must be true:
+ * Therefore, the user load and the kernel load should be well calibrated so
+ * that they can be easily converted. We use user_to_kernel_load() and
+ * kernel_to_user_load(w) to convert between them.
  *
+ * The following equations illustrate their relationship:
  * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
  * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
  */
-#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
+#if defined(CONFIG_64BIT) && defined(CONFIG_FAIR_GROUP_SCHED)
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
+#define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
+#else
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	(w)
+#define kernel_to_user_load(w)	(w)
+#endif
+
+#define NICE_0_LOAD		(1UL << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  2016-05-03 20:02 ` [PATCH v3 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
@ 2016-05-05  7:46   ` Ingo Molnar
  0 siblings, 0 replies; 21+ messages in thread
From: Ingo Molnar @ 2016-05-05  7:46 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, juri.lelli


* Yuyang Du <yuyang.du@intel.com> wrote:

> After cleaning up the sched metrics, these two definitions that cause
> ambiguity are not needed any more. Use NICE_0_LOAD_SHIFT and NICE_0_LOAD
> instead (the names suggest clearly who they are).
> 
> Suggested-by: Ben Segall <bsegall@google.com>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>

Yeah, so this patch was a bit of a trainwreck:

 - it didn't build on 32-bit kernels

 - a stale SCHED_LOAD_SHIFT definition was left around

 - the title and the changelog actively lie: it's not a removal, but a complex
   combination of a rename, a replace and a removal ...

I've fixed that all with the patch below, but _please_ be more careful in the 
future when changing scheduler code, and please also read your changelogs and 
patch titles before sending them out to make sure the label matches contents.

I'll push it all out in tip:sched/core if it passes testing.

Thanks,

	Ingo

=======================>
>From 172895e6b5216eba3e0880460829a8baeefd55f3 Mon Sep 17 00:00:00 2001
From: Yuyang Du <yuyang.du@intel.com>
Date: Tue, 5 Apr 2016 12:12:27 +0800
Subject: [PATCH] sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE

After cleaning up the sched metrics, there are two definitions that are
ambiguous and confusing: SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE.

Resolve this:

 - Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT, which better reflects what
   it is.

 - Replace SCHED_LOAD_SCALE use with SCHED_CAPACITY_SCALE and remove SCHED_LOAD_SCALE.

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-3-git-send-email-yuyang.du@intel.com
[ Rewrote the changelog and fixed the build on 32-bit kernels. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  |  4 ++--
 kernel/sched/sched.h | 22 +++++++++++-----------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76ca86e9fc20..e1485710d1ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -719,7 +719,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
-	long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -7010,7 +7010,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
+				      scale_load_down(NICE_0_LOAD);
 		if (load_above_capacity > busiest->group_capacity)
 			load_above_capacity -= busiest->group_capacity;
 		else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ad83361f9e67..d24e91b0a722 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -56,25 +56,25 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * increase coverage and consistency always enable it on 64bit platforms.
  */
 #ifdef CONFIG_64BIT
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
-
 /*
- * NICE_0's weight (visible to users) and its load (invisible to users) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them, and the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to users) and its load (invisible to users) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
  */
-#define NICE_0_LOAD		SCHED_LOAD_SCALE
-#define NICE_0_SHIFT		SCHED_LOAD_SHIFT
+#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
@@ -863,7 +863,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 struct sched_group_capacity {
 	atomic_t ref;
 	/*
-	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
 	unsigned int capacity;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit
  2016-05-03 20:02 ` [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit Yuyang Du
@ 2016-05-05 11:13   ` Morten Rasmussen
  2016-05-05 11:23     ` Vincent Guittot
  2016-05-05 18:19     ` Yuyang Du
  0 siblings, 2 replies; 21+ messages in thread
From: Morten Rasmussen @ 2016-05-05 11:13 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, linux-kernel, bsegall, pjt, vincent.guittot,
	dietmar.eggemann, juri.lelli

On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:
> In sched average update, a period is about 1ms, so a 32-bit unsigned
> integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)
> days.
> 
> For usual cases, 32bit is big enough and 64bit is needless. But if
> a task sleeps longer than it, there can be two outcomes:
> 
> Consider a task sleeps whatever m milliseconds, then n = (u32)m.

Maybe make it crystal clear that you mean: for any m > UINT_MAX?

> 
> 1. If n >= 32*64, then the task's sched avgs will be surely decayed
>    to 0. In this case, it really doesn't matter that the 32bit is not
>    big enough to hold m. In other words, a task sleeps 2 secs or sleeps
>    50 days are the same from sched average point of view.
> 
> 2. If m < 32*64, the chance to be here is very very low, but if so,

Should that be: n < 32*64 ?

Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had
a patch ready to post for that:

>From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001
From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Mon, 11 Apr 2016 15:41:37 +0100
Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()

In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX
when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy
with returning zero for n >= LOAD_AVG_MAX_N when decaying in
decay_load() as well instead of only returning zero for n >
LOAD_AVG_PERIOD * 63 (=2016).

As both conditions are unlikely() the impact is quite likely very small,
but at least it makes the rounding errors for load accumulation and
decay symmetrical.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b7d4b..42515b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2527,7 +2527,7 @@ static __always_inline u64 decay_load(u64 val, u64 n)
 
 	if (!n)
 		return val;
-	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+	else if (unlikely(n > LOAD_AVG_MAX_N))
 		return 0;
 
 	/* after bounds checking we can collapse to 32-bit */
-- 
1.9.1


>    the task's sched avgs MAY NOT be decayed to 0, depending on how
>    big its sums are, and the chance to 0 is still good as load_sum
>    is way less than ~0ULL and util_sum way less than ~0U.

I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.
Whether you get to zero depends on the sums (as you say) and the actual
value of 'n'. It is true that you might get to zero even if n <
LOAD_AVG_MAX_N if the sums are small.

> Nevertheless, what really maters is what happens in the worst-case
> scenario, which is when (u32)m = 0? So in that case, it would be like
> after so long a sleep, we treat the task as it never slept, and has the
> same sched averages as before it slept. That is actually not bad or
> nothing to worry about, and think of it as the same as how we treat
> a new born task.

There is a subtle but important difference between not decaying a task
correctly and adding a new task: The sleeping task is already accounted
for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution
to cfs_rq.avg has been decayed correctly in the mean time which means
that by not guaranteeing a decay of the se.avg at wake-up you introduce
a discrepancy between the task's own view of its contribution (se.avg)
and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task
is migrated at wake-up as we remove se.avg amount of contribution from
the previous cpu's cfs_rq.avg. In other words, you remove too much from
the cfs_rq.avg.

The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which
might be acceptable, but it is not a totally harmless change IMHO.
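
For reference, a minimal sketch of the truncation being discussed (user-space,
with a made-up sleep length):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* a sleep of exactly 2^32 ~1ms periods (~49 days) truncates to n = 0,
	 * i.e. the wakeup update behaves as if no time had passed at all */
	uint64_t m = 1ULL << 32;
	uint32_t n = (uint32_t)m;

	printf("m = %llu periods, (u32)m = %u\n", (unsigned long long)m, n);
	return 0;
}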

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit
  2016-05-05 11:13   ` Morten Rasmussen
@ 2016-05-05 11:23     ` Vincent Guittot
  2016-05-05 18:19     ` Yuyang Du
  1 sibling, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-05-05 11:23 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Yuyang Du, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Dietmar Eggemann, Juri Lelli

Hi Morten,

On 5 May 2016 at 13:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:
>> In sched average update, a period is about 1ms, so a 32-bit unsigned
>> integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)
>> days.
>>
>> For usual cases, 32bit is big enough and 64bit is needless. But if
>> a task sleeps longer than it, there can be two outcomes:
>>
>> Consider a task sleeps whatever m milliseconds, then n = (u32)m.
>
> Maybe it crystal clear that you mean: For any m > UINT_MAX?
>
>>
>> 1. If n >= 32*64, then the task's sched avgs will be surely decayed
>>    to 0. In this case, it really doesn't matter that the 32bit is not
>>    big enough to hold m. In other words, a task sleeps 2 secs or sleeps
>>    50 days are the same from sched average point of view.
>>
>> 2. If m < 32*64, the chance to be here is very very low, but if so,
>
> Should that be: n < 32*64 ?
>
> Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had
> a patch ready to post for that:
>
> From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> Date: Mon, 11 Apr 2016 15:41:37 +0100
> Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()
>
> In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX
> when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy
> with returning zero for n >= LOAD_AVG_MAX_N when decaying in
> decay_load() as well instead of only returning zero for n >
> LOAD_AVG_PERIOD * 63 (=2016).
>
> As both conditions are unlikely() the impact is quite likely very small,
> but at least it makes the rounding errors for load accumulation and
> decay symmetrical.
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 56b7d4b..42515b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2527,7 +2527,7 @@ static __always_inline u64 decay_load(u64 val, u64 n)
>
>         if (!n)
>                 return val;
> -       else if (unlikely(n > LOAD_AVG_PERIOD * 63))
> +       else if (unlikely(n > LOAD_AVG_MAX_N))

I had the same question about this 63, and IIUC the 63 comes from 64 bits - 1.
The load can be so high that all 64 bits are used. We shift right once every
LOAD_AVG_PERIOD, so we can be sure the value is 0 after shifting through all
64 bits
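
A simplified user-space sketch of that reasoning (it keeps only the halving per
full 32-period half-life and drops the y^(n%32) multiply that the kernel's
decay_load() also applies):

#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_PERIOD 32

/* halve once per full half-life; any 64bit value is gone after 64 halvings */
static uint64_t decay_halves(uint64_t val, uint64_t n)
{
	uint64_t halves = n / LOAD_AVG_PERIOD;

	return halves >= 64 ? 0 : val >> halves;
}

int main(void)
{
	/* prints 0: even the largest 64bit value is zero after 64 half-lives */
	printf("%llu\n",
	       (unsigned long long)decay_halves(UINT64_MAX, 64 * LOAD_AVG_PERIOD));
	return 0;
}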

>                 return 0;
>
>         /* after bounds checking we can collapse to 32-bit */
> --
> 1.9.1
>
>
>>    the task's sched avgs MAY NOT be decayed to 0, depending on how
>>    big its sums are, and the chance to 0 is still good as load_sum
>>    is way less than ~0ULL and util_sum way less than ~0U.
>
> I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.
> Whether you get to zero depends on the sums (as you say) and the actual
> value of 'n'. It is true that you might get to zero even if n <
> LOAD_AVG_MAX_N if the sums are small.
>
>> Nevertheless, what really maters is what happens in the worst-case
>> scenario, which is when (u32)m = 0? So in that case, it would be like
>> after so long a sleep, we treat the task as it never slept, and has the
>> same sched averages as before it slept. That is actually not bad or
>> nothing to worry about, and think of it as the same as how we treat
>> a new born task.
>
> There is subtle but important difference between not decaying a task
> correctly and adding new task: The sleeping task is already accounted
> for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution
> to cfs_rq.avg has been decayed correctly in the mean time which means
> that by not guaranteeing a decay of the se.avg at wake-up you introduce
> a discrepancy between the task's owen view of its contribution (se.avg)
> and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task
> is migrated at wake-up as we remove se.avg amount of contribution from
> the previous cpu's cfs_rq.avg. In other words, you remove too much from
> the cfs_rq.avg.
>
> The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which
> might be acceptable, but it is not a totally harmless change IMHO.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit
  2016-05-05 11:13   ` Morten Rasmussen
  2016-05-05 11:23     ` Vincent Guittot
@ 2016-05-05 18:19     ` Yuyang Du
  2016-05-10  9:10       ` Morten Rasmussen
  1 sibling, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-05-05 18:19 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, bsegall, pjt, vincent.guittot,
	dietmar.eggemann, juri.lelli

Hi Morten,

On Thu, May 05, 2016 at 12:13:10PM +0100, Morten Rasmussen wrote:
> On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:
> > In sched average update, a period is about 1ms, so a 32-bit unsigned
> > integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)
> > days.
> > 
> > For usual cases, 32bit is big enough and 64bit is needless. But if
> > a task sleeps longer than it, there can be two outcomes:
> > 
> > Consider a task sleeps whatever m milliseconds, then n = (u32)m.
> 
> Maybe it crystal clear that you mean: For any m > UINT_MAX?

Hmm, I literally meant whatever m.

> > 1. If n >= 32*64, then the task's sched avgs will be surely decayed
> >    to 0. In this case, it really doesn't matter that the 32bit is not
> >    big enough to hold m. In other words, a task sleeps 2 secs or sleeps
> >    50 days are the same from sched average point of view.
> > 
> > 2. If m < 32*64, the chance to be here is very very low, but if so,
> 
> Should that be: n < 32*64 ?
> 
> Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had
> a patch ready to post for that:
>
> >From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> Date: Mon, 11 Apr 2016 15:41:37 +0100
> Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()
> 
> In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX
> when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy
> with returning zero for n >= LOAD_AVG_MAX_N when decaying in
> decay_load() as well instead of only returning zero for n >
> LOAD_AVG_PERIOD * 63 (=2016).

So basically, you want to add another rule in addition to the exponential
decay rule.
 
> >    the task's sched avgs MAY NOT be decayed to 0, depending on how
> >    big its sums are, and the chance to 0 is still good as load_sum
> >    is way less than ~0ULL and util_sum way less than ~0U.
> 
> I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.
> Whether you get to zero depends on the sums (as you say) and the actual
> value of 'n'. It is true that you might get to zero even if n <
> LOAD_AVG_MAX_N if the sums are small.
 
Frankly, until Ben brought it up, I didn't think a task sleeping that long
was even possible. But I do admit it may happen.

However, I will say this. A task sleeping so long is already very rare,
and among all those long sleep cases, the chance that after waking up the
avgs will not be decayed to zero is much less than 0.5 in a million
(=32*64/2^32=1/2^21), assuming the sleeping time is uniformly distributed.
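
(For reference, a tiny user-space check of that probability, under the stated
uniform-distribution assumption:)

#include <stdio.h>

int main(void)
{
	/* chance that a >49-day sleep truncates to fewer than 32*64 periods */
	double p = (32.0 * 64.0) / 4294967296.0;	/* 2048 / 2^32 */

	printf("%.9f, i.e. about 1 in 2^21\n", p);	/* ~0.000000477 */
	return 0;
}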

> > Nevertheless, what really maters is what happens in the worst-case
> > scenario, which is when (u32)m = 0? So in that case, it would be like
> > after so long a sleep, we treat the task as it never slept, and has the
> > same sched averages as before it slept. That is actually not bad or
> > nothing to worry about, and think of it as the same as how we treat
> > a new born task.
> 
> There is subtle but important difference between not decaying a task
> correctly and adding new task: The sleeping task is already accounted
> for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution
> to cfs_rq.avg has been decayed correctly in the mean time which means
> that by not guaranteeing a decay of the se.avg at wake-up you introduce
> a discrepancy between the task's owen view of its contribution (se.avg)
> and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task
> is migrated at wake-up as we remove se.avg amount of contribution from
> the previous cpu's cfs_rq.avg. In other words, you remove too much from
> the cfs_rq.avg.
> 
> The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which
> might be acceptable, but it is not a totally harmless change IMHO.

That was just an explanation. :) Never mind; I really don't think it is a
big deal.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit
  2016-05-10  9:10       ` Morten Rasmussen
@ 2016-05-10  2:05         ` Yuyang Du
  0 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-10  2:05 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, bsegall, pjt, vincent.guittot,
	dietmar.eggemann, juri.lelli

On Tue, May 10, 2016 at 10:10:21AM +0100, Morten Rasmussen wrote:

> > > > 2. If m < 32*64, the chance to be here is very very low, but if so,
> > > 
> > > Should that be: n < 32*64 ?

Sorry, I overlooked this comment. Yes, it should be n < 32*64.

> > > 
> > > Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had
> > > a patch ready to post for that:
> > >
> > > >From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001
> > > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > > Date: Mon, 11 Apr 2016 15:41:37 +0100
> > > Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()
> > > 
> > > In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX
> > > when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy
> > > with returning zero for n >= LOAD_AVG_MAX_N when decaying in
> > > decay_load() as well instead of only returning zero for n >
> > > LOAD_AVG_PERIOD * 63 (=2016).
> > 
> > So basically, you want to add another rule in addition to the exponential
> > decay rule.
> 
> No, not at all. I want to make the 'rules' symmetrical for accumulation
> and decay exactly like the patch does.

"Make the rule xxx" != change rule or add rule?
 
> > > >    the task's sched avgs MAY NOT be decayed to 0, depending on how
> > > >    big its sums are, and the chance to 0 is still good as load_sum
> > > >    is way less than ~0ULL and util_sum way less than ~0U.
> > > 
> > > I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.
> > > Whether you get to zero depends on the sums (as you say) and the actual
> > > value of 'n'. It is true that you might get to zero even if n <
> > > LOAD_AVG_MAX_N if the sums are small.
> >  
> > Frankly, util Ben brought it up, I didn't think a task sleeping so long
> > is even possible. But I do admit it may happen.
> > 
> > However, I will say this. A task sleeping so long is already very rare,
> > and among all those long sleep cases, the chance that after waking up the
> > avgs will not be decayed to zero is much less than 0.5 in a million
> > (=32*64/2^32=1/2^21), assuming the sleeping time is uniformly distributed.
> 
> You can't just ignore cases because they have a low probability. Going
> by that logic we could remove a lot of synchronization overhead in the
> kernel.
> 
> My concern is whether we introduce any assumptions that might hit us
> later when everybody has forgotten about them. This one would be
> extremely hard to debug later.

_NO_, that was just saying the chance is very low. I had no intent to say,
nor did I say, that a low chance doesn't matter. That is why I wrote the
following paragraph.
 
> > > > Nevertheless, what really maters is what happens in the worst-case
> > > > scenario, which is when (u32)m = 0? So in that case, it would be like
> > > > after so long a sleep, we treat the task as it never slept, and has the
> > > > same sched averages as before it slept. That is actually not bad or
> > > > nothing to worry about, and think of it as the same as how we treat
> > > > a new born task.
> > > 
> > > There is subtle but important difference between not decaying a task
> > > correctly and adding new task: The sleeping task is already accounted
> > > for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution
> > > to cfs_rq.avg has been decayed correctly in the mean time which means
> > > that by not guaranteeing a decay of the se.avg at wake-up you introduce
> > > a discrepancy between the task's owen view of its contribution (se.avg)
> > > and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task
> > > is migrated at wake-up as we remove se.avg amount of contribution from
> > > the previous cpu's cfs_rq.avg. In other words, you remove too much from
> > > the cfs_rq.avg.
> > > 
> > > The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which
> > > might be acceptable, but it is not a totally harmless change IMHO.
> > 
> > That is just an explanation, :) nevermind, or I really don't think that is
> > a deal.
> 
> I don't really follow. I analysed the implications of the overflow that
> you are introducing. In my opinion this is what you should have done
> before proposing this patch. I think it is essential to understand what
> assumptions we make and introducing new ones should be carefully
> considered. I think it is a big 'deal' if you are not more careful when you
> are submitting patches and just ignore feedback. We spent a lot of time
> reviewing them.

So I agree "it is not a totally harmless change". But what is the deal/impact
of the harm? The harm in the worse case scenario will not hurt anything, IMHO,
and just an opinion.

Thank you very much for the review. Really appreciate it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()
  2016-05-10  8:46   ` Morten Rasmussen
@ 2016-05-10  2:27     ` Yuyang Du
  0 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-05-10  2:27 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, bsegall, pjt, vincent.guittot,
	dietmar.eggemann, juri.lelli

On Tue, May 10, 2016 at 09:46:03AM +0100, Morten Rasmussen wrote:
> On Wed, May 04, 2016 at 04:02:46AM +0800, Yuyang Du wrote:
> > __update_sched_avg() has these steps:
> > 
> >   1. add the remainder of the last incomplete period
> >   2. decay old sum
> >   3. accumulate new sum in full periods since last_update_time
> >   4. add the current incomplete period
> >   5. update averages
> > 
> > Previously, we separately computed steps 1, 3, and 4, leading to
> > each one of them ugly in codes and costly in overhead.
> > 
> > For example:
> > 
> >              c1          c3           c4
> >              ^           ^            ^
> >              |           |            |
> >            |<->|<----------------->|<--->|
> >    ... |---x---|------| ... |------|-----x (now)
> > 
> > c1, c3, and c4 are the accumulated (meanwhile decayed) contributions
> > in timing aspect of steps 1, 3, and 4 respectively.
> > 
> > Then the accumulated contribution to load_sum, for example, is:
> > 
> > contrib = c1 * weight * freq_scaled;
> > contrib += c3 * weight * freq_scaled;
> > contrib += c4 * weight * freq_scaled;
> > 
> > Obviously, we can optimize the above as:
> > 
> > contrib = c1 + c3 + c4;
> > contrib *= weight * freq_scaled;
> 
> This isn't obvious to me. After spending quite a bit of time figuring
> what your code actually does, there is more to it than what you describe
> here. As your label names suggest, you don't consider what happens in
> step 2 where contrib is decayed. When and how the individual bits are
> decayed is essential information.
> 
> Your patch partially moves step 2 (on your list above) before step 1.
> So it becomes:
> 
> a. decay old sum
> b. compute the contribution up to the first 1ms boundary (c1)
> c. decay c1 to get c1'
> d. accumulate the full periods (c3) adding them to c1'
> e. add remaining time up until now (c4) to contrib (c3+c1').
> f. scale by weight and arch_scale_{freq,cpu}_capacity() functions and
>    add to sum.
> 
> The basic difference in the computation is that you consolidate the
> scaling into one operation instead of three at the cost of decaying
> twice instead of once. The net result is saving a few multiplications.
> 
> I would recommend that this is made very clear. Deriving the math from
> the code is daunting task for most people.
 
Agreed.

> An alternative optimization strategy would be to leave c1 as it is and
> thereby avoid to decay twice, and only add up c3 and c4 before scaling
> reducing the number of scaling operations from three to two.

I did struggle a lot about this, balancing code simplicity and overhead.
So I favored simplicity, and believe (in my opinion) it is still an
overhead win.
 
> > 
> > After we combine them together, we will have much cleaner codes
> > and less CPU cycles.
> 
> Could you share any numbers backing up that claim?
> 
> I did a couple of runs on arm64 (Juno):
> 
> perf bench sched messaging -g 50 -l 2500' (10 runs):
> 
> tip:	65.545996712 seconds time elapsed	( +-  0.22% )
> patch:	65.209931026 seconds time elapsed	( +-  0.23% )
> 
> Taking the error margins into account the difference there is still an
> improvement, but it is about as small as it can be without getting lost
> in the noise. Is the picture any better on x86?
>
> Whether the code is cleaner is a subjective opinion. The diffstat below
> doesn't really show any benefit, but I think you have slightly more
> comments so I would not judge based on that.
> 
> When it comes to code structure, the new __update_sched_avg() is a lot
> simpler than __update_load_avg(), but that is only due to the fact that
> most of the content has been move to accumulate_sum() and
> __accumulate_sum(). Where we before had all code in a single function
> with fitted on screen in one go, you know have to consider three
> functions and how they work together to figure out what is really going
> on.
> 
> I haven't found any bugs in the code, but IMHO I don't really see the
> point in rewriting the code completely. The result isn't significantly
> simpler than what we have and generates code churn affecting everyone
> working with this code. I think we can improve the existing code more by
> just factoring out the capacity/weight scaling, which would be much less
> intrusive.
> 
> Maybe I'm missing something, but I don't see a strong argument for this
> refactoring.

Do you have a criterion for how much a patch must improve performance or the
code to be worth merging?

Is it that you want the improvement to be significant, or something else?
 
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> > ---
> >  kernel/sched/fair.c |  189 ++++++++++++++++++++++++++-------------------------
> >  1 file changed, 95 insertions(+), 94 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a060ef2..495e5f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
> >   */
> >  #define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
> >  #define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
> > -#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
> > +#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */
> 
> Why +2? I doesn't seem related to anything in this patch or explained
> anywhere.
 
I really should have had a separate patch to explain this. Sorry.

I provided a new program to compute SCHED_AVG_MAX_N (a.k.a. LOAD_AVG_MAX_N);
this is the only difference from Paul's numbers, and it definitely deserves a
thorough discussion. I will do that in the next version.

> >  
> >  /* Give new sched_entity start runnable values to heavy its load in infant time */
> >  void init_entity_runnable_average(struct sched_entity *se)
> > @@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
> >  
> >  /*
> >   * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> > - * lower integers.
> > + * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
> >   */
> >  static const u32 __accumulated_sum_N32[] = {
> >  	    0, 23371, 35056, 40899, 43820, 45281,
> > @@ -2616,8 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
> >  /*
> >   * val * y^n, where y^m ~= 0.5
> >   *
> > - * n is the number of periods past; a period is ~1ms
> > + * n is the number of periods past. A period is ~1ms, so a 32bit
> > + * integer can hold approximately a maximum of 49 (=2^32/1000/3600/24) days.
> > + *
> >   * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
> > + *
> >   */
> >  static __always_inline u64 __decay_sum(u64 val, u32 n)
> >  {
> 
> The above two hunks seem to belong to some of the previous patches in
> this patch set?

Duh, yes...
 
> > @@ -2649,20 +2652,30 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
> >   * We can compute this efficiently by combining:
> >   * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
> >   */
> > -static __always_inline u32 __accumulate_sum(u32 n)
> > +static __always_inline u32
> > +__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
> >  {
> > -	u32 contrib = 0;
> > +	u32 contrib;
> > +
> > +	if (!periods)
> > +		return remainder - period_contrib;
> >  
> > -	if (likely(n <= SCHED_AVG_HALFLIFE))
> > -		return __accumulated_sum_N[n];
> > -	else if (unlikely(n >= SCHED_AVG_MAX_N))
> > +	if (unlikely(periods >= SCHED_AVG_MAX_N))
> >  		return SCHED_AVG_MAX;
> >  
> > -	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
> > -	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
> > -	n %= SCHED_AVG_HALFLIFE;
> > -	contrib = __decay_sum(contrib, n);
> > -	return contrib + __accumulated_sum_N[n];
> > +	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
> > +
> > +	periods -= 1;
> > +	if (likely(periods <= SCHED_AVG_HALFLIFE))
> > +		contrib = __accumulated_sum_N[periods];
> > +	else {
> > +		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
> > +		periods %= SCHED_AVG_HALFLIFE;
> > +		contrib = __decay_sum(contrib, periods);
> > +		contrib += __accumulated_sum_N[periods];
> > +	}
> > +
> > +	return contrib + remainder;
> >  }
> >  
> >  #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> > @@ -2671,6 +2684,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >  
> >  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> >  
> > +static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
> > +	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
> 
> The unit and meaning of 'delta' confused me a lot when reviewing this
> patch. Here it is the time delta since last update in [us] (duration of
> c1+c3+c4).
> 
> > +{
> > +	u32 contrib, periods;
> > +	unsigned long scale_freq, scale_cpu;
> > +
> > +	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +
> > +	delta += sa->period_contrib;
> 
> Here it is extended to include time previous 1ms boundary.
> 
> > +	periods = delta >> 10; /* A period is 1024us (~1ms) */
> > +
> > +	/*
> > +	 * Accumulating *_sum has two steps.
> > +	 *
> > +	 * Step 1: decay old *_sum if we crossed period boundaries.
> > +	 */
> > +	if (periods) {
> > +		sa->load_sum = __decay_sum(sa->load_sum, periods);
> > +		if (cfs_rq) {
> > +			cfs_rq->runnable_load_sum =
> > +				__decay_sum(cfs_rq->runnable_load_sum, periods);
> > +		}
> > +		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
> > +	}
> > +
> > +	/*
> > +	 * Step 2: accumulate new *_sum since last_update_time. This at most has
> > +	 * three parts (at least one part): (1) remainder of incomplete last
> > +	 * period, (2) full periods since (1), and (3) incomplete current period.
> > +	 *
> > +	 * Fortunately, we can (and should) do all these three at once.
> > +	 */
> > +	delta %= 1024;
> 
> Now 'delta' is any leftover contribution to the current 1ms period
> (duration of c4).
> 
> > +	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> > +	sa->period_contrib = delta;
> > +
> > +	contrib = cap_scale(contrib, scale_freq);
> > +	if (weight) {
> > +		sa->load_sum += weight * contrib;
> > +		if (cfs_rq)
> > +			cfs_rq->runnable_load_sum += weight * contrib;
> > +	}
> > +	if (running)
> > +		sa->util_sum += contrib * scale_cpu;
> > +
> > +	return periods;
> > +}
> > +
> >  /*
> >   * We can represent the historical contribution to sched average as the
> >   * coefficients of a geometric series.  To do this we divide the history
> > @@ -2701,12 +2763,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >   */
> >  static __always_inline int
> >  __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
> > -		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > +		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> >  {
> > -	u64 delta, scaled_delta;
> > -	u32 contrib, periods;
> > -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> > -	unsigned long scale_freq, scale_cpu;
> > +	u64 delta;
> >  
> >  	delta = now - sa->last_update_time;
> 
> 'delta' is the true delta, but in [ns] here.
> 
> I would suggest clearly specifying what each parameter is and its unit
> for each of the helper functions to help people a bit.

Agreed, and I reused the variable.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()
  2016-05-03 20:02 ` [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
@ 2016-05-10  8:46   ` Morten Rasmussen
  2016-05-10  2:27     ` Yuyang Du
  0 siblings, 1 reply; 21+ messages in thread
From: Morten Rasmussen @ 2016-05-10  8:46 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, linux-kernel, bsegall, pjt, vincent.guittot,
	dietmar.eggemann, juri.lelli

On Wed, May 04, 2016 at 04:02:46AM +0800, Yuyang Du wrote:
> __update_sched_avg() has these steps:
> 
>   1. add the remainder of the last incomplete period
>   2. decay old sum
>   3. accumulate new sum in full periods since last_update_time
>   4. add the current incomplete period
>   5. update averages
> 
> Previously, we computed steps 1, 3, and 4 separately, which made
> each of them ugly in code and costly in overhead.
> 
> For example:
> 
>              c1          c3           c4
>              ^           ^            ^
>              |           |            |
>            |<->|<----------------->|<--->|
>    ... |---x---|------| ... |------|-----x (now)
> 
> c1, c3, and c4 are the accumulated (and meanwhile decayed) time
> contributions of steps 1, 3, and 4 respectively.
> 
> Then the accumulated contribution to load_sum, for example, is:
> 
> contrib = c1 * weight * freq_scaled;
> contrib += c3 * weight * freq_scaled;
> contrib += c4 * weight * freq_scaled;
> 
> Obviously, we can optimize the above as:
> 
> contrib = c1 + c3 + c4;
> contrib *= weight * freq_scaled;

This isn't obvious to me. After spending quite a bit of time figuring
out what your code actually does, there is more to it than what you describe
here. As your label names suggest, you don't consider what happens in
step 2 where contrib is decayed. When and how the individual bits are
decayed is essential information.

Your patch partially moves step 2 (on your list above) before step 1.
So it becomes:

a. decay old sum
b. compute the contribution up to the first 1ms boundary (c1)
c. decay c1 to get c1'
d. accumulate the full periods (c3) adding them to c1'
e. add remaining time up until now (c4) to contrib (c3+c1').
f. scale by weight and arch_scale_{freq,cpu}_capacity() functions and
   add to sum.

The basic difference in the computation is that you consolidate the
scaling into one operation instead of three at the cost of decaying
twice instead of once. The net result is saving a few multiplications.

I would recommend that this is made very clear. Deriving the math from
the code is a daunting task for most people.

An alternative optimization strategy would be to leave c1 as it is,
thereby avoiding decaying twice, and only add up c3 and c4 before
scaling, reducing the number of scaling operations from three to two.
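
Roughly something like this, reusing the pre-patch structure and names
(just a sketch to illustrate the idea, not a tested patch; util_sum and
runnable_load_sum are handled the same way and elided here):

	delta_w = 1024 - sa->period_contrib;			/* c1 */
	scaled_delta_w = cap_scale(delta_w, scale_freq);	/* scaling op 1 */
	sa->load_sum += weight * scaled_delta_w;

	delta -= delta_w;
	periods = delta / 1024;
	delta %= 1024;						/* c4 */

	/* c1 is decayed here together with the old sum, exactly as today. */
	sa->load_sum = __decay_sum(sa->load_sum, periods + 1);

	/* c3 and c4 are added up first and then scaled in one go. */
	contrib = __accumulate_sum(periods) + delta;		/* c3 + c4 */
	contrib = cap_scale(contrib, scale_freq);		/* scaling op 2 */
	sa->load_sum += weight * contrib;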

> 
> After we combine them, we will have much cleaner code
> and fewer CPU cycles.

Could you share any numbers backing up that claim?

I did a couple of runs on arm64 (Juno):

perf bench sched messaging -g 50 -l 2500' (10 runs):

tip:	65.545996712 seconds time elapsed	( +-  0.22% )
patch:	65.209931026 seconds time elapsed	( +-  0.23% )

Taking the error margins into account, the difference is still an
improvement, but it is about as small as it can be without getting lost
in the noise. Is the picture any better on x86?

Whether the code is cleaner is a subjective opinion. The diffstat below
doesn't really show any benefit, but I think you have slightly more
comments, so I would not judge based on that.

When it comes to code structure, the new __update_sched_avg() is a lot
simpler than __update_load_avg(), but that is only because
most of the content has been moved to accumulate_sum() and
__accumulate_sum(). Where before we had all the code in a single function
that fitted on screen in one go, you now have to consider three
functions and how they work together to figure out what is really going
on.

I haven't found any bugs in the code, but IMHO I don't really see the
point in rewriting the code completely. The result isn't significantly
simpler than what we have and generates code churn affecting everyone
working with this code. I think we can improve the existing code more by
just factoring out the capacity/weight scaling, which would be much less
intrusive.
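
For instance, the three scale-and-accumulate sites in the existing
__update_load_avg() could share a small helper along these lines
(hypothetical name, sketch only):

	static __always_inline void
	accumulate_scaled(struct sched_avg *sa, struct cfs_rq *cfs_rq,
			  u32 contrib, unsigned long weight, int running,
			  unsigned long scale_freq, unsigned long scale_cpu)
	{
		/* Scale the raw time contribution once, then account it. */
		contrib = cap_scale(contrib, scale_freq);
		if (weight) {
			sa->load_sum += weight * contrib;
			if (cfs_rq)
				cfs_rq->runnable_load_sum += weight * contrib;
		}
		if (running)
			sa->util_sum += contrib * scale_cpu;
	}

That is, call it once each for c1, c3, and c4 instead of open-coding the
scaling three times.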

Maybe I'm missing something, but I don't see a strong argument for this
refactoring.

> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  kernel/sched/fair.c |  189 ++++++++++++++++++++++++++-------------------------
>  1 file changed, 95 insertions(+), 94 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a060ef2..495e5f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
>   */
>  #define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
>  #define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
> -#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
> +#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */

Why +2? It doesn't seem related to anything in this patch, nor is it
explained anywhere.

>  
>  /* Give new sched_entity start runnable values to heavy its load in infant time */
>  void init_entity_runnable_average(struct sched_entity *se)
> @@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
>  
>  /*
>   * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> - * lower integers.
> + * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
>   */
>  static const u32 __accumulated_sum_N32[] = {
>  	    0, 23371, 35056, 40899, 43820, 45281,
> @@ -2616,8 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
>  /*
>   * val * y^n, where y^m ~= 0.5
>   *
> - * n is the number of periods past; a period is ~1ms
> + * n is the number of periods past. A period is ~1ms, so a 32bit
> + * integer can hold approximately a maximum of 49 (=2^32/1000/3600/24) days.
> + *
>   * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
> + *
>   */
>  static __always_inline u64 __decay_sum(u64 val, u32 n)
>  {

The above two hunks seem to belong to some of the previous patches in
this patch set?

> @@ -2649,20 +2652,30 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
>   * We can compute this efficiently by combining:
>   * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
>   */
> -static __always_inline u32 __accumulate_sum(u32 n)
> +static __always_inline u32
> +__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
>  {
> -	u32 contrib = 0;
> +	u32 contrib;
> +
> +	if (!periods)
> +		return remainder - period_contrib;
>  
> -	if (likely(n <= SCHED_AVG_HALFLIFE))
> -		return __accumulated_sum_N[n];
> -	else if (unlikely(n >= SCHED_AVG_MAX_N))
> +	if (unlikely(periods >= SCHED_AVG_MAX_N))
>  		return SCHED_AVG_MAX;
>  
> -	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
> -	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
> -	n %= SCHED_AVG_HALFLIFE;
> -	contrib = __decay_sum(contrib, n);
> -	return contrib + __accumulated_sum_N[n];
> +	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
> +
> +	periods -= 1;
> +	if (likely(periods <= SCHED_AVG_HALFLIFE))
> +		contrib = __accumulated_sum_N[periods];
> +	else {
> +		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
> +		periods %= SCHED_AVG_HALFLIFE;
> +		contrib = __decay_sum(contrib, periods);
> +		contrib += __accumulated_sum_N[periods];
> +	}
> +
> +	return contrib + remainder;
>  }
>  
>  #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> @@ -2671,6 +2684,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
>  
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>  
> +static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
> +	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)

The unit and meaning of 'delta' confused me a lot when reviewing this
patch. Here it is the time delta since last update in [us] (duration of
c1+c3+c4).

> +{
> +	u32 contrib, periods;
> +	unsigned long scale_freq, scale_cpu;
> +
> +	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> +	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +
> +	delta += sa->period_contrib;

> Here it is extended to include the time since the previous 1ms boundary.

> +	periods = delta >> 10; /* A period is 1024us (~1ms) */
> +
> +	/*
> +	 * Accumulating *_sum has two steps.
> +	 *
> +	 * Step 1: decay old *_sum if we crossed period boundaries.
> +	 */
> +	if (periods) {
> +		sa->load_sum = __decay_sum(sa->load_sum, periods);
> +		if (cfs_rq) {
> +			cfs_rq->runnable_load_sum =
> +				__decay_sum(cfs_rq->runnable_load_sum, periods);
> +		}
> +		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
> +	}
> +
> +	/*
> +	 * Step 2: accumulate new *_sum since last_update_time. This at most has
> +	 * three parts (at least one part): (1) remainder of incomplete last
> +	 * period, (2) full periods since (1), and (3) incomplete current period.
> +	 *
> +	 * Fortunately, we can (and should) do all these three at once.
> +	 */
> +	delta %= 1024;

Now 'delta' is any leftover contribution to the current 1ms period
(duration of c4).

> +	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> +	sa->period_contrib = delta;
> +
> +	contrib = cap_scale(contrib, scale_freq);
> +	if (weight) {
> +		sa->load_sum += weight * contrib;
> +		if (cfs_rq)
> +			cfs_rq->runnable_load_sum += weight * contrib;
> +	}
> +	if (running)
> +		sa->util_sum += contrib * scale_cpu;
> +
> +	return periods;
> +}
> +
>  /*
>   * We can represent the historical contribution to sched average as the
>   * coefficients of a geometric series.  To do this we divide the history
> @@ -2701,12 +2763,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
>   */
>  static __always_inline int
>  __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
> -		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> +		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -	u64 delta, scaled_delta;
> -	u32 contrib, periods;
> -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> -	unsigned long scale_freq, scale_cpu;
> +	u64 delta;
>  
>  	delta = now - sa->last_update_time;

> 'delta' is the true delta, but in [ns] here.

I would suggest clearly specifying what each parameter is and its unit
for each of the helper functions to help people a bit.
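
For accumulate_sum(), for example, something along these lines (just a
sketch of the kind of comment I mean, not actual patch content):

	/*
	 * accumulate_sum() - accumulate contributions since last_update_time
	 * @delta:   time since the last update, in [us] (spans c1 + c3 + c4)
	 * @sa:      the sched_avg to update
	 * @cfs_rq:  cfs_rq to also account runnable load for, or NULL
	 * @cpu:     cpu whose arch_scale_{freq,cpu}_capacity() is used
	 * @weight:  load weight to accumulate, 0 means no load contribution
	 * @running: non-zero if the entity was running (accumulates util_sum)
	 *
	 * Returns the number of full 1024us periods crossed.
	 */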

Morten

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit
  2016-05-05 18:19     ` Yuyang Du
@ 2016-05-10  9:10       ` Morten Rasmussen
  2016-05-10  2:05         ` Yuyang Du
  0 siblings, 1 reply; 21+ messages in thread
From: Morten Rasmussen @ 2016-05-10  9:10 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, linux-kernel, bsegall, pjt, vincent.guittot,
	dietmar.eggemann, juri.lelli

On Fri, May 06, 2016 at 02:19:32AM +0800, Yuyang Du wrote:
> Hi Morten,
> 
> On Thu, May 05, 2016 at 12:13:10PM +0100, Morten Rasmussen wrote:
> > On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:
> > > In sched average update, a period is about 1ms, so a 32-bit unsigned
> > > integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)
> > > days.
> > > 
> > > For usual cases, 32bit is big enough and 64bit is needless. But if
> > > a task sleeps longer than it, there can be two outcomes:
> > > 
> > > Consider a task sleeps whatever m milliseconds, then n = (u32)m.
> > 
> > Maybe make it crystal clear that you mean: for any m > UINT_MAX?
> 
> En, literally whatever.

Your second case below is not correct for m < UINT_MAX. You are only
talking about cases where n != m, which means m > UINT_MAX.
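
For example (made-up numbers, just to make the truncation concrete):

	u64 m = (1ULL << 32) + 1000;	/* ~49 days plus ~1 second of periods */
	u32 n = (u32)m;			/* truncates to 1000, so n != m */

	/*
	 * n = 1000 < 32*64, so the avgs are decayed by only 1000 periods
	 * instead of being zeroed. For any m <= UINT_MAX, n == m and the
	 * truncation changes nothing.
	 */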

> 
> > > 1. If n >= 32*64, then the task's sched avgs will surely be decayed
> > >    to 0. In this case, it really doesn't matter that the 32bit is not
> > >    big enough to hold m. In other words, a task sleeping 2 secs or
> > >    50 days looks the same from the sched average point of view.
> > > 
> > > 2. If m < 32*64, the chance to be here is very very low, but if so,
> > 
> > Should that be: n < 32*64 ?
> > 
> > Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had
> > a patch ready to post for that:
> >
> > From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > Date: Mon, 11 Apr 2016 15:41:37 +0100
> > Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()
> > 
> > In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX
> > when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy
> > with returning zero for n >= LOAD_AVG_MAX_N when decaying in
> > decay_load() as well instead of only returning zero for n >
> > LOAD_AVG_PERIOD * 63 (=2016).
> 
> So basically, you want to add another rule in addition to the exponential
> decay rule.

No, not at all. I want to make the 'rules' symmetrical for accumulation
and decay exactly like the patch does.
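
That is, the early-out in decay_load() uses the same bound as the
accumulation side; sketched (not the actual diff):

	if (!n)
		return val;
	else if (unlikely(n >= LOAD_AVG_MAX_N))	/* was: n > LOAD_AVG_PERIOD * 63 */
		return 0;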

>  
> > >    the task's sched avgs MAY NOT be decayed to 0, depending on how
> > >    big its sums are, and the chance to 0 is still good as load_sum
> > >    is way less than ~0ULL and util_sum way less than ~0U.
> > 
> > I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.
> > Whether you get to zero depends on the sums (as you say) and the actual
> > value of 'n'. It is true that you might get to zero even if n <
> > LOAD_AVG_MAX_N if the sums are small.
>  
> Frankly, until Ben brought it up, I didn't think a task sleeping so long
> was even possible. But I do admit it may happen.
> 
> However, I will say this. A task sleeping so long is already very rare,
> and among all those long sleep cases, the chance that after waking up the
> avgs will not be decayed to zero is much less than 0.5 in a million
> (=32*64/2^32=1/2^21), assuming the sleeping time is uniformly distributed.

You can't just ignore cases because they have a low probability. Going
by that logic we could remove a lot of synchronization overhead in the
kernel.

My concern is whether we introduce any assumptions that might hit us
later when everybody has forgotten about them. This one would be
extremely hard to debug later.

> > > Nevertheless, what really matters is what happens in the worst-case
> > > scenario, which is when (u32)m = 0? So in that case, it would be like
> > > after so long a sleep, we treat the task as if it never slept, and has the
> > > same sched averages as before it slept. That is actually not bad or
> > > nothing to worry about, and think of it as the same as how we treat
> > > a newborn task.
> > 
> > There is a subtle but important difference between not decaying a task
> > correctly and adding a new task: The sleeping task is already accounted
> > for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution
> > to cfs_rq.avg has been decayed correctly in the mean time which means
> > that by not guaranteeing a decay of the se.avg at wake-up you introduce
> > a discrepancy between the task's owen view of its contribution (se.avg)
> > and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task
> > is migrated at wake-up as we remove se.avg amount of contribution from
> > the previous cpu's cfs_rq.avg. In other words, you remove too much from
> > the cfs_rq.avg.
> > 
> > The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which
> > might be acceptable, but it is not a totally harmless change IMHO.
> 
> That is just an explanation, :) never mind, I really don't think that is
> a big deal.

I don't really follow. I analysed the implications of the overflow that
you are introducing. In my opinion this is what you should have done
before proposing this patch. I think it is essential to understand what
assumptions we make and introducing new ones should be carefully
considered. I think it is a big 'deal' if you are not more careful when you
are submitting patches and just ignore feedback. We spent a lot of time
reviewing them.

Morten

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-05-10 10:09 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
2016-05-03 20:02 ` [PATCH v3 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
2016-05-03 20:02 ` [PATCH v3 02/12] sched/fair: Rename variable names for sched averages Yuyang Du
2016-05-03 20:02 ` [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit Yuyang Du
2016-05-05 11:13   ` Morten Rasmussen
2016-05-05 11:23     ` Vincent Guittot
2016-05-05 18:19     ` Yuyang Du
2016-05-10  9:10       ` Morten Rasmussen
2016-05-10  2:05         ` Yuyang Du
2016-05-03 20:02 ` [PATCH v3 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
2016-05-03 20:02 ` [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
2016-05-10  8:46   ` Morten Rasmussen
2016-05-10  2:27     ` Yuyang Du
2016-05-03 20:02 ` [PATCH v3 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
2016-05-03 20:02 ` [PATCH v3 07/12] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
2016-05-03 20:02 ` [PATCH v3 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
2016-05-05  7:46   ` [PATCH v4] " Ingo Molnar
2016-05-03 20:02 ` [PATCH v3 09/12] sched/fair: Add introduction to the sched average metrics Yuyang Du
2016-05-03 20:02 ` [PATCH v3 10/12] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
2016-05-03 20:02 ` [PATCH v3 11/12] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
2016-05-03 20:02 ` [PATCH v3 12/12] sched/fair: Enable increased scale for kernel load Yuyang Du
