linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages
@ 2016-05-02 21:54 Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
                   ` (12 more replies)
  0 siblings, 13 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Hi Peter,

This patch series combines the previous cleanup and optimization
series. As you and Ingo suggested, the increased kernel load scale is
reinstated when both 64BIT and FAIR_GROUP_SCHED are enabled. In
addition, the changes include Vincent's fix, typo fixes, and reworded
changelogs and comments.

Thanks,
Yuyang

Yuyang Du (12):
  sched/fair: Optimize sum computation with a lookup table
  sched/fair: Rename variable names for sched averages
  sched/fair: Change the variable to hold the number of periods to
    32bit integer
  sched/fair: Add __always_inline compiler attribute to
    __accumulate_sum()
  sched/fair: Optimize __update_sched_avg()
  documentation: Add scheduler/sched-avg.txt
  sched/fair: Generalize the load/util averages resolution definition
  sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  sched/fair: Add introduction to the sched average metrics
  sched/fair: Remove scale_load_down() for load_avg
  sched/fair: Rename scale_load() and scale_load_down()
  sched/fair: Enable increased scale for kernel load

 Documentation/scheduler/sched-avg.txt |  137 ++++++++++++
 include/linux/sched.h                 |   81 ++++++-
 kernel/sched/core.c                   |    8 +-
 kernel/sched/fair.c                   |  398 +++++++++++++++++----------------
 kernel/sched/sched.h                  |   48 ++--
 5 files changed, 439 insertions(+), 233 deletions(-)
 create mode 100644 Documentation/scheduler/sched-avg.txt

-- 
1.7.9.5

* [PATCH v2 01/12] sched/fair: Optimize sum computation with a lookup table
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-05  9:41   ` [tip:sched/core] " tip-bot for Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 02/12] sched/fair: Rename variable names for sched averages Yuyang Du
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

__compute_runnable_contrib() uses a loop to compute the sum, whereas a
table lookup can do it faster, in constant time.

The program to generate the constants is located at:
Documentation/scheduler/sched-avg.txt
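
For illustration, a minimal user-space sketch (not part of the patch) of
how the new table relates to the loop it replaces: entry k of
__accumulated_sum_N32[] is the accumulated sum for k full half-lives, and
each iteration of the removed loop computes sum = sum/2 + 23371, where
23371 is runnable_avg_yN_sum[LOAD_AVG_PERIOD]:

#include <stdio.h>

int main(void)
{
	unsigned int sum = 0, k;

	/* prints 0, 23371, 35056, 40899, ..., 46719: the table entries */
	for (k = 0; k <= 11; k++) {
		printf("%u, ", sum);
		sum = sum / 2 + 23371;	/* y^32 = 1/2, then add one half-life's sum */
	}
	printf("\n");

	return 0;
}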

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8a33ab..e803f11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
 };
 
 /*
+ * Precomputed \Sum y^k { 1<=k<=n, where n%32==0 }. Values are rounded down to
+ * lower integers.
+ */
+static const u32 __accumulated_sum_N32[] = {
+	    0, 23371, 35056, 40899, 43820, 45281,
+	46011, 46376, 46559, 46650, 46696, 46719,
+};
+
+/*
  * Approximate:
  *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
  */
@@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
 	else if (unlikely(n >= LOAD_AVG_MAX_N))
 		return LOAD_AVG_MAX;
 
-	/* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
-	do {
-		contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
-		contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
-
-		n -= LOAD_AVG_PERIOD;
-	} while (n > LOAD_AVG_PERIOD);
-
+	/* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
+	contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
+	n %= LOAD_AVG_PERIOD;
 	contrib = decay_load(contrib, n);
 	return contrib + runnable_avg_yN_sum[n];
 }
-- 
1.7.9.5

* [PATCH v2 02/12] sched/fair: Rename variable names for sched averages
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 03/12] sched/fair: Change the variable to hold the number of periods to 32bit integer Yuyang Du
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

The names of the sched averages (including load_avg and util_avg) have
been changed and added to over the past couple of years; some of the
names are confusing, especially to people reading them for the first
time. This patch attempts to make the names more self-explanatory, and
updates some comments as well.
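
To illustrate the half-life convention the new names encode, here is a
small user-space sketch (compile with -lm; not part of the patch) of what
__decay_sum() computes: decaying a value by n periods splits into whole
half-lives (a shift) plus a remainder, which the kernel handles with the
fixed-point __decay_inv_multiply_N[] table and mul_u64_u32_shr(), modeled
here with a floating-point multiply:

#include <math.h>
#include <stdio.h>

#define HALFLIFE 32

static unsigned long long decay(unsigned long long val, unsigned int n)
{
	double y = pow(0.5, 1.0 / HALFLIFE);	/* y^32 = 0.5 */

	val >>= n / HALFLIFE;			/* each whole half-life halves val */
	return val * pow(y, n % HALFLIFE);	/* partial decay for the remainder */
}

int main(void)
{
	/* the maximum sum 47742 decayed by one half-life (32 periods) -> 23871 */
	printf("%llu\n", decay(47742, 32));

	return 0;
}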

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  209 ++++++++++++++++++++++++++-------------------------
 1 file changed, 107 insertions(+), 102 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e803f11..74eaeab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -660,13 +660,15 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
- * We choose a half-life close to 1 scheduling period.
- * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
- * dependent on this value.
+ * Note: everything in sched average calculation, including
+ * __decay_inv_multiply_N, __accumulated_sum_N, __accumulated_sum_N32,
+ * SCHED_AVG_MAX, and SCHED_AVG_MAX_N are all dependent on and only on
+ * (1) exponential decay, (2) a period of 1024*1024ns (~1ms), and (3)
+ * a half-life of 32 periods.
  */
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+#define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
+#define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
+#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -681,7 +683,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 */
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
-	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+	sa->load_sum = sa->load_avg * SCHED_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
@@ -731,7 +733,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		} else {
 			sa->util_avg = cap;
 		}
-		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+		sa->util_sum = sa->util_avg * SCHED_AVG_MAX;
 	}
 }
 
@@ -1834,7 +1836,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.load_sum / p->se.load.weight;
-		*period = LOAD_AVG_MAX;
+		*period = SCHED_AVG_MAX;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2583,7 +2585,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 #ifdef CONFIG_SMP
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
-static const u32 runnable_avg_yN_inv[] = {
+static const u32 __decay_inv_multiply_N[] = {
 	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
 	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
 	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
@@ -2596,7 +2598,7 @@ static const u32 runnable_avg_yN_inv[] = {
  * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
  * over-estimates when re-combining.
  */
-static const u32 runnable_avg_yN_sum[] = {
+static const u32 __accumulated_sum_N[] = {
 	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
 	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
 	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
@@ -2612,16 +2614,18 @@ static const u32 __accumulated_sum_N32[] = {
 };
 
 /*
- * Approximate:
- *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ * val * y^n, where y^m ~= 0.5
+ *
+ * n is the number of periods past; a period is ~1ms
+ * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
  */
-static __always_inline u64 decay_load(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u64 n)
 {
 	unsigned int local_n;
 
 	if (!n)
 		return val;
-	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+	else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
 		return 0;
 
 	/* after bounds checking we can collapse to 32-bit */
@@ -2634,36 +2638,36 @@ static __always_inline u64 decay_load(u64 val, u64 n)
 	 *
 	 * To achieve constant time decay_load.
 	 */
-	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
-		val >>= local_n / LOAD_AVG_PERIOD;
-		local_n %= LOAD_AVG_PERIOD;
+	if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
+		val >>= local_n / SCHED_AVG_HALFLIFE;
+		local_n %= SCHED_AVG_HALFLIFE;
 	}
 
-	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
 	return val;
 }
 
 /*
- * For updates fully spanning n periods, the contribution to runnable
- * average will be: \Sum 1024*y^n
+ * For updates fully spanning n periods, the accumulated contribution
+ * will be: \Sum 1024*y^n.
  *
- * We can compute this reasonably efficiently by combining:
- *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
+ * We can compute this efficiently by combining:
+ * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __compute_runnable_contrib(u64 n)
+static u32 __accumulate_sum(u64 n)
 {
 	u32 contrib = 0;
 
-	if (likely(n <= LOAD_AVG_PERIOD))
-		return runnable_avg_yN_sum[n];
-	else if (unlikely(n >= LOAD_AVG_MAX_N))
-		return LOAD_AVG_MAX;
+	if (likely(n <= SCHED_AVG_HALFLIFE))
+		return __accumulated_sum_N[n];
+	else if (unlikely(n >= SCHED_AVG_MAX_N))
+		return SCHED_AVG_MAX;
 
-	/* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
-	contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
-	n %= LOAD_AVG_PERIOD;
-	contrib = decay_load(contrib, n);
-	return contrib + runnable_avg_yN_sum[n];
+	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
+	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
+	n %= SCHED_AVG_HALFLIFE;
+	contrib = __decay_sum(contrib, n);
+	return contrib + __accumulated_sum_N[n];
 }
 
 #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
@@ -2673,35 +2677,35 @@ static u32 __compute_runnable_contrib(u64 n)
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series.  To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ * We can represent the historical contribution to sched average as the
+ * coefficients of a geometric series.  To do this we divide the history
+ * into segments of approximately 1ms (1024*1024ns); label the segment that
+ * occurred N-1024us ago p_N, with p_0 corresponding to the current period, e.g.
  *
  * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
  *      p0            p1           p2
  *     (now)       (~1ms ago)  (~2ms ago)
  *
- * Let u_i denote the fraction of p_i that the entity was runnable.
+ * Let u_i denote the fraction of p_i whose state (runnable/running) we count.
  *
  * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
+ * following representation of a sched metric:
  *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
  *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- *   y^32 = 0.5
+ * We choose y based on a half-life of 32 periods (which is ~32ms):
+ *   y^32 = 0.5 => y = (0.5)^(1/32)
  *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
+ * where 32 is the number of periods after which a past period's contribution
+ * is halved. That is, a contribution from ~32ms ago is weighted about half as
+ * much as a contribution from the current period.
  *
  * When a period "rolls over" and we have new u_0`, multiplying the previous
  * sum again by y is sufficient to update:
- *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ *   avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *       = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
+__update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta, scaled_delta, periods;
@@ -2762,15 +2766,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		periods = delta / 1024;
 		delta %= 1024;
 
-		sa->load_sum = decay_load(sa->load_sum, periods + 1);
+		sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
 		if (cfs_rq) {
 			cfs_rq->runnable_load_sum =
-				decay_load(cfs_rq->runnable_load_sum, periods + 1);
+				__decay_sum(cfs_rq->runnable_load_sum, periods + 1);
 		}
-		sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
+		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
-		contrib = __compute_runnable_contrib(periods);
+		contrib = __accumulate_sum(periods);
 		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
@@ -2794,12 +2798,12 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->period_contrib += delta;
 
 	if (decayed) {
-		sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+		sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
 		if (cfs_rq) {
 			cfs_rq->runnable_load_avg =
-				div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+				div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
 		}
-		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
 	}
 
 	return decayed;
@@ -2867,8 +2871,8 @@ void set_task_rq_fair(struct sched_entity *se,
 		p_last_update_time = prev->avg.last_update_time;
 		n_last_update_time = next->avg.last_update_time;
 #endif
-		__update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
-				  &se->avg, 0, 0, NULL);
+		__update_sched_avg(p_last_update_time, cpu_of(rq_of(prev)),
+				   &se->avg, 0, 0, NULL);
 		se->avg.last_update_time = n_last_update_time;
 	}
 }
@@ -2909,7 +2913,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
+update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed, removed_load = 0, removed_util = 0;
@@ -2917,18 +2921,18 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sa->load_sum = max_t(s64, sa->load_sum - r * SCHED_AVG_MAX, 0);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
 		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sa->util_sum = max_t(s32, sa->util_sum - r * SCHED_AVG_MAX, 0);
 		removed_util = 1;
 	}
 
-	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+	decayed = __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
 		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
@@ -2943,7 +2947,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 }
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_sched_avg(struct sched_entity *se, int update_tg)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -2954,15 +2958,15 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	__update_sched_avg(now, cpu, &se->avg,
+			   se->on_rq * scale_load_down(se->load.weight),
+			   cfs_rq->curr == se, NULL);
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
+	if (update_cfs_rq_sched_avg(now, cfs_rq, true) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
 }
 
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (!sched_feat(ATTACH_AGE_LOAD))
 		goto skip_aging;
@@ -2972,8 +2976,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * have aged the average right before clearing @last_update_time.
 	 */
 	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-				  &se->avg, 0, 0, NULL);
+		__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+				   &se->avg, 0, 0, NULL);
 
 		/*
 		 * XXX: we could have just aged the entire load away if we've been
@@ -2991,11 +2995,11 @@ skip_aging:
 	cfs_rq_util_change(cfs_rq);
 }
 
-static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+			   &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			   cfs_rq->curr == se, NULL);
 
 	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
 	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
@@ -3007,7 +3011,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 /* Add the load generated by se into cfs_rq's load average */
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -3015,18 +3019,18 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	migrated = !sa->last_update_time;
 	if (!migrated) {
-		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
-			cfs_rq->curr == se, NULL);
+		__update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+				   se->on_rq * scale_load_down(se->load.weight),
+				   cfs_rq->curr == se, NULL);
 	}
 
-	decayed = update_cfs_rq_load_avg(now, cfs_rq, !migrated);
+	decayed = update_cfs_rq_sched_avg(now, cfs_rq, !migrated);
 
 	cfs_rq->runnable_load_avg += sa->load_avg;
 	cfs_rq->runnable_load_sum += sa->load_sum;
 
 	if (migrated)
-		attach_entity_load_avg(cfs_rq, se);
+		attach_entity_sched_avg(cfs_rq, se);
 
 	if (decayed || migrated)
 		update_tg_load_avg(cfs_rq, 0);
@@ -3034,9 +3038,9 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
+	update_sched_avg(se, 1);
 
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
@@ -3069,7 +3073,7 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
  */
-void remove_entity_load_avg(struct sched_entity *se)
+void remove_entity_sched_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 last_update_time;
@@ -3083,7 +3087,8 @@ void remove_entity_load_avg(struct sched_entity *se)
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
 
-	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+	__update_sched_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
+			   &se->avg, 0, 0, NULL);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
 	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
 }
@@ -3102,17 +3107,17 @@ static int idle_balance(struct rq *this_rq);
 
 #else /* CONFIG_SMP */
 
-static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void update_sched_avg(struct sched_entity *se, int update_tg) {}
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void remove_entity_load_avg(struct sched_entity *se) {}
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void remove_entity_sched_avg(struct sched_entity *se) {}
 
 static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 
 static inline int idle_balance(struct rq *rq)
 {
@@ -3272,7 +3277,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_entity_sched_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -3351,7 +3356,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_entity_sched_avg(cfs_rq, se);
 
 	if (schedstat_enabled())
 		update_stats_dequeue(cfs_rq, se, flags);
@@ -3431,7 +3436,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (schedstat_enabled())
 			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3535,7 +3540,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_sched_avg(prev, 0);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3551,7 +3556,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_sched_avg(curr, 1);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4424,7 +4429,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4484,7 +4489,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_sched_avg(se, 1);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -5370,7 +5375,7 @@ static void migrate_task_rq_fair(struct task_struct *p)
 	 * will result in the wakee task is less decayed, but giving the wakee more
 	 * load sounds not bad.
 	 */
-	remove_entity_load_avg(&p->se);
+	remove_entity_sched_avg(&p->se);
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;
@@ -5381,7 +5386,7 @@ static void migrate_task_rq_fair(struct task_struct *p)
 
 static void task_dead_fair(struct task_struct *p)
 {
-	remove_entity_load_avg(&p->se);
+	remove_entity_sched_avg(&p->se);
 }
 #endif /* CONFIG_SMP */
 
@@ -6262,7 +6267,7 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
+		if (update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -6323,7 +6328,7 @@ static inline void update_blocked_averages(int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 	update_rq_clock(rq);
-	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
+	update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
@@ -8357,7 +8362,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	detach_entity_load_avg(cfs_rq, se);
+	detach_entity_sched_avg(cfs_rq, se);
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -8374,7 +8379,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 #endif
 
 	/* Synchronize task with its cfs_rq */
-	attach_entity_load_avg(cfs_rq, se);
+	attach_entity_sched_avg(cfs_rq, se);
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
@@ -8513,7 +8518,7 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 	for_each_possible_cpu(cpu) {
 		if (tg->se[cpu])
-			remove_entity_load_avg(tg->se[cpu]);
+			remove_entity_sched_avg(tg->se[cpu]);
 
 		/*
 		 * Only empty task groups can be destroyed; so we can speculatively
-- 
1.7.9.5

* [PATCH v2 03/12] sched/fair: Change the variable to hold the number of periods to 32bit integer
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 02/12] sched/fair: Rename variable names for sched averages Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-03  8:47   ` Peter Zijlstra
  2016-05-02 21:54 ` [PATCH v2 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

In the sched average update, a period is about 1ms, so a 32-bit unsigned
integer can hold approximately 49 (=2^32/1000/3600/24) days' worth of
periods, which is big enough that a 64-bit variable is not needed.
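
Expanding the arithmetic (the changelog rounds a period to 1ms; with the
exact 1024us period the bound is slightly larger):

	2^32 periods * 1000 us/period  = 4294967296 ms ~= 49.7 days
	2^32 periods * 1024 us/period ~= 4398046511 ms ~= 50.9 days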

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74eaeab..17bc721 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2619,18 +2619,13 @@ static const u32 __accumulated_sum_N32[] = {
  * n is the number of periods past; a period is ~1ms
  * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
  */
-static __always_inline u64 __decay_sum(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u32 n)
 {
-	unsigned int local_n;
-
 	if (!n)
 		return val;
 	else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
 		return 0;
 
-	/* after bounds checking we can collapse to 32-bit */
-	local_n = n;
-
 	/*
 	 * As y^PERIOD = 1/2, we can combine
 	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
@@ -2638,12 +2633,12 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
 	 *
 	 * To achieve constant time decay_load.
 	 */
-	if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
-		val >>= local_n / SCHED_AVG_HALFLIFE;
-		local_n %= SCHED_AVG_HALFLIFE;
+	if (unlikely(n >= SCHED_AVG_HALFLIFE)) {
+		val >>= n / SCHED_AVG_HALFLIFE;
+		n %= SCHED_AVG_HALFLIFE;
 	}
 
-	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
+	val = mul_u64_u32_shr(val, __decay_inv_multiply_N[n], 32);
 	return val;
 }
 
@@ -2654,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __accumulate_sum(u64 n)
+static u32 __accumulate_sum(u32 n)
 {
 	u32 contrib = 0;
 
@@ -2708,8 +2703,8 @@ static __always_inline int
 __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
-	u32 contrib;
+	u64 delta, scaled_delta;
+	u32 contrib, periods;
 	unsigned int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq, scale_cpu;
 
@@ -2762,7 +2757,11 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		delta -= delta_w;
 
-		/* Figure out how many additional periods this update spans */
+		/*
+		 * Figure out how many additional periods this update spans.
+		 * A period is 1024*1024ns or ~1ms, so a 32bit integer can hold
+		 * approximately a maximum of 49 (=2^32/1000/3600/24) days.
+		 */
 		periods = delta / 1024;
 		delta %= 1024;
 
-- 
1.7.9.5

* [PATCH v2 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum()
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (2 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 03/12] sched/fair: Change the variable to hold the number of periods to 32bit integer Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

All the other sched average helpers have it. If code size is not a
concern, __accumulate_sum() should have it too.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 17bc721..1655280 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2649,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static u32 __accumulate_sum(u32 n)
+static __always_inline u32 __accumulate_sum(u32 n)
 {
 	u32 contrib = 0;
 
-- 
1.7.9.5

* [PATCH v2 05/12] sched/fair: Optimize __update_sched_avg()
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (3 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-03  8:49   ` Peter Zijlstra
  2016-05-02 21:54 ` [PATCH v2 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

__update_sched_avg() has these steps:
1. add what was left of the last incomplete period
2. decay the old sum
3. accumulate the new sum since last_update_time
4. add the current incomplete period
5. update the averages

Previously, steps 1, 3, and 4 were computed separately, which made each
of them ugly in code and costly in overhead. They all do the same thing,
however, so we combine them (see the accounting sketch below). The
result is much cleaner code and fewer CPU cycles.
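
Roughly, with p periods crossed since last_update_time, the combined
accumulation accounts for three segments at once (weight and frequency
scaling are omitted from this sketch): d1 finishes the last incomplete
period, d2 covers the p-1 full periods in between, and d3 is the
just-started period:

                     d1          d2           d3
                      ^           ^            ^
                      |           |            |
                    |<->|<----------------->|<--->|
            ... |---x---|------| ... |------|-----x (now)

                                     p-1
            u' = (u + d1) y^p + 1024 \Sum y^n + d3
                                     n=1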

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |  185 +++++++++++++++++++++++++--------------------------
 1 file changed, 92 insertions(+), 93 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1655280..ea99c2c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
  */
 #define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
 #define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
-#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
+#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
 
 /*
  * Precomputed \Sum y^k { 1<=k<=n, where n%32==0 }. Values are rounded down to
- * lower integers.
+ * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
  */
 static const u32 __accumulated_sum_N32[] = {
 	    0, 23371, 35056, 40899, 43820, 45281,
@@ -2649,20 +2649,31 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
  * We can compute this efficiently by combining:
  * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
  */
-static __always_inline u32 __accumulate_sum(u32 n)
+static __always_inline u32
+__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
 {
-	u32 contrib = 0;
+	u32 contrib;
 
-	if (likely(n <= SCHED_AVG_HALFLIFE))
-		return __accumulated_sum_N[n];
-	else if (unlikely(n >= SCHED_AVG_MAX_N))
+	if (!periods)
+		return remainder - period_contrib;
+
+	if (unlikely(periods >= SCHED_AVG_MAX_N))
 		return SCHED_AVG_MAX;
 
-	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
-	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
-	n %= SCHED_AVG_HALFLIFE;
-	contrib = __decay_sum(contrib, n);
-	return contrib + __accumulated_sum_N[n];
+	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
+
+	periods -= 1;
+	if (likely(periods <= SCHED_AVG_HALFLIFE))
+		contrib = __accumulated_sum_N[periods];
+	else {
+		/*(periods>>5) = (periods/SCHED_AVG_HALFLIFE) */
+		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
+		periods %= SCHED_AVG_HALFLIFE;
+		contrib = __decay_sum(contrib, periods);
+		contrib += __accumulated_sum_N[periods];
+	}
+
+	return contrib + remainder;
 }
 
 #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
@@ -2671,6 +2682,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
 
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
+static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
+	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
+{
+	u32 contrib, periods;
+	unsigned long scale_freq, scale_cpu;
+
+	scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+	delta += sa->period_contrib;
+	periods = delta >> 10; /* A period is 1024us (~1ms) */
+
+	/*
+	 * Accumulating *_sum has two steps.
+	 *
+	 * Step 1: decay old *_sum if we crossed period boundaries.
+	 */
+	if (periods) {
+		sa->load_sum = __decay_sum(sa->load_sum, periods);
+		if (cfs_rq) {
+			cfs_rq->runnable_load_sum =
+				__decay_sum(cfs_rq->runnable_load_sum, periods);
+		}
+		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
+	}
+
+	/*
+	 * Step 2: accumulate new *_sum since last_update_time. This at most has
+	 * three parts (at least one part): (1) remainder of incomplete last
+	 * period, (2) full periods since (1), and (3) incomplete current period.
+	 *
+	 * Fortunately, we can (and should) do all these three at once.
+	 */
+	delta %= 1024;
+	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
+	sa->period_contrib = delta;
+
+	contrib = cap_scale(contrib, scale_freq);
+	if (weight) {
+		sa->load_sum += weight * contrib;
+		if (cfs_rq)
+			cfs_rq->runnable_load_sum += weight * contrib;
+	}
+	if (running)
+		sa->util_sum += contrib * scale_cpu;
+
+	return periods;
+}
+
 /*
  * We can represent the historical contribution to sched average as the
  * coefficients of a geometric series.  To do this we divide the history
@@ -2701,12 +2761,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
  */
 static __always_inline int
 __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
+		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta;
-	u32 contrib, periods;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	u64 delta;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2727,85 +2784,27 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
-	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->period_contrib;
-	if (delta + delta_w >= 1024) {
-		decayed = 1;
-
-		/* how much left for next period will start over, we don't know yet */
-		sa->period_contrib = 0;
-
-		/*
-		 * Now that we know we're crossing a period boundary, figure
-		 * out how much from delta we need to complete the current
-		 * period and accrue it.
-		 */
-		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
-		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
-			if (cfs_rq) {
-				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
-			}
-		}
-		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
-
-		delta -= delta_w;
-
-		/*
-		 * Figure out how many additional periods this update spans.
-		 * A period is 1024*1024ns or ~1ms, so a 32bit integer can hold
-		 * approximately a maximum of 49 (=2^32/1000/3600/24) days.
-		 */
-		periods = delta / 1024;
-		delta %= 1024;
-
-		sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_sum =
-				__decay_sum(cfs_rq->runnable_load_sum, periods + 1);
-		}
-		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);
-
-		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
-		contrib = __accumulate_sum(periods);
-		contrib = cap_scale(contrib, scale_freq);
-		if (weight) {
-			sa->load_sum += weight * contrib;
-			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
-		}
-		if (running)
-			sa->util_sum += contrib * scale_cpu;
-	}
-
-	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
-	if (weight) {
-		sa->load_sum += weight * scaled_delta;
-		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
-	}
-	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
-
-	sa->period_contrib += delta;
+	/*
+	 * Now we know we crossed measurement unit boundaries. The *_avg
+	 * accrues by two steps:
+	 *
+	 * Step 1: accumulate *_sum since last_update_time. If we haven't
+	 * crossed period boundaries, finish.
+	 */
+	if (!accumulate_sum(delta, sa, cfs_rq, cpu, weight, running))
+		return 0;
 
-	if (decayed) {
-		sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_avg =
-				div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
-		}
-		sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
+	/*
+	 * Step 2: update *_avg.
+	 */
+	sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
+	if (cfs_rq) {
+		cfs_rq->runnable_load_avg =
+			div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
 	}
+	sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
 
-	return decayed;
+	return 1;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
1.7.9.5

* [PATCH v2 06/12] documentation: Add scheduler/sched-avg.txt
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (4 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-03  8:50   ` Peter Zijlstra
  2016-05-02 21:54 ` [PATCH v2 07/12] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

This documentation file contains the programs used to generate the
constants for computing the sched averages.
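
When compiled (e.g. with gcc -lm) and run, the C program below should
reproduce the constants used by kernel/sched/fair.c; an excerpt of the
expected output:

static const u32 __decay_inv_multiply_N[] = {
	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
	...
};

static const u32 __accumulated_sum_N[] = {
	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
	...
};

followed by the SCHED_AVG_* defines and the __accumulated_sum_N32[] table.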

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 Documentation/scheduler/sched-avg.txt |  137 +++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)
 create mode 100644 Documentation/scheduler/sched-avg.txt

diff --git a/Documentation/scheduler/sched-avg.txt b/Documentation/scheduler/sched-avg.txt
new file mode 100644
index 0000000..ae4132f
--- /dev/null
+++ b/Documentation/scheduler/sched-avg.txt
@@ -0,0 +1,137 @@
+The following programs are used to generate the constants for
+computing sched averages.
+
+==============================================================
+		C program (compile with -lm)
+==============================================================
+
+#include <math.h>
+#include <stdio.h>
+
+#define HALFLIFE 32
+#define SHIFT 32
+
+double y;
+
+void calc_decay_inv_multiply() {
+	int i;
+	unsigned int x;
+
+	printf("static const u32 __decay_inv_multiply_N[] = {");
+	for(i = 0; i < HALFLIFE; i++) {
+		x = ((1UL<<32)-1)*pow(y, i);
+
+		if (i % 6 == 0) printf("\n\t");
+		printf("0x%8x, ", x);
+	}
+	printf("\n};\n\n");
+}
+
+int sum = 1024;
+void calc_accumulated_sum() {
+	int i;
+
+	printf("static const u32 __accumulated_sum_N[] = {\n\t    0,");
+	for(i = 1; i <= HALFLIFE; i++) {
+		if (i == 1)
+			sum *= y;
+		else
+			sum = sum*y + 1024*y;
+
+		if (i % 11 == 0) printf("\n\t");
+		printf("%5d,", sum);
+	}
+	printf("\n};\n\n");
+}
+
+int n = 1;
+/* first period */
+long max = 1024;
+
+void calc_converged_max() {
+	long last = 0, y_inv = ((1UL<<32)-1)*y;
+
+	for (; ; n++) {
+		if (n > 1)
+			max = ((max*y_inv)>>SHIFT) + 1024;
+			/*
+			 * This is the same as:
+			 * max = max*y + 1024;
+			 */
+
+		if (last == max)
+			break;
+
+		last = max;
+	}
+	n--;
+	printf("#define SCHED_AVG_HALFLIFE %d\n", HALFLIFE);
+	printf("#define SCHED_AVG_MAX %ld\n", max);
+	printf("#define SCHED_AVG_MAX_N %d\n\n", n);
+}
+
+void calc_accumulated_sum_32() {
+	int i, x = sum;
+
+	printf("static const u32 __accumulated_sum_N32[] = {\n\t     0,");
+	for(i = 1; i <= n/HALFLIFE+1; i++) {
+		if (i > 1)
+			x = x/2 + sum;
+
+		if (i % 6 == 0) printf("\n\t");
+		printf("%6d,", x);
+	}
+	printf("\n};\n\n");
+}
+
+void main() {
+	y = pow(0.5, 1/(double)HALFLIFE);
+
+	calc_decay_inv_multiply();
+	calc_accumulated_sum();
+	calc_converged_max();
+	calc_accumulated_sum_32();
+}
+
+==============================================================
+		Python script if you speak snake
+==============================================================
+
+#!/usr/bin/env python
+
+print " #:     yN_inv   yN_sum"
+print "-----------------------"
+y = (0.5)**(1/32.0)
+x = 2**32
+xx = 1024
+for i in range(0, 32):
+	if i == 0:
+		x = x-1
+		xx = xx*y
+	else:
+		x = x*y
+		xx = int(xx*y + 1024*y)
+	print "%2d: %#x %8d" % (i, int(x), int(xx))
+
+print
+print " #:  sum_N32"
+print "------------"
+xxx = xx
+for i in range(0, 11):
+	if i > 0:
+		xxx = xxx/2 + xx
+	print "%2d: %8d" % (i, xxx)
+
+print
+print "  n:     max"
+print "------------"
+xxxx = 1024
+old = 0
+i = 2
+while (1):
+	xxxx = int(xxxx*y + 1024)
+	if old == xxxx:
+		break
+	i = i+1
+	old = xxxx
+print "%3d: %7d" % (i-1, xxxx)
-- 
1.7.9.5

* [PATCH v2 07/12] sched/fair: Generalize the load/util averages resolution definition
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (5 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Integer metrics need fixed point arithmetic. In sched/fair, a few
metrics, including weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges.

In order to avoid errors arising from mixing up the fixed point ranges
of these metrics, we define a basic fixed point range, and then express
all metrics in terms of that basic range.

The basic range is 1024. Furthermore, the basic range can be applied
recursively to obtain larger ranges.

As pointed out by Ben Segall, weight (visible to the user, e.g., NICE-0
has 1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.
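
As a minimal sketch of the kind of fixed point arithmetic this basic
range enables (fp_scale() is a made-up name for illustration; cap_scale()
in kernel/sched/fair.c does the same shift):

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)

/* scale v by a ratio expressed in [0..SCHED_FIXEDPOINT_SCALE] */
static long fp_scale(long v, long ratio)
{
	return (v * ratio) >> SCHED_FIXEDPOINT_SHIFT;
}

int main(void)
{
	/* 80% of the NICE-0 weight: 0.8 ~= 819/SCHED_FIXEDPOINT_SCALE */
	printf("%ld\n", fp_scale(1024, 819));	/* prints 819 */

	return 0;
}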

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   16 +++++++++++++---
 kernel/sched/fair.c   |    4 ----
 kernel/sched/sched.h  |   15 ++++++++++-----
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad9454d..33e7929 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -937,9 +937,19 @@ enum cpu_idle_type {
 };
 
 /*
+ * Integer metrics need fixed point arithmetic, e.g., sched/fair
+ * has a few: load, load_avg, util_avg, freq, and capacity.
+ *
+ * We define a basic fixed point arithmetic range, and then formalize
+ * all these metrics based on that basic range.
+ */
+# define SCHED_FIXEDPOINT_SHIFT	10
+# define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)
+
+/*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT	10
+#define SCHED_CAPACITY_SHIFT	SCHED_FIXEDPOINT_SHIFT
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
 /*
@@ -1205,8 +1215,8 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
  * The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea99c2c..69bfb07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2676,10 +2676,6 @@ __accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
 	return contrib + remainder;
 }
 
-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69da6fc..996a137 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,18 +54,23 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_RESOLUTION	10
-# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION	0
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+/*
+ * NICE_0's weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them, the following must be true:
+ * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ */
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
-- 
1.7.9.5

* [PATCH v2 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (6 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 07/12] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 09/12] sched/fair: Add introduction to the sched average metrics Yuyang Du
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

After cleaning up the sched metrics, these two ambiguity-causing
definitions are not needed any more. Use NICE_0_LOAD_SHIFT and NICE_0_LOAD
instead (the names clearly suggest what they are).
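
Concretely, with the increased load resolution enabled, the calibration
requirement spelled out in the updated comment works out to:

	scale_load(sched_prio_to_weight[20]) = 1024 << SCHED_FIXEDPOINT_SHIFT
	                                     = 1 << NICE_0_LOAD_SHIFT
	                                     = NICE_0_LOAD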

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c  |    4 ++--
 kernel/sched/sched.h |   22 +++++++++++-----------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69bfb07..fa79820 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -721,7 +721,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
-	long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -7017,7 +7017,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
+				      scale_load_down(NICE_0_LOAD);
 		if (load_above_capacity > busiest->group_capacity)
 			load_above_capacity -= busiest->group_capacity;
 		else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 996a137..1a3be6f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,25 +54,25 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
-
 /*
- * NICE_0's weight (visible to user) and its load (invisible to user) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them, the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
  */
-#define NICE_0_LOAD		SCHED_LOAD_SCALE
-#define NICE_0_SHIFT		SCHED_LOAD_SHIFT
+#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
@@ -861,7 +861,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 struct sched_group_capacity {
 	atomic_t ref;
 	/*
-	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
 	unsigned int capacity;
-- 
1.7.9.5

* [PATCH v2 09/12] sched/fair: Add introduction to the sched average metrics
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (7 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 10/12] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

These sched metrics have become complex enough that they deserve an
introduction at their definitions.
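
To put the overflow bound documented below in perspective:

	2^64 / 47742 / 88761 ~= 4353082796 max-weight entities
	PID_MAX_LIMIT (64bit) = 4 * 1024 * 1024 = 4194304 tasks

so a single cfs_rq cannot realistically aggregate enough entities to
overflow the 64bit load_sum.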

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   60 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 33e7929..a7cddd6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,18 +1211,56 @@ struct load_weight {
 };
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors frequency scaling into the amount of time that a
- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
- * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
- * For cfs_rq, it is the aggregated such times of all runnable and
+ * The load_avg/util_avg accumulates an infinite geometric series
+ * (see __update_sched_avg() in kernel/sched/fair.c).
+ *
+ * [load_avg definition]
+ *
+ * load_avg = runnable% * scale_load_down(load)
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated such load_avg of all runnable and
  * blocked sched_entities.
- * The 64 bit load_sum can:
- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
- * the highest weight (=88761) always runnable, we should not overflow
- * 2) for entity, support any load.weight always runnable
+ *
+ * load_avg may also take frequency scaling into account:
+ *
+ * load_avg = runnable% * scale_load_down(load) * freq%
+ *
+ * where freq% is the CPU frequency normalized to the highest frequency
+ *
+ * [util_avg definition]
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE
+ *
+ * where running% is the time ratio that a sched_entity is running on
+ * a CPU. For cfs_rq, it is the aggregated such util_avg of all runnable
+ * and blocked sched_entities.
+ *
+ * util_avg may also factor frequency scaling and CPU capacity scaling:
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
+ *
+ * where freq% is the same as above, and capacity% is the CPU capacity
+ * normalized to the greatest capacity (due to uarch differences, etc).
+ *
+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
+ * themselves are in the range of [0, 1]. To do fixed point arithmetic,
+ * we therefore scale them up to as large a range as necessary. This is for
+ * example reflected by util_avg's SCHED_CAPACITY_SCALE.
+ *
+ * [Overflow issue]
+ *
+ * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
+ * with the highest load (=88761) always runnable on a single cfs_rq, we
+ * should not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * For all other cases (including 32bit kernel), struct load_weight's
+ * weight will overflow first before we do, because:
+ *
+ *    Max(load_avg) <= Max(load.weight)
+ *
+ * Then, it is the load_weight's responsibility to consider overflow
+ * issues.
  */
 struct sched_avg {
 	u64 last_update_time, load_sum;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread
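
The overflow bound cited in the new comment is easy to re-derive. Below
is a minimal user-space sketch; 47742 (the maximum value the decaying
per-unit-weight sum can reach) and 88761 (the nice -20 weight,
sched_prio_to_weight[0]) are the kernel's constants, while the program
itself is only illustrative. It prints roughly 4.35e9, matching the
4353082796 figure quoted above.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t load_sum_max = 47742;	/* saturated decaying sum per unit weight */
	const uint64_t max_weight   = 88761;	/* sched_prio_to_weight[0], nice -20 */

	/* entities at max weight that fit before a 64bit load_sum overflows */
	printf("%llu\n",
	       (unsigned long long)(UINT64_MAX / load_sum_max / max_weight));
	return 0;
}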

* [PATCH v2 10/12] sched/fair: Remove scale_load_down() for load_avg
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (8 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 09/12] sched/fair: Add introduction to the sched average metrics Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 11/12] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
down of load does not make much sense, because load_avg is primarily THE
load, and on top of that we take the runnable time ratio into account.

We therefore remove scale_load_down() for load_avg. But we need to
carefully consider the overflow risk when load has the higher fixed point
range (2*SCHED_FIXEDPOINT_SHIFT). The only case in which we can cause an
overflow is a 64bit kernel with the increased fixed point range. In that
case, the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
entities with the highest load (=88761*1024) always runnable on one
single cfs_rq, which should be fine. Even if an overflow did occur, the
load average would not be useful anyway under the conditions that produce
it. And afterwards, if the machine survives, the load will correct itself
very quickly, in no more than ~2 seconds (=32ms*64).
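
To spell the arithmetic out: the decaying per-unit-weight sum saturates
at 47742, so with the increased range

    2^64 / 47742 / 88761 / 1024 ~= 4251057

entities at the highest load fit in the 64bit load_sum. And because the
sum halves every LOAD_AVG_PERIOD (32 periods of ~1ms, since y^32 = 1/2),
even a corrupted value decays away within about 64 half-lives, which is
where the 64 * 32ms ~= 2s figure comes from.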

[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   19 ++++++++++++++-----
 kernel/sched/fair.c   |   19 +++++++++----------
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7cddd6..b718cb0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1216,7 +1216,7 @@ struct load_weight {
  *
  * [load_avg definition]
  *
- * load_avg = runnable% * scale_load_down(load)
+ * load_avg = runnable% * load
  *
  * where runnable% is the time ratio that a sched_entity is runnable.
  * For cfs_rq, it is the aggregated such load_avg of all runnable and
@@ -1224,7 +1224,7 @@ struct load_weight {
  *
  * load_avg may also take frequency scaling into account:
  *
- * load_avg = runnable% * scale_load_down(load) * freq%
+ * load_avg = runnable% * load * freq%
  *
  * where freq% is the CPU frequency normalized to the highest frequency
  *
@@ -1250,9 +1250,18 @@ struct load_weight {
  *
  * [Overflow issue]
  *
- * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
- * with the highest load (=88761) always runnable on a single cfs_rq, we
- * should not overflow as the number already hits PID_MAX_LIMIT.
+ * On 64bit kernel:
+ *
+ * When load has small fixed point range (SCHED_FIXEDPOINT_SHIFT), the
+ * 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
+ * the highest load (=88761) always runnable on a cfs_rq, we should
+ * not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * When load has large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
+ * with the highest load (=88761*1024) always runnable on ONE cfs_rq,
+ * we should be fine. Even if the overflow occurs at the end of day,
+ * at the time the load_avg won't be useful anyway in that situation.
  *
  * For all other cases (including 32bit kernel), struct load_weight's
  * weight will overflow first before we do, because:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa79820..504803e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,7 +682,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	sa->load_avg = se->load.weight;
 	sa->load_sum = sa->load_avg * SCHED_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
@@ -2927,7 +2927,7 @@ update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	}
 
 	decayed = __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+				     cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
@@ -2952,8 +2952,7 @@ static inline void update_sched_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_sched_avg(now, cpu, &se->avg,
-			   se->on_rq * scale_load_down(se->load.weight),
+	__update_sched_avg(now, cpu, &se->avg, se->on_rq * se->load.weight,
 			   cfs_rq->curr == se, NULL);
 
 	if (update_cfs_rq_sched_avg(now, cfs_rq, true) && update_tg)
@@ -2992,7 +2991,7 @@ skip_aging:
 static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	__update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			   &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			   &se->avg, se->on_rq * se->load.weight,
 			   cfs_rq->curr == se, NULL);
 
 	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
@@ -3014,7 +3013,7 @@ enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	migrated = !sa->last_update_time;
 	if (!migrated) {
 		__update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-				   se->on_rq * scale_load_down(se->load.weight),
+				   se->on_rq * se->load.weight,
 				   cfs_rq->curr == se, NULL);
 	}
 
@@ -7016,10 +7015,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-				      scale_load_down(NICE_0_LOAD);
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
+		if (load_above_capacity > scale_load(busiest->group_capacity))
+			load_above_capacity -=
+				scale_load(busiest->group_capacity);
 		else
 			load_above_capacity = ~0UL;
 	}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 11/12] sched/fair: Rename scale_load() and scale_load_down()
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (9 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 10/12] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 21:54 ` [PATCH v2 12/12] sched/fair: Enable increased scale for kernel load Yuyang Du
  2016-05-02 22:08 ` [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

Rename scale_load() and scale_load_down() to user_to_kernel_load()
and kernel_to_user_load() respectively. The new names state the
direction of the conversion explicitly and avoid confusion.

[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    8 ++++----
 kernel/sched/fair.c  |   18 ++++++++++++------
 kernel/sched/sched.h |   16 ++++++++--------
 3 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c82ca6e..349d776 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -698,12 +698,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (idle_policy(p->policy)) {
-		load->weight = scale_load(WEIGHT_IDLEPRIO);
+		load->weight = user_to_kernel_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
+	load->weight = user_to_kernel_load(sched_prio_to_weight[prio]);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -8184,7 +8184,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	return sched_group_set_shares(css_tg(css), user_to_kernel_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -8192,7 +8192,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (u64) kernel_to_user_load(tg->shares);
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 504803e..26bd0df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
 	if (likely(lw->inv_weight))
 		return;
 
-	w = scale_load_down(lw->weight);
+	w = kernel_to_user_load(lw->weight);
 
 	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 		lw->inv_weight = 1;
@@ -210,10 +210,14 @@ static void __update_inv_weight(struct load_weight *lw)
  *
  * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
  * weight/lw.weight <= 1, and therefore our shift will also be positive.
+ *
+ * Note load.weight falls back to user load scale (i.e., NICE_0's load is
+ * 1024), instead of possibly increased kernel load scale (i.e., NICE_0's
+ * load is NICE_0_LOAD) due to multiplication and division efficiency.
  */
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 {
-	u64 fact = scale_load_down(weight);
+	u64 fact = kernel_to_user_load(weight);
 	int shift = WMULT_SHIFT;
 
 	__update_inv_weight(lw);
@@ -7015,10 +7019,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
+		unsigned long min_cpu_load =
+			busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE;
 		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
-		if (load_above_capacity > scale_load(busiest->group_capacity))
-			load_above_capacity -=
-				scale_load(busiest->group_capacity);
+		if (load_above_capacity > min_cpu_load)
+			load_above_capacity -= min_cpu_load;
 		else
 			load_above_capacity = ~0UL;
 	}
@@ -8572,7 +8577,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, user_to_kernel_load(MIN_SHARES),
+		       user_to_kernel_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a3be6f..871da67 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,22 +55,22 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
+# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		(w)
-# define scale_load_down(w)	(w)
+# define user_to_kernel_load(w)	(w)
+# define kernel_to_user_load(w)	(w)
 #endif
 
 /*
  * Task weight (visible to user) and its load (invisible to user) have
  * independent resolution, but they should be well calibrated. We use
- * scale_load() and scale_load_down(w) to convert between them. The
- * following must be true:
- *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them. The following must be true:
  *
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread
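
The open-coded min_cpu_load above is just a unit conversion from capacity
units (SCHED_CAPACITY_SCALE) into kernel load units (NICE_0_LOAD). A
stand-alone sketch of that conversion follows; the helper name and the
hard-coded constants are made up for illustration, and with the default
(non-increased) scale the conversion is an identity since NICE_0_LOAD ==
SCHED_CAPACITY_SCALE.

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define NICE_0_LOAD		(1UL << 20)	/* increased-scale case, see the next patch */

/* capacity is in SCHED_CAPACITY_SCALE units; the result is in kernel load units */
static unsigned long capacity_to_load(unsigned long capacity)
{
	return capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE;
}

int main(void)
{
	/* e.g. a sched_group of 4 fully-capable CPUs maps to 4 * NICE_0_LOAD */
	printf("%lu\n", capacity_to_load(4 * SCHED_CAPACITY_SCALE) / NICE_0_LOAD);
	return 0;
}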

* [PATCH v2 12/12] sched/fair: Enable increased scale for kernel load
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (10 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 11/12] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
@ 2016-05-02 21:54 ` Yuyang Du
  2016-05-02 22:08 ` [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 21:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli, Yuyang Du

The increased scale (resolution) for kernel load has been disabled since
commit e4c2fb0d5776 ("sched: Disable (revert) SCHED_LOAD_SCALE
increase"). But we do need it when we have task groups, especially on
bigger machines; otherwise we will probably run out of precision when
distributing load.

So reinstate it, and resolve to fix whatever power regression may show
up.
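
A purely illustrative example of the precision problem: a task group at
the nice-0 weight has 1024 units of shares in the old scale. Spread over
64 CPUs that leaves at most 1024/64 = 16 units per CPU, and a group
nested under it would round down to almost nothing. With the increased
scale the same splits give 1048576/64 = 16384 and 16384/64 = 256, which
is still meaningful.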

Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/sched.h |   51 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 871da67..5f66a2c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,37 +42,36 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
 #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
 /*
- * Increase resolution of nice-level calculations for 64-bit architectures.
- * The extra resolution improves shares distribution and load balancing of
- * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
- * hierarchies, especially on larger systems. This is not a user-visible change
- * and does not change the user-interface for setting shares/weights.
+ * Task weight (visible and set by user) and its load (invisible to user)
+ * can have independent ranges. We increase the scale of load for 64-bit
+ * architectures. The extra precision improves share distribution and
+ * load balancing of low-weight task groups (e.g., nice +19 on an autogroup),
+ * deeper taskgroup hierarchies, especially on larger systems. This is not
+ * a user-visible change and does not change the user-interface for setting
+ * shares/weights. We increase resolution only if we have enough bits to allow
+ * this increased precision (i.e., BITS_PER_LONG > 32). The costs for increasing
+ * resolution when BITS_PER_LONG <= 32 are pretty high and the returns do not
+ * justify the increased costs.
  *
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
- * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
- * increased costs.
- */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
-# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
-#else
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	(w)
-# define kernel_to_user_load(w)	(w)
-#endif
-
-/*
- * Task weight (visible to user) and its load (invisible to user) have
- * independent resolution, but they should be well calibrated. We use
- * user_to_kernel_load() and kernel_to_user_load(w) to convert between
- * them. The following must be true:
+ * Therefore, the user load and the kernel load should be well calibrated
+ * so that they can easily be converted. We use user_to_kernel_load() and
+ * kernel_to_user_load(w) to convert between them.
  *
+ * The following equations illustrate their relationship:
  * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
  * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
  */
-#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
+#if defined(CONFIG_64BIT) && defined(CONFIG_FAIR_GROUP_SCHED)
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	(w << SCHED_FIXEDPOINT_SHIFT)
+#define kernel_to_user_load(w)	(w >> SCHED_FIXEDPOINT_SHIFT)
+#else
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	(w)
+#define kernel_to_user_load(w)	(w)
+#endif
+
+#define NICE_0_LOAD		(1UL << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages
  2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
                   ` (11 preceding siblings ...)
  2016-05-02 21:54 ` [PATCH v2 12/12] sched/fair: Enable increased scale for kernel load Yuyang Du
@ 2016-05-02 22:08 ` Yuyang Du
  12 siblings, 0 replies; 18+ messages in thread
From: Yuyang Du @ 2016-05-02 22:08 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, juri.lelli

[-- Attachment #1: Type: text/plain, Size: 2029 bytes --]

Hi,

This patch series should introduce no perceivable change to load
and util, except that load's range is scaled up by a factor of 1024.

My initial tests suggest as much; see the attached figures. The workloads
run 100us out of every 200us, and 2000us out of every 8000us, again with
a fixed workload, fixed CPU, and fixed frequency.

In addition, the code should be cleaner and more efficient after the
patches.

Thanks,
Yuyang

On Tue, May 03, 2016 at 05:54:26AM +0800, Yuyang Du wrote:
> Hi Peter,
> 
> This patch series combines the previous cleanup and optimization
> series. And as you and Ingo suggested, the increased kernel load
> scale is reinstated when on 64BIT and FAIR_GROUP_SCHED. In addition
> to that, the changes include Vincent's fix, typos fixes, changelog
> and comment reword.
> 
> Thanks,
> Yuyang
> 
> Yuyang Du (12):
>   sched/fair: Optimize sum computation with a lookup table
>   sched/fair: Rename variable names for sched averages
>   sched/fair: Change the variable to hold the number of periods to
>     32bit integer
>   sched/fair: Add __always_inline compiler attribute to
>     __accumulate_sum()
>   sched/fair: Optimize __update_sched_avg()
>   documentation: Add scheduler/sched-avg.txt
>   sched/fair: Generalize the load/util averages resolution definition
>   sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
>   sched/fair: Add introduction to the sched average metrics
>   sched/fair: Remove scale_load_down() for load_avg
>   sched/fair: Rename scale_load() and scale_load_down()
>   sched/fair: Enable increased scale for kernel load
> 
>  Documentation/scheduler/sched-avg.txt |  137 ++++++++++++
>  include/linux/sched.h                 |   81 ++++++-
>  kernel/sched/core.c                   |    8 +-
>  kernel/sched/fair.c                   |  398 +++++++++++++++++----------------
>  kernel/sched/sched.h                  |   48 ++--
>  5 files changed, 439 insertions(+), 233 deletions(-)
>  create mode 100644 Documentation/scheduler/sched-avg.txt
> 
> -- 
> 1.7.9.5

[-- Attachment #2: 100_200_sched_avg.jpg --]
[-- Type: image/jpeg, Size: 35877 bytes --]

[-- Attachment #3: 2000_8000_sched_avg.jpg --]
[-- Type: image/jpeg, Size: 39046 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 03/12] sched/fair: Change the variable to hold the number of periods to 32bit integer
  2016-05-02 21:54 ` [PATCH v2 03/12] sched/fair: Change the variable to hold the number of periods to 32bit integer Yuyang Du
@ 2016-05-03  8:47   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-05-03  8:47 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, juri.lelli

On Tue, May 03, 2016 at 05:54:29AM +0800, Yuyang Du wrote:
> In sched average update, a period is about 1ms, so a 32-bit unsigned
> integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)
> days, which means it is big enough and 64-bit is needless.

This fails to explain _why_ 49 days is enough. And what the 49 days is
enough for.

What happens when a task sleeps for more than 49 days?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 05/12] sched/fair: Optimize __update_sched_avg()
  2016-05-02 21:54 ` [PATCH v2 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
@ 2016-05-03  8:49   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-05-03  8:49 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, juri.lelli

On Tue, May 03, 2016 at 05:54:31AM +0800, Yuyang Du wrote:
> __update_sched_avg() has these steps:
> 1. add the left of the last incomplete period
> 2. decay old sum
> 3. accumulate new sum since last_update_time
> 4. add the current incomplete period
> 5. update averages
> 
> Previously, we separately computed steps 1, 3, and 4, leading to
> each one of them ugly in codes and costly in overhead. But actually
> they all do the same thing, so we combine them together. The result
> will be much cleaner codes and less CPU cycles.

I would very much like to see an explanation of how we can fold all that
here. Without me having to untangle the code first.

That also helps me to verify if the code does indeed implement what you
meant it to, etc.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 06/12] documentation: Add scheduler/sched-avg.txt
  2016-05-02 21:54 ` [PATCH v2 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
@ 2016-05-03  8:50   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-05-03  8:50 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, juri.lelli

On Tue, May 03, 2016 at 05:54:32AM +0800, Yuyang Du wrote:
> This doc file has the programs to generate the constants to compute
> sched averages.
> 
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  Documentation/scheduler/sched-avg.txt |  137 +++++++++++++++++++++++++++++++++
>  1 file changed, 137 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-avg.txt
> 
> diff --git a/Documentation/scheduler/sched-avg.txt b/Documentation/scheduler/sched-avg.txt
> new file mode 100644
> index 0000000..ae4132f
> --- /dev/null
> +++ b/Documentation/scheduler/sched-avg.txt
> @@ -0,0 +1,137 @@
> +The following programs are used to generate the constants for
> +computing sched averages.
> +
> +==============================================================
> +		C program (compile with -lm)
> +==============================================================

> +==============================================================
> +		Python script if you speak snake
> +==============================================================

Please do not put two programs in one file. Also, I think I would prefer
it to be one program; having two just means they can drift apart and
create confusion.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tip:sched/core] sched/fair: Optimize sum computation with a lookup table
  2016-05-02 21:54 ` [PATCH v2 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
@ 2016-05-05  9:41   ` tip-bot for Yuyang Du
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot for Yuyang Du @ 2016-05-05  9:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, efault, hpa, vincent.guittot, yuyang.du, torvalds,
	linux-kernel, tglx, morten.rasmussen

Commit-ID:  7b20b916e953cabef569541f991a0a583bc344cb
Gitweb:     http://git.kernel.org/tip/7b20b916e953cabef569541f991a0a583bc344cb
Author:     Yuyang Du <yuyang.du@intel.com>
AuthorDate: Tue, 3 May 2016 05:54:27 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 09:41:08 +0200

sched/fair: Optimize sum computation with a lookup table

__compute_runnable_contrib() uses a loop to compute sum, whereas a
table lookup can do it faster in a constant amount of time.

The program to generate the constants is located at:

  Documentation/scheduler/sched-avg.txt

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@arm.com
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1462226078-31904-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e148571..8c381a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2603,6 +2603,16 @@ static const u32 runnable_avg_yN_sum[] = {
 };
 
 /*
+ * Precomputed \Sum y^k { 1<=k<=n, where n%32=0 }. Values are rolled down to
+ * lower integers. See Documentation/scheduler/sched-avg.txt for how these
+ * were generated:
+ */
+static const u32 __accumulated_sum_N32[] = {
+	    0, 23371, 35056, 40899, 43820, 45281,
+	46011, 46376, 46559, 46650, 46696, 46719,
+};
+
+/*
  * Approximate:
  *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
  */
@@ -2650,14 +2660,9 @@ static u32 __compute_runnable_contrib(u64 n)
 	else if (unlikely(n >= LOAD_AVG_MAX_N))
 		return LOAD_AVG_MAX;
 
-	/* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
-	do {
-		contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
-		contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
-
-		n -= LOAD_AVG_PERIOD;
-	} while (n > LOAD_AVG_PERIOD);
-
+	/* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
+	contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
+	n %= LOAD_AVG_PERIOD;
 	contrib = decay_load(contrib, n);
 	return contrib + runnable_avg_yN_sum[n];
 }

^ permalink raw reply related	[flat|nested] 18+ messages in thread
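
For reference, the new table can be reproduced directly from the loop
being removed: every additional LOAD_AVG_PERIOD halves the running sum
and adds runnable_avg_yN_sum[LOAD_AVG_PERIOD] (23371). A stand-alone
sketch of that recurrence is below; it is illustrative only, the real
generator lives in Documentation/scheduler/sched-avg.txt. Its output
matches the __accumulated_sum_N32[] values above.

#include <stdio.h>

#define LOAD_AVG_PERIOD	32
#define Y_N_SUM_MAX	23371	/* runnable_avg_yN_sum[LOAD_AVG_PERIOD] */

int main(void)
{
	unsigned int contrib = 0;
	int k;

	/* replay the removed loop once per multiple of LOAD_AVG_PERIOD */
	for (k = 0; k <= 11; k++) {
		printf("%5u,%s", contrib, k % 6 == 5 ? "\n" : " ");
		contrib = contrib / 2 + Y_N_SUM_MAX;	/* y^LOAD_AVG_PERIOD = 1/2 */
	}
	return 0;
}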

end of thread, other threads:[~2016-05-05  9:42 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-02 21:54 [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
2016-05-02 21:54 ` [PATCH v2 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
2016-05-05  9:41   ` [tip:sched/core] " tip-bot for Yuyang Du
2016-05-02 21:54 ` [PATCH v2 02/12] sched/fair: Rename variable names for sched averages Yuyang Du
2016-05-02 21:54 ` [PATCH v2 03/12] sched/fair: Change the variable to hold the number of periods to 32bit integer Yuyang Du
2016-05-03  8:47   ` Peter Zijlstra
2016-05-02 21:54 ` [PATCH v2 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
2016-05-02 21:54 ` [PATCH v2 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
2016-05-03  8:49   ` Peter Zijlstra
2016-05-02 21:54 ` [PATCH v2 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
2016-05-03  8:50   ` Peter Zijlstra
2016-05-02 21:54 ` [PATCH v2 07/12] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
2016-05-02 21:54 ` [PATCH v2 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
2016-05-02 21:54 ` [PATCH v2 09/12] sched/fair: Add introduction to the sched average metrics Yuyang Du
2016-05-02 21:54 ` [PATCH v2 10/12] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
2016-05-02 21:54 ` [PATCH v2 11/12] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
2016-05-02 21:54 ` [PATCH v2 12/12] sched/fair: Enable increased scale for kernel load Yuyang Du
2016-05-02 22:08 ` [PATCH v2 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).