* [PATCH v3 0/6] sched/fair: Clean up sched metric definitions
@ 2016-04-05  4:12 Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Yuyang Du @ 2016-04-05  4:12 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Hi Peter,

Would you please give it a look?

This series cleans up the sched metrics; the changes include:
(1) Define SCHED_FIXEDPOINT_SHIFT for all fixed point arithmetic scaling.
(2) Get rid of confusing scaling factors: SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE,
    and thus only leave NICE_0_LOAD (for load) and SCHED_CAPACITY_SCALE (for util).
(3) Consistently use SCHED_CAPACITY_SCALE for all util-related metrics.
(4) Add detailed introduction to the sched metrics.
(5) Get rid of unnecessary scaling up and down for load.
(6) Rename the mappings between priority (user) and load (kernel).
(7) Move inactive code.

The previous version is at: http://thread.gmane.org/gmane.linux.kernel/2187272

v3 changes:
(1) Rebase to current tip
(2) Changelog fix, thanks to Ben.

Thanks,
Yuyang

---

Yuyang Du (6):
  sched/fair: Generalize the load/util averages resolution definition
  sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  sched/fair: Add introduction to the sched load avg metrics
  sched/fair: Remove scale_load_down() for load_avg
  sched/fair: Rename scale_load() and scale_load_down()
  sched/fair: Move (inactive) option from code to config

 include/linux/sched.h | 81 +++++++++++++++++++++++++++++++++++++++++++--------
 init/Kconfig          | 16 ++++++++++
 kernel/sched/core.c   |  8 ++---
 kernel/sched/fair.c   | 33 ++++++++++-----------
 kernel/sched/sched.h  | 52 +++++++++++++++------------------
 5 files changed, 127 insertions(+), 63 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v3 1/6] sched/fair: Generalize the load/util averages resolution definition
  2016-04-05  4:12 [PATCH v3 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
@ 2016-04-05  4:12 ` Yuyang Du
  2016-05-05  9:39   ` [tip:sched/core] " tip-bot for Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Yuyang Du @ 2016-04-05  4:12 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Integer metrics need fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.

In order to avoid errors relating to the fixed point range, we define
a basic fixed point range, and then formalize all metrics based on
that basic range.

The basic range is 1024 or (1 << 10). Further, one can recursively
apply the basic range to obtain larger ranges.

As pointed out by Ben Segall, weight (visible to users, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.
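
As an illustration only (a standalone userspace sketch, not part of the
patch itself), the basic range and its "recursive" application look like
this:

#include <assert.h>
#include <stdio.h>

/* mirrors the definitions introduced by this patch */
#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)	/* 1024 */

int main(void)
{
	/* user-visible NICE-0 weight, expressed in the basic range */
	long nice_0_weight = SCHED_FIXEDPOINT_SCALE;

	/* applying the basic range once more gives a larger (load) range */
	long nice_0_load = nice_0_weight << SCHED_FIXEDPOINT_SHIFT;

	/* the two ranges stay calibrated: converting back is lossless */
	assert((nice_0_load >> SCHED_FIXEDPOINT_SHIFT) == nice_0_weight);

	printf("weight=%ld load=%ld\n", nice_0_weight, nice_0_load);
	return 0;
}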

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h | 16 +++++++++++++---
 kernel/sched/fair.c   |  4 ----
 kernel/sched/sched.h  | 15 ++++++++++-----
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e848c..bc652fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -935,9 +935,19 @@ enum cpu_idle_type {
 };
 
 /*
+ * Integer metrics need fixed point arithmetic, e.g., sched/fair
+ * has a few: load, load_avg, util_avg, freq, and capacity.
+ *
+ * We define a basic fixed point arithmetic range, and then formalize
+ * all these metrics based on that basic range.
+ */
+# define SCHED_FIXEDPOINT_SHIFT	10
+# define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)
+
+/*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT	10
+#define SCHED_CAPACITY_SHIFT	SCHED_FIXEDPOINT_SHIFT
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
 /*
@@ -1203,8 +1213,8 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
  * The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8cc1c3..88ab334 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2662,10 +2662,6 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7cbad7..03ce2de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,18 +54,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_RESOLUTION	10
-# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION	0
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+/*
+ * NICE_0's weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them; the following must be true:
+ * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ */
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  2016-04-05  4:12 [PATCH v3 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
@ 2016-04-05  4:12 ` Yuyang Du
  2016-05-05  9:40   ` [tip:sched/core] sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE tip-bot for Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 3/6] sched/fair: Add introduction to the sched load avg metrics Yuyang Du
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Yuyang Du @ 2016-04-05  4:12 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

After cleaning up the sched metrics, these two definitions, which cause
ambiguity, are no longer needed. Use NICE_0_LOAD_SHIFT and NICE_0_LOAD
instead (their names clearly convey what they are).

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c  |  4 ++--
 kernel/sched/sched.h | 22 +++++++++++-----------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 88ab334..aca96d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -719,7 +719,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
-	long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -6954,7 +6954,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
+				      scale_load_down(NICE_0_LOAD);
 		if (load_above_capacity > busiest->group_capacity)
 			load_above_capacity -= busiest->group_capacity;
 		else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 03ce2de..6aafe6c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,25 +54,25 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
-
 /*
- * NICE_0's weight (visible to user) and its load (invisible to user) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them; the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
  */
-#define NICE_0_LOAD		SCHED_LOAD_SCALE
-#define NICE_0_SHIFT		SCHED_LOAD_SHIFT
+#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
@@ -859,7 +859,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 struct sched_group_capacity {
 	atomic_t ref;
 	/*
-	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
 	unsigned int capacity;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 3/6] sched/fair: Add introduction to the sched load avg metrics
  2016-04-05  4:12 [PATCH v3 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
@ 2016-04-05  4:12 ` Yuyang Du
  2016-05-05  9:41   ` [tip:sched/core] sched/fair: Add detailed description " tip-bot for Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Yuyang Du @ 2016-04-05  4:12 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

These sched metrics have become complex enough that they deserve an
introduction; add one at their definition.
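
As a worked example of the definitions added below (my own numbers, a
sketch only): a NICE-0 task that is runnable 50% of the time and actually
running 25% of the time on a full-capacity CPU at the highest frequency
ends up with roughly:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define NICE_0_WEIGHT		1024	/* scale_load_down(NICE_0_LOAD) */

int main(void)
{
	/* load_avg = runnable% * scale_load_down(load) */
	unsigned long load_avg = 50 * NICE_0_WEIGHT / 100;		/* ~512 */

	/* util_avg = running% * SCHED_CAPACITY_SCALE */
	unsigned long util_avg = 25 * SCHED_CAPACITY_SCALE / 100;	/* ~256 */

	printf("load_avg=%lu util_avg=%lu\n", load_avg, util_avg);
	return 0;
}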

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h | 60 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc652fe..b0a6cf0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1209,18 +1209,56 @@ struct load_weight {
 };
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors frequency scaling into the amount of time that a
- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
- * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
- * For cfs_rq, it is the aggregated such times of all runnable and
+ * The load_avg/util_avg accumulates an infinite geometric series
+ * (see __update_load_avg() in kernel/sched/fair.c).
+ *
+ * [load_avg definition]
+ *
+ * load_avg = runnable% * scale_load_down(load)
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated such load_avg of all runnable and
  * blocked sched_entities.
- * The 64 bit load_sum can:
- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
- * the highest weight (=88761) always runnable, we should not overflow
- * 2) for entity, support any load.weight always runnable
+ *
+ * load_avg may also take frequency scaling into account:
+ *
+ * load_avg = runnable% * scale_load_down(load) * freq%
+ *
+ * where freq% is the CPU frequency normalized to the highest frequency
+ *
+ * [util_avg definition]
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE
+ *
+ * where running% is the time ratio that a sched_entity is running on
+ * a CPU. For cfs_rq, it is the aggregated such util_avg of all runnable
+ * and blocked sched_entities.
+ *
+ * util_avg may also factor frequency scaling and CPU capacity scaling:
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
+ *
+ * where freq% is the same as above, and capacity% is the CPU capacity
+ * normalized to the greatest capacity (due to uarch differences, etc).
+ *
+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
+ * themselves are in the range of [0, 1]. To do fixed point arithmetic,
+ * we therefore scale them to as large a range as necessary. This is,
+ * for example, reflected by util_avg's SCHED_CAPACITY_SCALE.
+ *
+ * [Overflow issue]
+ *
+ * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
+ * with the highest load (=88761) always runnable on a single cfs_rq; we
+ * should not overflow, as that number already exceeds PID_MAX_LIMIT.
+ *
+ * For all other cases (including 32bit kernel), struct load_weight's
+ * weight will overflow first before we do, because:
+ *
+ *    Max(load_avg) <= Max(load.weight)
+ *
+ * Then, it is the load_weight's responsibility to consider overflow
+ * issues.
  */
 struct sched_avg {
 	u64 last_update_time, load_sum;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg
  2016-04-05  4:12 [PATCH v3 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
                   ` (2 preceding siblings ...)
  2016-04-05  4:12 ` [PATCH v3 3/6] sched/fair: Add introduction to the sched load avg metrics Yuyang Du
@ 2016-04-05  4:12 ` Yuyang Du
  2016-04-28 10:25   ` Peter Zijlstra
  2016-04-05  4:12 ` [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
  2016-04-05  4:12 ` [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config Yuyang Du
  5 siblings, 1 reply; 19+ messages in thread
From: Yuyang Du @ 2016-04-05  4:12 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
down of load does not make much sense, because load_avg is primarily THE
load, and on top of that we take runnable time into account.

We therefore remove scale_load_down() for load_avg. But we need to
carefully consider the overflow risk if load has the higher range
(2*SCHED_FIXEDPOINT_SHIFT). The only case in which an overflow may occur
due to this change is a 64bit kernel with the increased load range. In
that case, the 64bit load_sum can accommodate 4251057 (=2^64/47742/88761/1024)
entities with the highest load (=88761*1024) always runnable on one
single cfs_rq, which may be an issue, but should be fine. Even if this
ever occurs, under the conditions in which it occurs the load average
will not be useful anyway.
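
The bound above can be reproduced with a trivial userspace sketch (the
constants being LOAD_AVG_MAX = 47742 and the nice -20 weight 88761):

#include <inttypes.h>
#include <stdio.h>

#define LOAD_AVG_MAX		47742ULL	/* max PELT geometric series sum */
#define MAX_NICE_WEIGHT		88761ULL	/* sched_prio_to_weight[0] (nice -20) */
#define SCHED_FIXEDPOINT_SCALE	1024ULL

int main(void)
{
	/* ~0ULL is 2^64 - 1, close enough for this bound */
	uint64_t small_range = ~0ULL / LOAD_AVG_MAX / MAX_NICE_WEIGHT;
	uint64_t large_range = small_range / SCHED_FIXEDPOINT_SCALE;

	/* prints 4353082796 and 4251057, the figures quoted above */
	printf("max always-runnable entities: %" PRIu64 " (small range), %" PRIu64 " (large range)\n",
	       small_range, large_range);
	return 0;
}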

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 19 ++++++++++++++-----
 kernel/sched/fair.c   | 19 +++++++++----------
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0a6cf0..8d2e8f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1214,7 +1214,7 @@ struct load_weight {
  *
  * [load_avg definition]
  *
- * load_avg = runnable% * scale_load_down(load)
+ * load_avg = runnable% * load
  *
  * where runnable% is the time ratio that a sched_entity is runnable.
  * For cfs_rq, it is the aggregated such load_avg of all runnable and
@@ -1222,7 +1222,7 @@ struct load_weight {
  *
  * load_avg may also take frequency scaling into account:
  *
- * load_avg = runnable% * scale_load_down(load) * freq%
+ * load_avg = runnable% * load * freq%
  *
  * where freq% is the CPU frequency normalized to the highest frequency
  *
@@ -1248,9 +1248,18 @@ struct load_weight {
  *
  * [Overflow issue]
  *
- * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
- * with the highest load (=88761) always runnable on a single cfs_rq; we
- * should not overflow, as that number already exceeds PID_MAX_LIMIT.
+ * On a 64bit kernel:
+ *
+ * When load has the small fixed point range (SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
+ * the highest load (=88761) always runnable on a cfs_rq; we should
+ * not overflow, as that number already exceeds PID_MAX_LIMIT.
+ *
+ * When load has the large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
+ * with the highest load (=88761*1024) always runnable on ONE cfs_rq;
+ * we should be fine. Even if the overflow ever occurs, by that time
+ * the load_avg won't be useful anyway.
  *
  * For all other cases (including 32bit kernel), struct load_weight's
  * weight will overflow first before we do, because:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aca96d7..2be613b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	sa->load_avg = se->load.weight;
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
@@ -2890,7 +2890,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+		cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
@@ -2912,8 +2912,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
+	__update_load_avg(now, cpu, &se->avg, se->on_rq * se->load.weight,
 			  cfs_rq->curr == se, NULL);
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
@@ -2973,7 +2972,7 @@ skip_aging:
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			  &se->avg, se->on_rq * se->load.weight,
 			  cfs_rq->curr == se, NULL);
 
 	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
@@ -2993,7 +2992,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	migrated = !sa->last_update_time;
 	if (!migrated) {
 		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
+			se->on_rq * se->load.weight,
 			cfs_rq->curr == se, NULL);
 	}
 
@@ -6953,10 +6952,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-				      scale_load_down(NICE_0_LOAD);
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
+		if (load_above_capacity > scale_load(busiest->group_capacity))
+			load_above_capacity -=
+				scale_load(busiest->group_capacity);
 		else
 			load_above_capacity = ~0UL;
 	}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()
  2016-04-05  4:12 [PATCH v3 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
                   ` (3 preceding siblings ...)
  2016-04-05  4:12 ` [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
@ 2016-04-05  4:12 ` Yuyang Du
  2016-04-28  9:19   ` Peter Zijlstra
  2016-04-05  4:12 ` [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config Yuyang Du
  5 siblings, 1 reply; 19+ messages in thread
From: Yuyang Du @ 2016-04-05  4:12 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Rename scale_load() and scale_load_down() to user_to_kernel_load()
and kernel_to_user_load() respectively, so that the names convey
what they are really about.
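
For reference, the calibration the renamed helpers must preserve (a
userspace sketch of the increased-range variant; only the NICE-0 entry of
sched_prio_to_weight is spelled out here):

#include <assert.h>

#define SCHED_FIXEDPOINT_SHIFT	10

/* increased-range variants of the converters being renamed */
#define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
#define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)

#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)

int main(void)
{
	long nice_0_weight = 1024;	/* sched_prio_to_weight[20] */

	/* the invariant documented in the code comment, in both directions */
	assert(user_to_kernel_load(nice_0_weight) == NICE_0_LOAD);
	assert(kernel_to_user_load(NICE_0_LOAD) == nice_0_weight);
	return 0;
}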

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/core.c  |  8 ++++----
 kernel/sched/fair.c  | 14 ++++++++------
 kernel/sched/sched.h | 16 ++++++++--------
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1159423..21c08c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -695,12 +695,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (idle_policy(p->policy)) {
-		load->weight = scale_load(WEIGHT_IDLEPRIO);
+		load->weight = user_to_kernel_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
+	load->weight = user_to_kernel_load(sched_prio_to_weight[prio]);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -8180,7 +8180,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	return sched_group_set_shares(css_tg(css), user_to_kernel_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -8188,7 +8188,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (u64) kernel_to_user_load(tg->shares);
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2be613b..0b6659d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
 	if (likely(lw->inv_weight))
 		return;
 
-	w = scale_load_down(lw->weight);
+	w = kernel_to_user_load(lw->weight);
 
 	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 		lw->inv_weight = 1;
@@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
  */
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 {
-	u64 fact = scale_load_down(weight);
+	u64 fact = kernel_to_user_load(weight);
 	int shift = WMULT_SHIFT;
 
 	__update_inv_weight(lw);
@@ -6952,10 +6952,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
+		unsigned long min_cpu_load =
+			kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
 		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
-		if (load_above_capacity > scale_load(busiest->group_capacity))
-			load_above_capacity -=
-				scale_load(busiest->group_capacity);
+		if (load_above_capacity > min_cpu_load)
+			load_above_capacity -= min_cpu_load;
 		else
 			load_above_capacity = ~0UL;
 	}
@@ -8510,7 +8511,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, user_to_kernel_load(MIN_SHARES),
+		       user_to_kernel_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6aafe6c..b00e6e5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,22 +55,22 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
+# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		(w)
-# define scale_load_down(w)	(w)
+# define user_to_kernel_load(w)	(w)
+# define kernel_to_user_load(w)	(w)
 #endif
 
 /*
  * Task weight (visible to user) and its load (invisible to user) have
  * independent resolution, but they should be well calibrated. We use
- * scale_load() and scale_load_down(w) to convert between them. The
- * following must be true:
- *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them. The following must be true:
  *
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config
  2016-04-05  4:12 [PATCH v3 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
                   ` (4 preceding siblings ...)
  2016-04-05  4:12 ` [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
@ 2016-04-05  4:12 ` Yuyang Du
  2016-04-28  9:37   ` Peter Zijlstra
  5 siblings, 1 reply; 19+ messages in thread
From: Yuyang Du @ 2016-04-05  4:12 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

The option of increased load resolution (fixed point arithmetic range) is
unconditionally deactivated with #if 0. But since it may still be used
somewhere (e.g., in Google), we want to keep this option.

Regardless, there should be a way to express this option. Considering the
current circumstances, the compromise is to define a config option,
CONFIG_CFS_INCREASE_LOAD_RANGE, which depends on FAIR_GROUP_SCHED,
64BIT, and BROKEN.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 init/Kconfig         | 16 +++++++++++++++
 kernel/sched/sched.h | 55 +++++++++++++++++++++-------------------------------
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0dfd09d..ad75ff7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1026,6 +1026,22 @@ config CFS_BANDWIDTH
 	  restriction.
 	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
 
+config CFS_INCREASE_LOAD_RANGE
+	bool "Increase kernel load range"
+	depends on 64BIT && BROKEN
+	default n
+	help
+	  Increase resolution of nice-level calculations for 64-bit architectures.
+	  The extra resolution improves shares distribution and load balancing of
+	  low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+	  hierarchies, especially on larger systems. This is not a user-visible change
+	  and does not change the user-interface for setting shares/weights.
+	  We increase resolution only if we have enough bits to allow this increased
+	  resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+	  when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+	  increased costs.
+	  Currently broken: it increases power usage under light load.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on CGROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b00e6e5..aafb3e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,39 +42,6 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
 #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
 /*
- * Increase resolution of nice-level calculations for 64-bit architectures.
- * The extra resolution improves shares distribution and load balancing of
- * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
- * hierarchies, especially on larger systems. This is not a user-visible change
- * and does not change the user-interface for setting shares/weights.
- *
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
- * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
- * increased costs.
- */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
-# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
-#else
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	(w)
-# define kernel_to_user_load(w)	(w)
-#endif
-
-/*
- * Task weight (visible to user) and its load (invisible to user) have
- * independent resolution, but they should be well calibrated. We use
- * user_to_kernel_load() and kernel_to_user_load(w) to convert between
- * them. The following must be true:
- *
- * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
- * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
- */
-#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
-
-/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -1150,6 +1117,28 @@ extern const int sched_prio_to_weight[40];
 extern const u32 sched_prio_to_wmult[40];
 
 /*
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them.
+ *
+ * The following must also be true:
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
+ */
+#ifdef CONFIG_CFS_INCREASE_LOAD_RANGE
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	(w << SCHED_FIXEDPOINT_SHIFT)
+#define kernel_to_user_load(w)	(w >> SCHED_FIXEDPOINT_SHIFT)
+#else
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	(w)
+#define kernel_to_user_load(w)	(w)
+#endif
+
+#define NICE_0_LOAD		(1UL << NICE_0_LOAD_SHIFT)
+
+/*
  * {de,en}queue flags:
  *
  * DEQUEUE_SLEEP  - task is no longer runnable
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg
  2016-04-28 10:25   ` Peter Zijlstra
@ 2016-04-28  3:01     ` Yuyang Du
  2016-04-28 19:29     ` Yuyang Du
  1 sibling, 0 replies; 19+ messages in thread
From: Yuyang Du @ 2016-04-28  3:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Thu, Apr 28, 2016 at 12:25:32PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:29PM +0800, Yuyang Du wrote:
> > Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> > down of load does not make much sense, because load_avg is primarily THE
> > load and on top of that, we take runnable time into account.
> > 
> > We therefore remove scale_load_down() for load_avg. But we need to
> > carefully consider the overflow risk if load has higher range
> > (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> > to us is on 64bit kernel with increased load range. In that case,
> > the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> > entities with the highest load (=88761*1024) always runnable on one
> > single cfs_rq, which may be an issue, but should be fine. Even if this
> > occurs at the end of day, on the condition where it occurs, the
> > load average will not be useful anyway.
> 
> I do feel we need a little more words on the actual ramification of
> overflowing here.
> 
> Yes, having 4m tasks on a single runqueue will be somewhat unlikely, but
> if it happens, then what will the user experience? How long (if ever)
> does it take for numbers to correct themselves etc..
> 
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> > [update calculate_imbalance]
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> This SoB Chain suggests you wrote it and Vincent send it on, yet this
> email is from you and Vincent isn't anywhere. Something's not right.

Since you started to review patches, I just sent you more. :) What a coincidence.

I actually don't know the rules for the SoB chain; let me learn how to do
a proper co-sign-off.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()
  2016-04-05  4:12 ` [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
@ 2016-04-28  9:19   ` Peter Zijlstra
  2016-04-28 11:18     ` Vincent Guittot
  2016-04-28 20:30     ` Yuyang Du
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2016-04-28  9:19 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Tue, Apr 05, 2016 at 12:12:30PM +0800, Yuyang Du wrote:
> Rename scale_load() and scale_load_down() to user_to_kernel_load()
> and kernel_to_user_load() respectively, to allow the names to bear
> what they are really about.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
>  	if (likely(lw->inv_weight))
>  		return;
>  
> -	w = scale_load_down(lw->weight);
> +	w = kernel_to_user_load(lw->weight);
>  
>  	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
>  		lw->inv_weight = 1;
> @@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
>   */
>  static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
>  {
> -	u64 fact = scale_load_down(weight);
> +	u64 fact = kernel_to_user_load(weight);
>  	int shift = WMULT_SHIFT;
>  
>  	__update_inv_weight(lw);
> @@ -6952,10 +6952,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	 */
>  	if (busiest->group_type == group_overloaded &&
>  	    local->group_type   == group_overloaded) {
> +		unsigned long min_cpu_load =
> +			kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
>  		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
> -		if (load_above_capacity > scale_load(busiest->group_capacity))
> -			load_above_capacity -=
> -				scale_load(busiest->group_capacity);
> +		if (load_above_capacity > min_cpu_load)
> +			load_above_capacity -= min_cpu_load;
>  		else
>  			load_above_capacity = ~0UL;
>  	}

Except these 3 really are not about user/kernel visible fixed point
ranges _at_all_... :/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config
  2016-04-05  4:12 ` [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config Yuyang Du
@ 2016-04-28  9:37   ` Peter Zijlstra
  2016-04-28  9:45     ` Ingo Molnar
  2016-04-28 20:34     ` Yuyang Du
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2016-04-28  9:37 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Tue, Apr 05, 2016 at 12:12:31PM +0800, Yuyang Du wrote:
> The option of increased load resolution (fixed point arithmetic range) is
> unconditionally deactivated with #if 0. But since it may still be used
> somewhere (e.g., in Google), we want to keep this option.
> 
> Regardless, there should be a way to express this option. Considering the
> current circumstances, the reconciliation is we define a config
> CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> 64BIT and BROKEN.
> 
> Suggested-by: Ingo Molnar <mingo@kernel.org>

So I'm very tempted to simply, unconditionally, reinstate this larger
range for everything CONFIG_64BIT && CONFIG_FAIR_GROUP_SCHED.

There was but the single claim on increased power usage, nobody could
reproduce / analyze and Google has been running with this for years now.

Furthermore, it seems to be leading to the obvious problems on bigger
machines where we basically run out of precision by the sheer number of
cpus (nr_cpus ~ SCHED_LOAD_SCALE and stuff comes apart quickly).
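
A back-of-the-envelope illustration of that cliff (numbers picked for the
example only): spread a group's default 1024 shares evenly over the CPUs
and the per-CPU slice rounds down to nothing in the 10-bit range, while
the 20-bit range still has headroom:

#include <stdio.h>

int main(void)
{
	unsigned long shares_lo = 1024;		/* default shares, 10-bit range */
	unsigned long shares_hi = 1024UL << 10;	/* same, with the increased range */

	/* per-CPU slice of an evenly spread group weight (integer math) */
	for (unsigned int cpus = 128; cpus <= 2048; cpus *= 2)
		printf("%4u cpus: %4lu (low res) vs %7lu (high res)\n",
		       cpus, shares_lo / cpus, shares_hi / cpus);
	return 0;
}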

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config
  2016-04-28  9:37   ` Peter Zijlstra
@ 2016-04-28  9:45     ` Ingo Molnar
  2016-04-28 20:34     ` Yuyang Du
  1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2016-04-28  9:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Apr 05, 2016 at 12:12:31PM +0800, Yuyang Du wrote:
> > The option of increased load resolution (fixed point arithmetic range) is
> > unconditionally deactivated with #if 0. But since it may still be used
> > somewhere (e.g., in Google), we want to keep this option.
> > 
> > Regardless, there should be a way to express this option. Considering the
> > current circumstances, the reconciliation is we define a config
> > CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> > 64BIT and BROKEN.
> > 
> > Suggested-by: Ingo Molnar <mingo@kernel.org>
> 
> So I'm very tempted to simply, unconditionally, reinstate this larger
> range for everything CONFIG_64BIT && CONFIG_FAIR_GROUP_SCHED.
> 
> There was but the single claim on increased power usage, nobody could
> reproduce / analyze and Google has been running with this for years now.
> 
> Furthermore, it seems to be leading to the obvious problems on bigger
> machines where we basically run out of precision by the sheer number of
> cpus (nr_cpus ~ SCHED_LOAD_SCALE and stuff comes apart quickly).

Agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg
  2016-04-05  4:12 ` [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
@ 2016-04-28 10:25   ` Peter Zijlstra
  2016-04-28  3:01     ` Yuyang Du
  2016-04-28 19:29     ` Yuyang Du
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2016-04-28 10:25 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Tue, Apr 05, 2016 at 12:12:29PM +0800, Yuyang Du wrote:
> Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> down of load does not make much sense, because load_avg is primarily THE
> load and on top of that, we take runnable time into account.
> 
> We therefore remove scale_load_down() for load_avg. But we need to
> carefully consider the overflow risk if load has higher range
> (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> to us is on 64bit kernel with increased load range. In that case,
> the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> entities with the highest load (=88761*1024) always runnable on one
> single cfs_rq, which may be an issue, but should be fine. Even if this
> occurs at the end of day, on the condition where it occurs, the
> load average will not be useful anyway.

I do feel we need a little more words on the actual ramification of
overflowing here.

Yes, having 4m tasks on a single runqueue will be somewhat unlikely, but
if it happens, then what will the user experience? How long (if ever)
does it take for numbers to correct themselves etc..

> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> [update calculate_imbalance]
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

This SoB Chain suggests you wrote it and Vincent send it on, yet this
email is from you and Vincent isn't anywhere. Something's not right.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()
  2016-04-28  9:19   ` Peter Zijlstra
@ 2016-04-28 11:18     ` Vincent Guittot
  2016-04-28 20:30     ` Yuyang Du
  1 sibling, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2016-04-28 11:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	dietmar.eggemann, lizefan, umgwanakikbuti

On Thursday 28 Apr 2016 at 11:19:19 (+0200), Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:30PM +0800, Yuyang Du wrote:
> > Rename scale_load() and scale_load_down() to user_to_kernel_load()
> > and kernel_to_user_load() respectively, to allow the names to bear
> > what they are really about.
> 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
> >  	if (likely(lw->inv_weight))
> >  		return;
> >  
> > -	w = scale_load_down(lw->weight);
> > +	w = kernel_to_user_load(lw->weight);
> >  
> >  	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
> >  		lw->inv_weight = 1;
> > @@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
> >   */
> >  static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
> >  {
> > -	u64 fact = scale_load_down(weight);
> > +	u64 fact = kernel_to_user_load(weight);
> >  	int shift = WMULT_SHIFT;
> >  
> >  	__update_inv_weight(lw);
> > @@ -6952,10 +6952,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  	 */
> >  	if (busiest->group_type == group_overloaded &&
> >  	    local->group_type   == group_overloaded) {
> > +		unsigned long min_cpu_load =
> > +			kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
> >  		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
> > -		if (load_above_capacity > scale_load(busiest->group_capacity))
> > -			load_above_capacity -=
> > -				scale_load(busiest->group_capacity);
> > +		if (load_above_capacity > min_cpu_load)
> > +			load_above_capacity -= min_cpu_load;
> >  		else
> >  			load_above_capacity = ~0UL;
> >  	}
> 
> Except these 3 really are not about user/kernel visible fixed point
> ranges _at_all_... :/

While trying to optimize the calculation of min_cpu_load, I have broken everything.

It should be:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b6659d..3411eb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6953,7 +6953,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		unsigned long min_cpu_load =
-			kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
+			busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE;
 		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
 		if (load_above_capacity > min_cpu_load)
 			load_above_capacity -= min_cpu_load;


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg
  2016-04-28 10:25   ` Peter Zijlstra
  2016-04-28  3:01     ` Yuyang Du
@ 2016-04-28 19:29     ` Yuyang Du
  1 sibling, 0 replies; 19+ messages in thread
From: Yuyang Du @ 2016-04-28 19:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Thu, Apr 28, 2016 at 12:25:32PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:29PM +0800, Yuyang Du wrote:
> > Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> > down of load does not make much sense, because load_avg is primarily THE
> > load and on top of that, we take runnable time into account.
> > 
> > We therefore remove scale_load_down() for load_avg. But we need to
> > carefully consider the overflow risk if load has higher range
> > (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> > to us is on 64bit kernel with increased load range. In that case,
> > the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> > entities with the highest load (=88761*1024) always runnable on one
> > single cfs_rq, which may be an issue, but should be fine. Even if this
> > occurs at the end of day, on the condition where it occurs, the
> > load average will not be useful anyway.
> 
> I do feel we need a little more words on the actual ramification of
> overflowing here.
> 
> Yes, having 4m tasks on a single runqueue will be somewhat unlikely, but
> if it happens, then what will the user experience? How long (if ever)
> does it take for numbers to correct themselves etc..

Well, regarding the user experience, this would need a stress test study.

But if the system can miraculously survive, and we end up in the scenario
that we have a ~0ULL load_sum and the rq suddenly drops to 0 load, it
would take roughly 2 seconds (=32ms*64) to converge. That time is the bound.
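
A trivial sketch of that bound (assuming the usual 32ms PELT half-life,
i.e. the sum halves every 32ms once the rq contributes nothing):

#include <stdio.h>

int main(void)
{
	unsigned long long load_sum = ~0ULL;	/* worst case: saturated 64bit sum */
	unsigned int ms = 0;

	while (load_sum) {
		load_sum >>= 1;		/* one half-life of decay */
		ms += 32;
	}
	printf("fully decayed after ~%u ms (%u halvings)\n", ms, ms / 32);
	return 0;
}

which prints ~2048 ms, i.e. the 32ms*64 above.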

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()
  2016-04-28  9:19   ` Peter Zijlstra
  2016-04-28 11:18     ` Vincent Guittot
@ 2016-04-28 20:30     ` Yuyang Du
  1 sibling, 0 replies; 19+ messages in thread
From: Yuyang Du @ 2016-04-28 20:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Thu, Apr 28, 2016 at 11:19:19AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:30PM +0800, Yuyang Du wrote:
> > Rename scale_load() and scale_load_down() to user_to_kernel_load()
> > and kernel_to_user_load() respectively, to allow the names to bear
> > what they are really about.
> 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
> >  	if (likely(lw->inv_weight))
> >  		return;
> >  
> > -	w = scale_load_down(lw->weight);
> > +	w = kernel_to_user_load(lw->weight);
> >  
> >  	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
> >  		lw->inv_weight = 1;
> > @@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
> >   */
> >  static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
> >  {
> > -	u64 fact = scale_load_down(weight);
> > +	u64 fact = kernel_to_user_load(weight);
> >  	int shift = WMULT_SHIFT;
> >  
> >  	__update_inv_weight(lw);

[snip]
 
> Except these 3 really are not about user/kernel visible fixed point
> ranges _at_all_... :/

But are the above two falling back to user fixed point precision? The
reason would be that we can't efficiently do the multiply/divide with
the increased fixed point range for kernel load.
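
(For anyone following along, a much-simplified userspace sketch of the
multiply-by-inverse idea being discussed -- not the kernel's actual
__calc_delta, and it assumes the small 10-bit weight range so the
intermediate products stay within 64 bits:)

#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

#define WMULT_SHIFT	32

/* delta * weight / total_weight, with the division replaced by a
 * precomputed inverse; the 32-bit shift is split in two so the
 * intermediate products keep fitting in 64 bits */
static uint64_t calc_delta_sketch(uint64_t delta, uint64_t weight,
				  uint64_t total_weight)
{
	uint64_t inv = (1ULL << WMULT_SHIFT) / total_weight;

	return ((delta * weight) >> 16) * inv >> (WMULT_SHIFT - 16);
}

int main(void)
{
	/* two NICE-0 tasks sharing a rq: each gets about half of a 6ms slice */
	uint64_t slice = calc_delta_sketch(6000000ULL, 1024, 2048);

	printf("slice = %" PRIu64 " ns\n", slice);	/* ~3000000 */
	assert(slice > 2900000 && slice < 3100000);
	return 0;
}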

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config
  2016-04-28  9:37   ` Peter Zijlstra
  2016-04-28  9:45     ` Ingo Molnar
@ 2016-04-28 20:34     ` Yuyang Du
  1 sibling, 0 replies; 19+ messages in thread
From: Yuyang Du @ 2016-04-28 20:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, bsegall, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Thu, Apr 28, 2016 at 11:37:33AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:31PM +0800, Yuyang Du wrote:
> > The option of increased load resolution (fixed point arithmetic range) is
> > unconditionally deactivated with #if 0. But since it may still be used
> > somewhere (e.g., in Google), we want to keep this option.
> > 
> > Regardless, there should be a way to express this option. Considering the
> > current circumstances, the reconciliation is we define a config
> > CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> > 64BIT and BROKEN.
> > 
> > Suggested-by: Ingo Molnar <mingo@kernel.org>
> 
> So I'm very tempted to simply, unconditionally, reinstate this larger
> range for everything CONFIG_64BIT && CONFIG_FAIR_GROUP_SCHED.
> 
> There was but the single claim on increased power usage, nobody could
> reproduce / analyze and Google has been running with this for years now.
> 
> Furthermore, it seems to be leading to the obvious problems on bigger
> machines where we basically run out of precision by the sheer number of
> cpus (nr_cpus ~ SCHED_LOAD_SCALE and stuff comes apart quickly).
 
Great.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [tip:sched/core] sched/fair: Generalize the load/util averages resolution definition
  2016-04-05  4:12 ` [PATCH v3 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
@ 2016-05-05  9:39   ` tip-bot for Yuyang Du
  0 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Yuyang Du @ 2016-05-05  9:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, linux-kernel, torvalds, efault, mingo, peterz, yuyang.du

Commit-ID:  6ecdd74962f246dfe8750b7bea481a1c0816315d
Gitweb:     http://git.kernel.org/tip/6ecdd74962f246dfe8750b7bea481a1c0816315d
Author:     Yuyang Du <yuyang.du@intel.com>
AuthorDate: Tue, 5 Apr 2016 12:12:26 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 09:24:00 +0200

sched/fair: Generalize the load/util averages resolution definition

Integer metrics need fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.

In order to avoid errors relating to the fixed point range, we define
a basic fixed point range, and then formalize all metrics based on
that basic range.

The basic range is 1024 or (1 << 10). Further, one can recursively
apply the basic range to obtain larger ranges.

As pointed out by Ben Segall, weight (visible to users, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 16 +++++++++++++---
 kernel/sched/fair.c   |  4 ----
 kernel/sched/sched.h  | 15 ++++++++++-----
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d894f2d..7d779d7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -937,9 +937,19 @@ enum cpu_idle_type {
 };
 
 /*
+ * Integer metrics need fixed point arithmetic, e.g., sched/fair
+ * has a few: load, load_avg, util_avg, freq, and capacity.
+ *
+ * We define a basic fixed point arithmetic range, and then formalize
+ * all these metrics based on that basic range.
+ */
+# define SCHED_FIXEDPOINT_SHIFT	10
+# define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)
+
+/*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT	10
+#define SCHED_CAPACITY_SHIFT	SCHED_FIXEDPOINT_SHIFT
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
 /*
@@ -1205,8 +1215,8 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
  * The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 91395e1..76ca86e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2662,10 +2662,6 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 066a4c2..ad83361 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -56,18 +56,23 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * increase coverage and consistency always enable it on 64bit platforms.
  */
 #ifdef CONFIG_64BIT
-# define SCHED_LOAD_RESOLUTION	10
-# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION	0
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+/*
+ * NICE_0's weight (visible to users) and its load (invisible to users) have
+ * independent ranges, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them, and the following must be true:
+ * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ */
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [tip:sched/core] sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE
  2016-04-05  4:12 ` [PATCH v3 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
@ 2016-05-05  9:40   ` tip-bot for Yuyang Du
  0 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Yuyang Du @ 2016-05-05  9:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, peterz, yuyang.du, torvalds, bsegall, hpa,
	tglx, efault

Commit-ID:  172895e6b5216eba3e0880460829a8baeefd55f3
Gitweb:     http://git.kernel.org/tip/172895e6b5216eba3e0880460829a8baeefd55f3
Author:     Yuyang Du <yuyang.du@intel.com>
AuthorDate: Tue, 5 Apr 2016 12:12:27 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 09:35:21 +0200

sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE

After cleaning up the sched metrics, there are two definitions that are
ambiguous and confusing: SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE.

Resolve this:

 - Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT, which better reflects what
   it is.

 - Replace SCHED_LOAD_SCALE use with SCHED_CAPACITY_SCALE and remove SCHED_LOAD_SCALE.

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-3-git-send-email-yuyang.du@intel.com
[ Rewrote the changelog and fixed the build on 32-bit kernels. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  |  4 ++--
 kernel/sched/sched.h | 22 +++++++++++-----------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76ca86e..e148571 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -719,7 +719,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
-	long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -7010,7 +7010,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
+				      scale_load_down(NICE_0_LOAD);
 		if (load_above_capacity > busiest->group_capacity)
 			load_above_capacity -= busiest->group_capacity;
 		else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ad83361..d24e91b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -56,25 +56,25 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * increase coverage and consistency always enable it on 64bit platforms.
  */
 #ifdef CONFIG_64BIT
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
-
 /*
- * NICE_0's weight (visible to users) and its load (invisible to users) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them, and the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to users) and its load (invisible to users) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
  */
-#define NICE_0_LOAD		SCHED_LOAD_SCALE
-#define NICE_0_SHIFT		SCHED_LOAD_SHIFT
+#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
@@ -863,7 +863,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 struct sched_group_capacity {
 	atomic_t ref;
 	/*
-	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
 	unsigned int capacity;
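
A small userspace sketch of the values involved (illustrative only; compile
with -DCONFIG_64BIT to mimic the 64-bit branch): with either branch,
SCHED_CAPACITY_SCALE and scale_load_down(NICE_0_LOAD) both evaluate to the
basic 1024 fixed-point unit, so the util and load-above-capacity terms are
expressed on the same scale as group capacity. The 300 used for the cfs_rq
util_avg is an assumed value.

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)

#ifdef CONFIG_64BIT
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
# define scale_load_down(w)	(w)
#endif
#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)

int main(void)
{
	/* Both configurations end up at the basic 1024 unit here */
	printf("scale_load_down(NICE_0_LOAD) = %ld\n", scale_load_down(NICE_0_LOAD));
	printf("SCHED_CAPACITY_SCALE         = %ld\n", (long)SCHED_CAPACITY_SCALE);

	/* Initial util cap, with an assumed cfs_rq util_avg of 300 */
	long util_avg = 300;
	long cap = (long)(SCHED_CAPACITY_SCALE - util_avg) / 2;
	printf("initial util cap = %ld\n", cap);	/* (1024 - 300) / 2 = 362 */

	return 0;
}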

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [tip:sched/core] sched/fair: Add detailed description to the sched load avg metrics
  2016-04-05  4:12 ` [PATCH v3 3/6] sched/fair: Add introduction to the sched load avg metrics Yuyang Du
@ 2016-05-05  9:41   ` tip-bot for Yuyang Du
  0 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Yuyang Du @ 2016-05-05  9:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, yuyang.du, torvalds, hpa, efault, tglx, peterz, mingo

Commit-ID:  7b5953345efe4f226bb52cbea04558d16ec7ebfa
Gitweb:     http://git.kernel.org/tip/7b5953345efe4f226bb52cbea04558d16ec7ebfa
Author:     Yuyang Du <yuyang.du@intel.com>
AuthorDate: Tue, 5 Apr 2016 12:12:28 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 09:41:08 +0200

sched/fair: Add detailed description to the sched load avg metrics

These sched metrics have become complex enough that they warrant a
detailed description at their definition.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Fixed the text to improve its spelling and typography. ]
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-4-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 60 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d779d7..57faf78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,18 +1211,56 @@ struct load_weight {
 };
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors frequency scaling into the amount of time that a
- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
- * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
- * For cfs_rq, it is the aggregated such times of all runnable and
+ * The load_avg/util_avg accumulates an infinite geometric series
+ * (see __update_load_avg() in kernel/sched/fair.c).
+ *
+ * [load_avg definition]
+ *
+ *   load_avg = runnable% * scale_load_down(load)
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated load_avg of all runnable and
  * blocked sched_entities.
- * The 64 bit load_sum can:
- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
- * the highest weight (=88761) always runnable, we should not overflow
- * 2) for entity, support any load.weight always runnable
+ *
+ * load_avg may also take frequency scaling into account:
+ *
+ *   load_avg = runnable% * scale_load_down(load) * freq%
+ *
+ * where freq% is the CPU frequency normalized to the highest frequency.
+ *
+ * [util_avg definition]
+ *
+ *   util_avg = running% * SCHED_CAPACITY_SCALE
+ *
+ * where running% is the time ratio that a sched_entity is running on
+ * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
+ * and blocked sched_entities.
+ *
+ * util_avg may also factor frequency scaling and CPU capacity scaling:
+ *
+ *   util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
+ *
+ * where freq% is the same as above, and capacity% is the CPU capacity
+ * normalized to the greatest capacity (due to uarch differences, etc).
+ *
+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
+ * themselves are in the range of [0, 1]. To do fixed point arithmetics,
+ * we therefore scale them to as large a range as necessary. This is for
+ * example reflected by util_avg's SCHED_CAPACITY_SCALE.
+ *
+ * [Overflow issue]
+ *
+ * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities
+ * with the highest load (=88761), always runnable on a single cfs_rq,
+ * and should not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * For all other cases (including 32-bit kernels), struct load_weight's
+ * weight will overflow first before we do, because:
+ *
+ *    Max(load_avg) <= Max(load.weight)
+ *
+ * Then it is the load_weight's responsibility to consider overflow
+ * issues.
  */
 struct sched_avg {
 	u64 last_update_time, load_sum;
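
To make the ratios concrete, a minimal userspace sketch with assumed numbers:
a NICE-0 entity on a 64-bit kernel that is runnable 50% and running 25% of
the time. The kernel itself derives these ratios from the decayed geometric
series in __update_load_avg(), not from fixed percentages.

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	/* Assumed figures: NICE-0 load on a 64-bit kernel, fixed ratios */
	long load = 1024L << SCHED_FIXEDPOINT_SHIFT;	/* NICE_0_LOAD */
	double runnable_pct = 0.50;	/* runnable 50% of the time */
	double running_pct  = 0.25;	/* running 25% of the time */

	/* load_avg = runnable% * scale_load_down(load) */
	double load_avg = runnable_pct * scale_load_down(load);

	/* util_avg = running% * SCHED_CAPACITY_SCALE */
	double util_avg = running_pct * SCHED_CAPACITY_SCALE;

	printf("load_avg = %.0f\n", load_avg);	/* 512 */
	printf("util_avg = %.0f\n", util_avg);	/* 256 */
	return 0;
}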

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-05-05  9:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
2016-04-05  4:12 [PATCH v3 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
2016-04-05  4:12 ` [PATCH v3 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
2016-05-05  9:39   ` [tip:sched/core] " tip-bot for Yuyang Du
2016-04-05  4:12 ` [PATCH v3 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
2016-05-05  9:40   ` [tip:sched/core] sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE tip-bot for Yuyang Du
2016-04-05  4:12 ` [PATCH v3 3/6] sched/fair: Add introduction to the sched load avg metrics Yuyang Du
2016-05-05  9:41   ` [tip:sched/core] sched/fair: Add detailed description " tip-bot for Yuyang Du
2016-04-05  4:12 ` [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
2016-04-28 10:25   ` Peter Zijlstra
2016-04-28  3:01     ` Yuyang Du
2016-04-28 19:29     ` Yuyang Du
2016-04-05  4:12 ` [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
2016-04-28  9:19   ` Peter Zijlstra
2016-04-28 11:18     ` Vincent Guittot
2016-04-28 20:30     ` Yuyang Du
2016-04-05  4:12 ` [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config Yuyang Du
2016-04-28  9:37   ` Peter Zijlstra
2016-04-28  9:45     ` Ingo Molnar
2016-04-28 20:34     ` Yuyang Du
