* [PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions
@ 2016-03-30 20:16 Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Yuyang Du @ 2016-03-30 20:16 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Hi Peter,

This patch series was left over from last year, so I am resending it. Would you
please give it a look?

The previous version is at http://thread.gmane.org/gmane.linux.kernel/2068513

This series cleans up the sched metrics. The changes include:
(1) Define SCHED_FIXEDPOINT_SHIFT for all fixed point arithmetic scaling.
(2) Get rid of confusing scaling factors: SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE,
    and thus leave only NICE_0_LOAD (for load) and SCHED_CAPACITY_SCALE (for util).
(3) Consistently use SCHED_CAPACITY_SCALE for util-related metrics.
(4) Add more detailed introduction to the sched metrics.
(5) Get rid of unnecessary extra scaling up and down for load.
(6) Rename the mappings between priority (user) and load (kernel).
(7) Remove/replace inactive code.

So, except for (5), we did not change any logic. At Ingo's request, I checked
the disassembly of kernel/sched/built-in.o before vs. after the patches. From
the very first patch to the last, there are a bunch of "offset" changes, all
like the pattern:

     60e3:      eb 21                   jmp    6106 <rq_clock+0x7c>
-    60e5:      be db 02 00 00          mov    $0x2db,%esi
+    60e5:      be e0 02 00 00          mov    $0x2e0,%esi

I have no idea what changed, but to venture a guess: the code layout changed a bit?

Anyway, thanks a lot to Ben, Morten, Dietmar, Vincent, and others who provided
valuable comments.

v2 changes:
- Rename SCHED_RESOLUTION_SHIFT to SCHED_FIXEDPOINT_SHIFT, thanks to Peter
- Fix bugs in calculate_imbalance(), thanks to Vincent
- Fix "#if 0" for increased kernel load, suggested by Ingo

Thanks,
Yuyang

Yuyang Du (6):
  sched/fair: Generalize the load/util averages resolution definition
  sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  sched/fair: Add introduction to the sched load avg metrics
  sched/fair: Remove scale_load_down() for load_avg
  sched/fair: Rename scale_load() and scale_load_down()
  sched/fair: Remove unconditionally inactive code

 include/linux/sched.h | 81 +++++++++++++++++++++++++++++++++++++++++++--------
 init/Kconfig          | 16 ++++++++++
 kernel/sched/core.c   |  8 ++---
 kernel/sched/fair.c   | 33 ++++++++++-----------
 kernel/sched/sched.h  | 52 +++++++++++++++------------------
 5 files changed, 127 insertions(+), 63 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH RESEND v2 1/6] sched/fair: Generalize the load/util averages resolution definition
  2016-03-30 20:16 [PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
@ 2016-03-30 20:16 ` Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Yuyang Du @ 2016-03-30 20:16 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Integer metrics need fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.

In order to avoid errors related to the fixed point range, we define
a basic fixed point range, and then formalize all metrics based on
that basic range.

The basic range is 1024, i.e., (1 << 10). Further, one can recursively
apply the basic range to obtain a larger range.

As pointed out by Ben Segall, weight (visible to the user, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.
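
For illustration only, here is a minimal standalone sketch of the fixed
point convention described above (the constants mirror the patch; the
helper is made up for this example and is not kernel API):

#include <stdio.h>

/* Basic fixed point range, as in the patch: 1024 == (1 << 10). */
#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)

/* Represent a percentage as a fraction of the basic fixed point range. */
static long pct_to_fixed(long pct)
{
	return pct * SCHED_FIXEDPOINT_SCALE / 100;
}

int main(void)
{
	/* 50% runnable time becomes ~512 in the basic range. */
	printf("50%% -> %ld\n", pct_to_fixed(50));

	/*
	 * Recursively applying the basic range yields a larger range,
	 * e.g., 2*SCHED_FIXEDPOINT_SHIFT bits for increased resolution.
	 */
	printf("larger range: %ld\n",
	       SCHED_FIXEDPOINT_SCALE << SCHED_FIXEDPOINT_SHIFT);
	return 0;
}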

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h | 16 +++++++++++++---
 kernel/sched/fair.c   |  4 ----
 kernel/sched/sched.h  | 15 ++++++++++-----
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c617ea1..54784d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -934,9 +934,19 @@ enum cpu_idle_type {
 };
 
 /*
+ * Integer metrics need fixed point arithmetic, e.g., sched/fair
+ * has a few: load, load_avg, util_avg, freq, and capacity.
+ *
+ * We define a basic fixed point arithmetic range, and then formalize
+ * all these metrics based on that basic range.
+ */
+# define SCHED_FIXEDPOINT_SHIFT	10
+# define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)
+
+/*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT	10
+#define SCHED_CAPACITY_SHIFT	SCHED_FIXEDPOINT_SHIFT
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
 /*
@@ -1202,8 +1212,8 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
  * The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 303d639..1d3fc01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2609,10 +2609,6 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e6d4a3f..15a89ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,18 +54,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_RESOLUTION	10
-# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION	0
+# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+/*
+ * NICE_0's weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them, the following must be true:
+ * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ */
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RESEND v2 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  2016-03-30 20:16 [PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
@ 2016-03-30 20:16 ` Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 3/6] sched/fair: Add introduction to the sched load avg metrics Yuyang Du
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Yuyang Du @ 2016-03-30 20:16 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

After cleaning up the sched metrics, these two ambiguous definitions are
no longer needed. Use NICE_0_LOAD_SHIFT and NICE_0_LOAD instead (the
names clearly say what they are).

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c  |  4 ++--
 kernel/sched/sched.h | 22 +++++++++++-----------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d3fc01..bf835b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,7 +682,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
-	sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
+	sa->util_avg = SCHED_CAPACITY_SCALE;
 	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -6877,7 +6877,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
 		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
+				      scale_load_down(NICE_0_LOAD);
 		if (load_above_capacity > busiest->group_capacity)
 			load_above_capacity -= busiest->group_capacity;
 		else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15a89ee..94ba652 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,25 +54,25 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
-
 /*
- * NICE_0's weight (visible to user) and its load (invisible to user) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them, the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
  */
-#define NICE_0_LOAD		SCHED_LOAD_SCALE
-#define NICE_0_SHIFT		SCHED_LOAD_SHIFT
+#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
@@ -859,7 +859,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 struct sched_group_capacity {
 	atomic_t ref;
 	/*
-	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
 	unsigned int capacity;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RESEND v2 3/6] sched/fair: Add introduction to the sched load avg metrics
  2016-03-30 20:16 [PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 1/6] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
@ 2016-03-30 20:16 ` Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 4/6] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Yuyang Du @ 2016-03-30 20:16 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

These sched metrics have become complex enough to deserve an
introduction, so we document them at their definition.
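
As a rough back-of-envelope illustration of the definitions added below
(a standalone sketch with made-up numbers, not kernel code):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024L

int main(void)
{
	/* A NICE-0 task (load 1024) that is runnable 50% of the time. */
	long load = 1024, runnable_pct = 50;
	long load_avg = load * runnable_pct / 100;		/* ~512 */

	/*
	 * A task running 25% of the time on a CPU clocked at 80% of its
	 * highest frequency (capacity% taken as 100% here).
	 */
	long running_pct = 25, freq_pct = 80;
	long util_avg = SCHED_CAPACITY_SCALE * running_pct / 100
			* freq_pct / 100;			/* ~204 */

	printf("load_avg ~= %ld, util_avg ~= %ld\n", load_avg, util_avg);
	return 0;
}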

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h | 60 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 54784d0..db3c6e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1208,18 +1208,56 @@ struct load_weight {
 };
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors frequency scaling into the amount of time that a
- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
- * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
- * For cfs_rq, it is the aggregated such times of all runnable and
+ * The load_avg/util_avg accumulates an infinite geometric series
+ * (see __update_load_avg() in kernel/sched/fair.c).
+ *
+ * [load_avg definition]
+ *
+ * load_avg = runnable% * scale_load_down(load)
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated such load_avg of all runnable and
  * blocked sched_entities.
- * The 64 bit load_sum can:
- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
- * the highest weight (=88761) always runnable, we should not overflow
- * 2) for entity, support any load.weight always runnable
+ *
+ * load_avg may also take frequency scaling into account:
+ *
+ * load_avg = runnable% * scale_load_down(load) * freq%
+ *
+ * where freq% is the CPU frequency normalize to the highest frequency
+ *
+ * [util_avg definition]
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE
+ *
+ * where running% is the time ratio that a sched_entity is running on
+ * a CPU. For cfs_rq, it is the aggregated such util_avg of all runnable
+ * and blocked sched_entities.
+ *
+ * util_avg may also factor frequency scaling and CPU capacity scaling:
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
+ *
+ * where freq% is the same as above, and capacity% is the CPU capacity
+ * normalized to the greatest capacity (due to uarch differences, etc).
+ *
+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
+ * themselves are in the range of [0, 1]. To do fixed point arithmetic,
+ * we therefore scale them to as large range as necessary. This is for
+ * example reflected by util_avg's SCHED_CAPACITY_SCALE.
+ *
+ * [Overflow issue]
+ *
+ * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
+ * with the highest load (=88761) always runnable on a single cfs_rq, we
+ * should not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * For all other cases (including 32bit kernel), struct load_weight's
+ * weight will overflow first before we do, because:
+ *
+ *    Max(load_avg) <= Max(load.weight)
+ *
+ * Then, it is the load_weight's responsibility to consider overflow
+ * issues.
  */
 struct sched_avg {
 	u64 last_update_time, load_sum;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RESEND v2 4/6] sched/fair: Remove scale_load_down() for load_avg
  2016-03-30 20:16 [PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
                   ` (2 preceding siblings ...)
  2016-03-30 20:16 ` [PATCH RESEND v2 3/6] sched/fair: Add introduction to the sched load avg metrics Yuyang Du
@ 2016-03-30 20:16 ` Yuyang Du
  2016-03-31 17:45   ` bsegall
  2016-03-30 20:16 ` [PATCH RESEND v2 5/6] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 6/6] sched/fair: Remove unconditionally inactive code Yuyang Du
  5 siblings, 1 reply; 10+ messages in thread
From: Yuyang Du @ 2016-03-30 20:16 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
down of load does not make much sense, because load_avg is primarily THE
load, and on top of that we take runnable time into account.

We therefore remove scale_load_down() for load_avg. But we need to
carefully consider the overflow risk if load has the higher range
(2*SCHED_FIXEDPOINT_SHIFT). The only case in which this change can cause
an overflow is a 64bit kernel with the increased load range. In that case,
the 64bit load_sum can accommodate 4251057 (=2^64/47742/88761/1024)
entities with the highest load (=88761*1024) always runnable on one
single cfs_rq, which may be an issue, but should be fine. Even if this
overflow occurs at the end of the day, under the conditions where it
occurs the load average will not be useful anyway.
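
For reference, a quick standalone check of the headroom arithmetic above
(LOAD_AVG_MAX = 47742 and the highest weight 88761 are taken from the
text; this is just arithmetic, not the kernel code paths):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t load_avg_max = 47742;	/* LOAD_AVG_MAX */
	const uint64_t max_weight   = 88761;	/* highest (nice -20) weight */

	/* Small load range: 2^64/47742/88761 entities, ~4353082796 */
	uint64_t small = UINT64_MAX / load_avg_max / max_weight;

	/* Increased load range: one extra factor of 1024, ~4251057 */
	uint64_t large = small / 1024;

	printf("small range: %llu entities\n", (unsigned long long)small);
	printf("large range: %llu entities\n", (unsigned long long)large);
	return 0;
}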

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 19 ++++++++++++++-----
 kernel/sched/fair.c   | 19 +++++++++----------
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index db3c6e1..8df6d69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1213,7 +1213,7 @@ struct load_weight {
  *
  * [load_avg definition]
  *
- * load_avg = runnable% * scale_load_down(load)
+ * load_avg = runnable% * load
  *
  * where runnable% is the time ratio that a sched_entity is runnable.
  * For cfs_rq, it is the aggregated such load_avg of all runnable and
@@ -1221,7 +1221,7 @@ struct load_weight {
  *
  * load_avg may also take frequency scaling into account:
  *
- * load_avg = runnable% * scale_load_down(load) * freq%
+ * load_avg = runnable% * load * freq%
  *
  * where freq% is the CPU frequency normalize to the highest frequency
  *
@@ -1247,9 +1247,18 @@ struct load_weight {
  *
  * [Overflow issue]
  *
- * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
- * with the highest load (=88761) always runnable on a single cfs_rq, we
- * should not overflow as the number already hits PID_MAX_LIMIT.
+ * On 64bit kernel:
+ *
+ * When load has small fixed point range (SCHED_FIXEDPOINT_SHIFT), the
+ * 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
+ * the highest load (=88761) always runnable on a cfs_rq, we should
+ * not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * When load has large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
+ * with the highest load (=88761*1024) always runnable on ONE cfs_rq,
+ * we should be fine. Even if the overflow occurs at the end of day,
+ * at the time the load_avg won't be useful anyway in that situation.
  *
  * For all other cases (including 32bit kernel), struct load_weight's
  * weight will overflow first before we do, because:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf835b5..da6642f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	sa->load_avg = se->load.weight;
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	sa->util_avg = SCHED_CAPACITY_SCALE;
 	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
@@ -2837,7 +2837,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+		cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
@@ -2858,8 +2858,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
+	__update_load_avg(now, cpu, &se->avg, se->on_rq * se->load.weight,
 			  cfs_rq->curr == se, NULL);
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
@@ -2896,7 +2895,7 @@ skip_aging:
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
+			  &se->avg, se->on_rq * se->load.weight,
 			  cfs_rq->curr == se, NULL);
 
 	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
@@ -2916,7 +2915,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	migrated = !sa->last_update_time;
 	if (!migrated) {
 		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
+			se->on_rq * se->load.weight,
 			cfs_rq->curr == se, NULL);
 	}
 
@@ -6876,10 +6875,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-				      scale_load_down(NICE_0_LOAD);
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
+		if (load_above_capacity > scale_load(busiest->group_capacity))
+			load_above_capacity -=
+				scale_load(busiest->group_capacity);
 		else
 			load_above_capacity = ~0UL;
 	}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RESEND v2 5/6] sched/fair: Rename scale_load() and scale_load_down()
  2016-03-30 20:16 [PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
                   ` (3 preceding siblings ...)
  2016-03-30 20:16 ` [PATCH RESEND v2 4/6] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
@ 2016-03-30 20:16 ` Yuyang Du
  2016-03-30 20:16 ` [PATCH RESEND v2 6/6] sched/fair: Remove unconditionally inactive code Yuyang Du
  5 siblings, 0 replies; 10+ messages in thread
From: Yuyang Du @ 2016-03-30 20:16 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

Rename scale_load() and scale_load_down() to user_to_kernel_load()
and kernel_to_user_load() respectively, so that the names convey
what they really do.
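
A minimal standalone sketch of the calibration invariant mentioned above,
checkable under both resolution settings (the NICE-0 user weight of 1024
is hard-coded for the example; INCREASED_RESOLUTION is just a build flag
for this sketch, not a kernel config symbol):

#include <assert.h>

#define SCHED_FIXEDPOINT_SHIFT	10

#ifdef INCREASED_RESOLUTION
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
# define user_to_kernel_load(w)	(w)
# define kernel_to_user_load(w)	(w)
#endif

#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)

int main(void)
{
	long nice_0_weight = 1024;	/* sched_prio_to_weight[20] */

	/* The user weight <-> kernel load round trip must hold. */
	assert(user_to_kernel_load(nice_0_weight) == NICE_0_LOAD);
	assert(kernel_to_user_load(NICE_0_LOAD) == nice_0_weight);
	return 0;
}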

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
[update calculate_imbalance]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  8 ++++----
 kernel/sched/fair.c  | 14 ++++++++------
 kernel/sched/sched.h | 16 ++++++++--------
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b21e7a..81c876e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -677,12 +677,12 @@ static void set_load_weight(struct task_struct *p)
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
 	if (idle_policy(p->policy)) {
-		load->weight = scale_load(WEIGHT_IDLEPRIO);
+		load->weight = user_to_kernel_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
+	load->weight = user_to_kernel_load(sched_prio_to_weight[prio]);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -8110,7 +8110,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	return sched_group_set_shares(css_tg(css), user_to_kernel_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -8118,7 +8118,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (u64) kernel_to_user_load(tg->shares);
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da6642f..bcf1027 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
 	if (likely(lw->inv_weight))
 		return;
 
-	w = scale_load_down(lw->weight);
+	w = kernel_to_user_load(lw->weight);
 
 	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
 		lw->inv_weight = 1;
@@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
  */
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 {
-	u64 fact = scale_load_down(weight);
+	u64 fact = kernel_to_user_load(weight);
 	int shift = WMULT_SHIFT;
 
 	__update_inv_weight(lw);
@@ -6875,10 +6875,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
+		unsigned long min_cpu_load =
+			kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
 		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
-		if (load_above_capacity > scale_load(busiest->group_capacity))
-			load_above_capacity -=
-				scale_load(busiest->group_capacity);
+		if (load_above_capacity > min_cpu_load)
+			load_above_capacity -= min_cpu_load;
 		else
 			load_above_capacity = ~0UL;
 	}
@@ -8432,7 +8433,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, user_to_kernel_load(MIN_SHARES),
+		       user_to_kernel_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 94ba652..ebe16e3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,22 +55,22 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
+# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w)		(w)
-# define scale_load_down(w)	(w)
+# define user_to_kernel_load(w)	(w)
+# define kernel_to_user_load(w)	(w)
 #endif
 
 /*
  * Task weight (visible to user) and its load (invisible to user) have
  * independent resolution, but they should be well calibrated. We use
- * scale_load() and scale_load_down(w) to convert between them. The
- * following must be true:
- *
- *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them. The following must be true:
  *
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
  */
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RESEND v2 6/6] sched/fair: Remove unconditionally inactive code
  2016-03-30 20:16 [PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions Yuyang Du
                   ` (4 preceding siblings ...)
  2016-03-30 20:16 ` [PATCH RESEND v2 5/6] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
@ 2016-03-30 20:16 ` Yuyang Du
  2016-03-31 17:53   ` bsegall
  5 siblings, 1 reply; 10+ messages in thread
From: Yuyang Du @ 2016-03-30 20:16 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, lizefan, umgwanakikbuti, Yuyang Du

The increased load resolution (fixed point arithmetic range) is
unconditionally deactivated with #if 0, so it is effectively dead code.

But the increased load range is still used in some places (e.g., at
Google), so we keep the feature. The reconciliation is that we define
CONFIG_CFS_INCREASE_LOAD_RANGE, which depends on FAIR_GROUP_SCHED,
64BIT, and BROKEN.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 init/Kconfig         | 16 +++++++++++++++
 kernel/sched/sched.h | 55 +++++++++++++++++++++-------------------------------
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 2232080..d072c09 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1025,6 +1025,22 @@ config CFS_BANDWIDTH
 	  restriction.
 	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
 
+config CFS_INCREASE_LOAD_RANGE
+	bool "Increase kernel load range"
+	depends on 64BIT && BROKEN
+	default n
+	help
+	  Increase resolution of nice-level calculations for 64-bit architectures.
+	  The extra resolution improves shares distribution and load balancing of
+	  low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+	  hierarchies, especially on larger systems. This is not a user-visible change
+	  and does not change the user-interface for setting shares/weights.
+	  We increase resolution only if we have enough bits to allow this increased
+	  resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+	  when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+	  increased costs.
+	  Currently broken: it increases power usage under light load.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on CGROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ebe16e3..1bb0d69 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,39 +42,6 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
 #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
 /*
- * Increase resolution of nice-level calculations for 64-bit architectures.
- * The extra resolution improves shares distribution and load balancing of
- * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
- * hierarchies, especially on larger systems. This is not a user-visible change
- * and does not change the user-interface for setting shares/weights.
- *
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
- * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
- * increased costs.
- */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
-# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
-#else
-# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)	(w)
-# define kernel_to_user_load(w)	(w)
-#endif
-
-/*
- * Task weight (visible to user) and its load (invisible to user) have
- * independent resolution, but they should be well calibrated. We use
- * user_to_kernel_load() and kernel_to_user_load(w) to convert between
- * them. The following must be true:
- *
- * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
- * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
- */
-#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
-
-/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -1150,6 +1117,28 @@ extern const int sched_prio_to_weight[40];
 extern const u32 sched_prio_to_wmult[40];
 
 /*
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them.
+ *
+ * The following must also be true:
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
+ */
+#ifdef CONFIG_CFS_INCREASE_LOAD_RANGE
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	(w << SCHED_FIXEDPOINT_SHIFT)
+#define kernel_to_user_load(w)	(w >> SCHED_FIXEDPOINT_SHIFT)
+#else
+#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w)	(w)
+#define kernel_to_user_load(w)	(w)
+#endif
+
+#define NICE_0_LOAD		(1UL << NICE_0_LOAD_SHIFT)
+
+/*
  * {de,en}queue flags:
  *
  * DEQUEUE_SLEEP  - task is no longer runnable
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND v2 4/6] sched/fair: Remove scale_load_down() for load_avg
  2016-03-30 20:16 ` [PATCH RESEND v2 4/6] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
@ 2016-03-31 17:45   ` bsegall
  0 siblings, 0 replies; 10+ messages in thread
From: bsegall @ 2016-03-31 17:45 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, linux-kernel, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

Yuyang Du <yuyang.du@intel.com> writes:

> Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> down of load does not make much sense, because load_avg is primarily THE
> load and on top of that, we take runnable time into account.
>
> We therefore remove scale_load_down() for load_avg. But we need to
> carefully consider the overflow risk if load has higher range
> (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> to us is on 64bit kernel with increased load range. In that case,
> the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> entities with the highest load (=88761*1024) always runnable on one


This is true (ignoring that the cgroup max is a bit higher at 262144,
but that's only ~3x, hardly an issue), but I wondered when the uses of
load_avg in h_load and effective_load hit issues, under the assumption
that maybe the old old code used scale_load_down originally, so here's
the math:

effective_load multiplies load_avg by tg->shares, giving a maximum of
(2^18)*load_avg = (2^18)*((2^18)*1024) = 2^46 per child cgroup, which
allows 131072 child cgroups, or ~3x that in tasks. (Half of the normal
headroom because it is signed.)

h_load however is cfs_rq_load_avg * se->avg.load_avg, so we have an
extra factor of 1024, allowing only 256 max weight cgroups or 9x that in
tasks.

That said, when I checked the old code, it didn't use scale_load_down
back when h_load involved cfs_rq->load.weight * se->load.weight, so this
is only new in a recent sense.

TLDR: you're correct that load_avg hits any issues first, and it did
so before any of the new average computations were added.
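
A quick standalone cross-check of the bounds above (the 2^18 maximum
shares and the extra 2^10 factor are assumptions taken from this
discussion; only the arithmetic is checked, not the kernel code paths):

#include <stdio.h>

int main(void)
{
	/* Maxima as discussed: tg->shares = 2^18, load_avg = 2^18 * 1024. */
	const unsigned shares_bits   = 18;
	const unsigned load_avg_bits = 18 + 10;

	/*
	 * effective_load: shares * load_avg = 2^46 per child cgroup,
	 * with a signed 64-bit headroom of 2^63.
	 */
	unsigned per_cgroup_bits = shares_bits + load_avg_bits;	/* 46 */
	printf("effective_load: %llu child cgroups\n",
	       1ULL << (63 - per_cgroup_bits));			/* 131072 */

	/*
	 * h_load: cfs_rq_load_avg * se->avg.load_avg carries an extra
	 * factor of 1024 (2^10), with an unsigned headroom of 2^64.
	 */
	printf("h_load: %llu max-weight cgroups\n",
	       1ULL << (64 - (per_cgroup_bits + 10)));		/* 256 */
	return 0;
}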

> single cfs_rq, which may be an issue, but should be fine. Even if this
> occurs at the end of day, on the condition where it occurs, the
> load average will not be useful anyway.
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> [update calculate_imbalance]
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  include/linux/sched.h | 19 ++++++++++++++-----
>  kernel/sched/fair.c   | 19 +++++++++----------
>  2 files changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index db3c6e1..8df6d69 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1213,7 +1213,7 @@ struct load_weight {
>   *
>   * [load_avg definition]
>   *
> - * load_avg = runnable% * scale_load_down(load)
> + * load_avg = runnable% * load
>   *
>   * where runnable% is the time ratio that a sched_entity is runnable.
>   * For cfs_rq, it is the aggregated such load_avg of all runnable and
> @@ -1221,7 +1221,7 @@ struct load_weight {
>   *
>   * load_avg may also take frequency scaling into account:
>   *
> - * load_avg = runnable% * scale_load_down(load) * freq%
> + * load_avg = runnable% * load * freq%
>   *
>   * where freq% is the CPU frequency normalize to the highest frequency
>   *
> @@ -1247,9 +1247,18 @@ struct load_weight {
>   *
>   * [Overflow issue]
>   *
> - * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
> - * with the highest load (=88761) always runnable on a single cfs_rq, we
> - * should not overflow as the number already hits PID_MAX_LIMIT.
> + * On 64bit kernel:
> + *
> + * When load has small fixed point range (SCHED_FIXEDPOINT_SHIFT), the
> + * 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
> + * the highest load (=88761) always runnable on a cfs_rq, we should
> + * not overflow as the number already hits PID_MAX_LIMIT.
> + *
> + * When load has large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
> + * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
> + * with the highest load (=88761*1024) always runnable on ONE cfs_rq,
> + * we should be fine. Even if the overflow occurs at the end of day,
> + * at the time the load_avg won't be useful anyway in that situation.
>   *
>   * For all other cases (including 32bit kernel), struct load_weight's
>   * weight will overflow first before we do, because:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf835b5..da6642f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	sa->load_avg = se->load.weight;
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	sa->util_avg = SCHED_CAPACITY_SCALE;
>  	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> @@ -2837,7 +2837,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	}
>  
>  	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> -		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
> +		cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);
>  
>  #ifndef CONFIG_64BIT
>  	smp_wmb();
> @@ -2858,8 +2858,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  	 * Track task load average for carrying it to new CPU after migrated, and
>  	 * track group sched_entity load average for task_h_load calc in migration
>  	 */
> -	__update_load_avg(now, cpu, &se->avg,
> -			  se->on_rq * scale_load_down(se->load.weight),
> +	__update_load_avg(now, cpu, &se->avg, se->on_rq * se->load.weight,
>  			  cfs_rq->curr == se, NULL);
>  
>  	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> @@ -2896,7 +2895,7 @@ skip_aging:
>  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
> -			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
> +			  &se->avg, se->on_rq * se->load.weight,
>  			  cfs_rq->curr == se, NULL);
>  
>  	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> @@ -2916,7 +2915,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	migrated = !sa->last_update_time;
>  	if (!migrated) {
>  		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> -			se->on_rq * scale_load_down(se->load.weight),
> +			se->on_rq * se->load.weight,
>  			cfs_rq->curr == se, NULL);
>  	}
>  
> @@ -6876,10 +6875,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	 */
>  	if (busiest->group_type == group_overloaded &&
>  	    local->group_type   == group_overloaded) {
> -		load_above_capacity = busiest->sum_nr_running *
> -				      scale_load_down(NICE_0_LOAD);
> -		if (load_above_capacity > busiest->group_capacity)
> -			load_above_capacity -= busiest->group_capacity;
> +		load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
> +		if (load_above_capacity > scale_load(busiest->group_capacity))
> +			load_above_capacity -=
> +				scale_load(busiest->group_capacity);
>  		else
>  			load_above_capacity = ~0UL;
>  	}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND v2 6/6] sched/fair: Remove unconditionally inactive code
  2016-03-30 20:16 ` [PATCH RESEND v2 6/6] sched/fair: Remove unconditionally inactive code Yuyang Du
@ 2016-03-31 17:53   ` bsegall
  2016-03-31 23:20     ` Yuyang Du
  0 siblings, 1 reply; 10+ messages in thread
From: bsegall @ 2016-03-31 17:53 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, linux-kernel, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

Yuyang Du <yuyang.du@intel.com> writes:

> The increased load resolution (fixed point arithmetic range) is
> unconditionally deactivated with #if 0, so it is effectively broken.
>
> But the increased load range is still used somewhere (e.g., in Google),
> so we keep this feature. The reconciliation is we define
> CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> 64BIT and BROKEN.
>
> Suggested-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>

The title of this patch "Remove unconditionally inactive code" is
misleading since it's more like giving it a CONFIG.

Also as a side note, does anyone remember/have a test for whatever got
it turned off to begin with, given all the changes in load tracking and
the load balancer and everything else?

> ---
>  init/Kconfig         | 16 +++++++++++++++
>  kernel/sched/sched.h | 55 +++++++++++++++++++++-------------------------------
>  2 files changed, 38 insertions(+), 33 deletions(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 2232080..d072c09 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1025,6 +1025,22 @@ config CFS_BANDWIDTH
>  	  restriction.
>  	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
>  
> +config CFS_INCREASE_LOAD_RANGE
> +	bool "Increase kernel load range"
> +	depends on 64BIT && BROKEN
> +	default n
> +	help
> +	  Increase resolution of nice-level calculations for 64-bit architectures.
> +	  The extra resolution improves shares distribution and load balancing of
> +	  low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
> +	  hierarchies, especially on larger systems. This is not a user-visible change
> +	  and does not change the user-interface for setting shares/weights.
> +	  We increase resolution only if we have enough bits to allow this increased
> +	  resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
> +	  when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
> +	  increased costs.
> +	  Currently broken: it increases power usage under light load.
> +
>  config RT_GROUP_SCHED
>  	bool "Group scheduling for SCHED_RR/FIFO"
>  	depends on CGROUP_SCHED
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ebe16e3..1bb0d69 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -42,39 +42,6 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>  #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
>  
>  /*
> - * Increase resolution of nice-level calculations for 64-bit architectures.
> - * The extra resolution improves shares distribution and load balancing of
> - * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
> - * hierarchies, especially on larger systems. This is not a user-visible change
> - * and does not change the user-interface for setting shares/weights.
> - *
> - * We increase resolution only if we have enough bits to allow this increased
> - * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
> - * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
> - * increased costs.
> - */
> -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> -# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
> -# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
> -# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
> -#else
> -# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
> -# define user_to_kernel_load(w)	(w)
> -# define kernel_to_user_load(w)	(w)
> -#endif
> -
> -/*
> - * Task weight (visible to user) and its load (invisible to user) have
> - * independent resolution, but they should be well calibrated. We use
> - * user_to_kernel_load() and kernel_to_user_load(w) to convert between
> - * them. The following must be true:
> - *
> - * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
> - * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
> - */
> -#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
> -
> -/*
>   * Single value that decides SCHED_DEADLINE internal math precision.
>   * 10 -> just above 1us
>   * 9  -> just above 0.5us
> @@ -1150,6 +1117,28 @@ extern const int sched_prio_to_weight[40];
>  extern const u32 sched_prio_to_wmult[40];
>  
>  /*
> + * Task weight (visible to user) and its load (invisible to user) have
> + * independent ranges, but they should be well calibrated. We use
> + * user_to_kernel_load() and kernel_to_user_load(w) to convert between
> + * them.
> + *
> + * The following must also be true:
> + * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
> + * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
> + */
> +#ifdef CONFIG_CFS_INCREASE_LOAD_RANGE
> +#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
> +#define user_to_kernel_load(w)	(w << SCHED_FIXEDPOINT_SHIFT)
> +#define kernel_to_user_load(w)	(w >> SCHED_FIXEDPOINT_SHIFT)
> +#else
> +#define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
> +#define user_to_kernel_load(w)	(w)
> +#define kernel_to_user_load(w)	(w)
> +#endif
> +
> +#define NICE_0_LOAD		(1UL << NICE_0_LOAD_SHIFT)
> +
> +/*
>   * {de,en}queue flags:
>   *
>   * DEQUEUE_SLEEP  - task is no longer runnable

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND v2 6/6] sched/fair: Remove unconditionally inactive code
  2016-03-31 17:53   ` bsegall
@ 2016-03-31 23:20     ` Yuyang Du
  0 siblings, 0 replies; 10+ messages in thread
From: Yuyang Du @ 2016-03-31 23:20 UTC (permalink / raw)
  To: bsegall
  Cc: peterz, mingo, linux-kernel, pjt, morten.rasmussen,
	vincent.guittot, dietmar.eggemann, lizefan, umgwanakikbuti

On Thu, Mar 31, 2016 at 10:53:02AM -0700, bsegall@google.com wrote:
> Yuyang Du <yuyang.du@intel.com> writes:
> 
> > The increased load resolution (fixed point arithmetic range) is
> > unconditionally deactivated with #if 0, so it is effectively broken.
> >
> > But the increased load range is still used somewhere (e.g., in Google),
> > so we keep this feature. The reconciliation is we define
> > CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> > 64BIT and BROKEN.
> >
> > Suggested-by: Ingo Molnar <mingo@kernel.org>
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> 
> The title of this patch "Remove unconditionally inactive code" is
> misleading since it's more like giving it a CONFIG.

Reasonable argument.

> 
> Also as a side note, does anyone remember/have a test for whatever got
> it turned off to begin with, given all the changes in load tracking and
> the load balancer and everything else?

It is this commit that turned it off:

Commit e4c2fb0d5776: "sched: Disable (revert) SCHED_LOAD_SCALE increase"

^ permalink raw reply	[flat|nested] 10+ messages in thread

