* [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again)..
@ 2017-05-12 16:44 Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 01/14] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz


Hi all,

So after staring at all that PELT stuff and working my way through it again:

  https://lkml.kernel.org/r/20170505154117.6zldxuki2fgyo53n@hirez.programming.kicks-ass.net

I started doing some patches to fix some of the identified breakage.

So here are a few too many patches that do:

 - fix 'reweight_entity' to instantly propagate the change in se->load.weight.

 - rewrite/fix the propagate on migrate (attach/detach)

 - introduce the hierarchical runnable_load_avg, as proposed by Tejun.

 - synchronous detach for runnable migrates

 - align the PELT windows between a cfs_rq and all its se's

 - deal with random fallout from the above (some of this needs folding back
   and reordering, but it's all well past the point where I should post this anyway).


IIRC pjt recently mentioned the reweight_entity thing, and I have very vague
memories of him once talking about the window alignment thing -- which I only
remembered after (very painfully) discovering I really needed it.

In any case, the reason I did the reweight_entity thing first is that I feel
it is the right place to also propagate the hierarchical runnable_load, as
that is the natural place where a group's cfs_rq is coupled to its
sched_entity.

And the hierarchical runnable_load needs that coupling. TJ did it by hijacking
the attach/detach migrate code, which I didn't much like.  In any case, all
that got me looking at said attach/detach migrate code and finding pain. So I
went and fixed that too.


Many thanks to Vincent and Dietmar for poking at early versions and reporting
failures and comments.

This still hasn't had a lot of testing, but it's not obviously insane anymore
in the few tests we did run on it. Tread carefully though; a lot of code changed.


This can also be found here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/experimental


(my git tree is unstable and gets rebased a _lot_)

---
 include/linux/sched.h |    3 
 kernel/sched/debug.c  |   18 -
 kernel/sched/fair.c   |  811 +++++++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h  |   19 -
 4 files changed, 547 insertions(+), 304 deletions(-)


* [RFC][PATCH 01/14] sched/fair: Clean up calc_cfs_shares()
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-16  8:02   ` Vincent Guittot
  2017-05-12 16:44 ` [RFC][PATCH 02/14] sched/fair: Add comment to calc_cfs_shares() Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-cleanup-update_cfs_shares.patch --]
[-- Type: text/plain, Size: 2179 bytes --]

For consistency's sake, we should have only a single reading of
tg->shares.
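
In short: take one snapshot of tg->shares and use it for both the scaling
and the clamp, so a concurrent write cannot be observed half-way through
the computation. Condensed from the diff below:

  long tg_shares = READ_ONCE(tg->shares);	/* single read */

  shares = (tg_shares * load);
  if (tg_weight)
          shares /= tg_weight;

  /* clamp against the same snapshot, not a second read */
  return clamp_t(long, shares, MIN_SHARES, tg_shares);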

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2633,9 +2633,12 @@ account_entity_dequeue(struct cfs_rq *cf
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static long calc_cfs_shares(struct cfs_rq *cfs_rq)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
+	struct task_group *tg = cfs_rq->tg;
+
+	tg_shares = READ_ONCE(tg->shares);
 
 	/*
 	 * This really should be: cfs_rq->avg.load_avg, but instead we use
@@ -2650,7 +2653,7 @@ static long calc_cfs_shares(struct cfs_r
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
 	tg_weight += load;
 
-	shares = (tg->shares * load);
+	shares = (tg_shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
@@ -2666,17 +2669,7 @@ static long calc_cfs_shares(struct cfs_r
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
-}
-# else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
-{
-	return tg->shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # endif /* CONFIG_SMP */
 
@@ -2701,7 +2694,6 @@ static inline int throttled_hierarchy(st
 static void update_cfs_shares(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-	struct task_group *tg;
 	long shares;
 
 	if (!cfs_rq)
@@ -2710,13 +2702,14 @@ static void update_cfs_shares(struct sch
 	if (throttled_hierarchy(cfs_rq))
 		return;
 
-	tg = cfs_rq->tg;
-
 #ifndef CONFIG_SMP
-	if (likely(se->load.weight == tg->shares))
+	shares = READ_ONCE(cfs_rq->tg->shares);
+
+	if (likely(se->load.weight == shares))
 		return;
+#else
+	shares = calc_cfs_shares(cfs_rq);
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }


* [RFC][PATCH 02/14] sched/fair: Add comment to calc_cfs_shares()
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 01/14] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-comment-calc_cfs_shares.patch --]
[-- Type: text/plain, Size: 2859 bytes --]

Explain the magic equation in calc_cfs_shares() a bit better.
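
As a quick sanity check of the equations below (numbers purely
illustrative), take tg->shares = 1024 on a two-CPU box:

 - Boundary case: the group was idle and one task just woke on CPU0, so
   grq->load.weight = 1024 while grq->avg.load_avg and tg->load_avg are
   still ~0. Equation (5) gives 1024*1024 / (0 - 0 + 1024) = 1024, i.e.
   the full tg->weight of the UP result (4), whereas (3) would still
   yield ~0 because the average has not built up yet.

 - Steady state: both CPUs run one task each, grq->avg.load_avg = 1024
   on each and tg->load_avg = 2048. For CPU0, (5) gives
   1024*1024 / (2048 - 1024 + 1024) = 512, the same as (3).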

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2633,6 +2633,67 @@ account_entity_dequeue(struct cfs_rq *cf
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
+/*
+ * All this does is approximate the hierarchical proportion which includes that
+ * global sum we all love to hate.
+ *
+ * That is, the weight of a group entity, is the proportional share of the
+ * group weight based on the group runqueue weights. That is:
+ *
+ *                     tg->weight * grq->load.weight
+ *   ge->load.weight = -----------------------------               (1)
+ *       		  \Sum grq->load.weight
+ *
+ * Now, because computing that sum is prohibitively expensive (been
+ * there, done that) we approximate it with this average stuff. The average
+ * moves slower and therefore the approximation is cheaper and more stable.
+ *
+ * So instead of the above, we substitute:
+ *
+ *   grq->load.weight -> grq->avg.load_avg                         (2)
+ *
+ * which yields the following:
+ *
+ *                     tg->weight * grq->avg.load_avg
+ *   ge->load.weight = ------------------------------              (3)
+ *       		      tg->load_avg
+ *
+ * Where: tg->load_avg ~= \Sum grq->avg.load_avg
+ *
+ * That is shares_avg, and it is right (given the approximation (2)).
+ *
+ * The problem with it is that because the average is slow -- it was designed
+ * to be exactly that of course -- this leads to transients in boundary
+ * conditions. Specifically, the case where the group was idle and we start
+ * one task. It takes time for our CPU's grq->avg.load_avg to build up,
+ * yielding bad latency etc..
+ *
+ * Now, in that special case (1) reduces to:
+ *
+ *                     tg->weight * grq->load.weight
+ *   ge->load.weight = ----------------------------- = tg->weight  (4)
+ *       		  grq->load.weight
+ *
+ * That is, the sum collapses because all other CPUs are idle; the UP scenario.
+ *
+ * So what we do is modify our approximation (3) to approach (4) in the (near)
+ * UP case, like:
+ *
+ *   ge->load.weight =
+ *
+ *              tg->weight * grq->load.weight
+ *     ---------------------------------------------------         (5)
+ *     tg->load_avg - grq->avg.load_avg + grq->load.weight
+ *
+ *
+ * And that is shares_weight and is icky. In the (near) UP case it approaches
+ * (4) while in the normal case it approaches (3). It consistently
+ * overestimates the ge->load.weight and therefore:
+ *
+ *   \Sum ge->load.weight >= tg->weight
+ *
+ * hence icky!
+ */
 static long calc_cfs_shares(struct cfs_rq *cfs_rq)
 {
 	long tg_weight, tg_shares, load, shares;


* [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 01/14] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 02/14] sched/fair: Add comment to calc_cfs_shares() Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-17  7:04   ` Vincent Guittot
  2017-05-12 16:44 ` [RFC][PATCH 04/14] sched/fair: More accurate reweight_entity() Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-unweight-entity.patch --]
[-- Type: text/plain, Size: 6482 bytes --]

Remove the load from the load_sum for sched_entities, basically
turning load_sum into runnable_sum.  This prepares for better
reweighting of group entities.

Since we now have different rules for computing load_avg, split
___update_load_avg() into two parts, ___update_load_sum() and
___update_load_avg().

So for se:

  ___update_load_sum(.weight = 1)
  ___update_load_avg(.weight = se->load.weight)

and for cfs_rq:

  ___update_load_sum(.weight = cfs_rq->load.weight)
  ___update_load_avg(.weight = 1)

Since the primary consumable is load_avg, most things will not be
affected. Only those few sites that initialize/modify load_sum need
attention.
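
Roughly, the invariant this establishes (with divider standing for
LOAD_AVG_MAX - 1024 + sa->period_contrib) is:

  /* sched_entity: load_sum is pure runnable time */
  se->avg.load_avg     = se_weight(se) * se->avg.load_sum / divider;

  /* cfs_rq: the weights are already folded into load_sum */
  cfs_rq->avg.load_avg = cfs_rq->avg.load_sum / divider;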

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   91 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 64 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -744,7 +744,7 @@ void init_entity_runnable_average(struct
 	 */
 	if (entity_is_task(se))
 		sa->load_avg = scale_load_down(se->load.weight);
-	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+	sa->load_sum = LOAD_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
@@ -1967,7 +1967,7 @@ static u64 numa_get_avg_runtime(struct t
 		delta = runtime - p->last_sum_exec_runtime;
 		*period = now - p->last_task_numa_placement;
 	} else {
-		delta = p->se.avg.load_sum / p->se.load.weight;
+		delta = p->se.avg.load_sum;
 		*period = LOAD_AVG_MAX;
 	}
 
@@ -2872,8 +2872,8 @@ accumulate_sum(u64 delta, int cpu, struc
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
+___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta;
 
@@ -2907,39 +2907,80 @@ ___update_load_avg(u64 now, int cpu, str
 	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
 		return 0;
 
+	return 1;
+}
+
+static __always_inline void
+___update_load_avg(struct sched_avg *sa, unsigned long weight, struct cfs_rq *cfs_rq)
+{
+	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
+
 	/*
 	 * Step 2: update *_avg.
 	 */
-	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
+	sa->load_avg = div_u64(weight * sa->load_sum, divider);
 	if (cfs_rq) {
 		cfs_rq->runnable_load_avg =
-			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
+			div_u64(cfs_rq->runnable_load_sum, divider);
 	}
-	sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);
+	sa->util_avg = sa->util_sum / divider;
+}
 
-	return 1;
+/*
+ * XXX we want to get rid of this helper and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+	return scale_load_down(se->load.weight);
 }
 
+/*
+ * sched_entity:
+ *
+ *   load_sum := runnable_sum
+ *   load_avg = se_weight(se) * runnable_avg
+ *
+ * cfs_rq:
+ *
+ *   load_sum = \Sum se_weight(se) * se->avg.load_sum
+ *   load_avg = \Sum se->avg.load_avg
+ */
+
 static int
 __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
 {
-	return ___update_load_avg(now, cpu, &se->avg, 0, 0, NULL);
+	if (___update_load_sum(now, cpu, &se->avg, 0, 0, NULL)) {
+		___update_load_avg(&se->avg, se_weight(se), NULL);
+		return 1;
+	}
+
+	return 0;
 }
 
 static int
 __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return ___update_load_avg(now, cpu, &se->avg,
-				  se->on_rq * scale_load_down(se->load.weight),
-				  cfs_rq->curr == se, NULL);
+	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq,
+				cfs_rq->curr == se, NULL)) {
+
+		___update_load_avg(&se->avg, se_weight(se), NULL);
+		return 1;
+	}
+
+	return 0;
 }
 
 static int
 __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 {
-	return ___update_load_avg(now, cpu, &cfs_rq->avg,
-			scale_load_down(cfs_rq->load.weight),
-			cfs_rq->curr != NULL, cfs_rq);
+	if (___update_load_sum(now, cpu, &cfs_rq->avg,
+				scale_load_down(cfs_rq->load.weight),
+				cfs_rq->curr != NULL, cfs_rq)) {
+		___update_load_avg(&cfs_rq->avg, 1, cfs_rq);
+		return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -3110,7 +3151,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
 
 	/* Set new sched_entity's load */
 	se->avg.load_avg = load;
-	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
+	se->avg.load_sum = LOAD_AVG_MAX;
 
 	/* Update parent cfs_rq load */
 	add_positive(&cfs_rq->avg.load_avg, delta);
@@ -3340,7 +3381,7 @@ static void attach_entity_load_avg(struc
 {
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
-	cfs_rq->avg.load_sum += se->avg.load_sum;
+	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
 	set_tg_cfs_propagate(cfs_rq);
@@ -3360,7 +3401,7 @@ static void detach_entity_load_avg(struc
 {
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
-	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 	set_tg_cfs_propagate(cfs_rq);
@@ -3372,12 +3413,10 @@ static void detach_entity_load_avg(struc
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct sched_avg *sa = &se->avg;
-
-	cfs_rq->runnable_load_avg += sa->load_avg;
-	cfs_rq->runnable_load_sum += sa->load_sum;
+	cfs_rq->runnable_load_avg += se->avg.load_avg;
+	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
 
-	if (!sa->last_update_time) {
+	if (!se->avg.last_update_time) {
 		attach_entity_load_avg(cfs_rq, se);
 		update_tg_load_avg(cfs_rq, 0);
 	}
@@ -3387,10 +3426,8 @@ enqueue_entity_load_avg(struct cfs_rq *c
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_load_avg =
-		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
-	cfs_rq->runnable_load_sum =
-		max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
+	sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
 }
 
 #ifndef CONFIG_64BIT


* [RFC][PATCH 04/14] sched/fair: More accurate reweight_entity()
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 05/14] sched/fair: Change update_load_avg() arguments Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fix-reweight_entity.patch --]
[-- Type: text/plain, Size: 4183 bytes --]

When a (group) entity changes its weight we should instantly change
its load_avg and propagate that change into the sums it is part of,
because we use these values to predict future behaviour and are not
interested in their historical value.

Without this change, the change in load would need to propagate
through the average, by which time it could again have changed etc..
always chasing itself.

With this change, the cfs_rq load_avg sum will more accurately reflect
the current runnable and expected return of blocked load.
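
Schematically, what reweight_entity() now does to the PELT sums
(condensed from the diff below; divider is
LOAD_AVG_MAX - 1024 + se->avg.period_contrib):

  old_load_avg = se->avg.load_avg;
  old_load_sum = se_weight(se) * se->avg.load_sum;

  /* recompute the entity's contribution with the new weight ... */
  se->avg.load_avg = scale_load_down(weight) * se->avg.load_sum / divider;
  new_load_sum     = scale_load_down(weight) * se->avg.load_sum;

  /* ... and feed the delta straight into the cfs_rq sums */
  add_positive(&cfs_rq->avg.load_avg, se->avg.load_avg - old_load_avg);
  add_positive(&cfs_rq->avg.load_sum, new_load_sum - old_load_sum);

(plus the same deltas for runnable_load_{avg,sum} while the entity is
on the runqueue).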

Reported-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   75 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 29 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2673,9 +2673,41 @@ static long calc_cfs_shares(struct cfs_r
 }
 # endif /* CONFIG_SMP */
 
+/*
+ * Signed add and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define add_positive(_ptr, _val) do {                           \
+	typeof(_ptr) ptr = (_ptr);                              \
+	typeof(_val) val = (_val);                              \
+	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
+								\
+	res = var + val;                                        \
+								\
+	if (val < 0 && res > var)                               \
+		res = 0;                                        \
+								\
+	WRITE_ONCE(*ptr, res);                                  \
+} while (0)
+
+/*
+ * XXX we want to get rid of this helper and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+	return scale_load_down(se->load.weight);
+}
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
+	unsigned long se_load_avg = se->avg.load_avg;
+	u64 se_load_sum = se_weight(se) * se->avg.load_sum;
+	u64 new_load_sum = scale_load_down(weight) * se->avg.load_sum;
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
@@ -2683,10 +2715,23 @@ static void reweight_entity(struct cfs_r
 		account_entity_dequeue(cfs_rq, se);
 	}
 
+	se->avg.load_avg = div_u64(new_load_sum,
+			LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
+
 	update_load_set(&se->load, weight);
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		account_entity_enqueue(cfs_rq, se);
+		add_positive(&cfs_rq->runnable_load_avg,
+				(long)(se->avg.load_avg - se_load_avg));
+		add_positive(&cfs_rq->runnable_load_sum,
+				(s64)(new_load_sum - se_load_sum));
+	}
+
+	add_positive(&cfs_rq->avg.load_avg,
+			(long)(se->avg.load_avg - se_load_avg));
+	add_positive(&cfs_rq->avg.load_sum,
+			(s64)(new_load_sum - se_load_sum));
 }
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -2927,14 +2972,6 @@ ___update_load_avg(struct sched_avg *sa,
 }
 
 /*
- * XXX we want to get rid of this helper and use the full load resolution.
- */
-static inline long se_weight(struct sched_entity *se)
-{
-	return scale_load_down(se->load.weight);
-}
-
-/*
  * sched_entity:
  *
  *   load_sum := runnable_sum
@@ -2983,26 +3020,6 @@ __update_load_avg_cfs_rq(u64 now, int cp
 	return 0;
 }
 
-/*
- * Signed add and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define add_positive(_ptr, _val) do {                           \
-	typeof(_ptr) ptr = (_ptr);                              \
-	typeof(_val) val = (_val);                              \
-	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
-								\
-	res = var + val;                                        \
-								\
-	if (val < 0 && res > var)                               \
-		res = 0;                                        \
-								\
-	WRITE_ONCE(*ptr, res);                                  \
-} while (0)
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /**
  * update_tg_load_avg - update the tg's load avg


* [RFC][PATCH 05/14] sched/fair: Change update_load_avg() arguments
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 04/14] sched/fair: More accurate reweight_entity() Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-17 10:46   ` Vincent Guittot
  2017-05-12 16:44 ` [RFC][PATCH 06/14] sched/fair: Move enqueue migrate handling Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-update_load_avg-args.patch --]
[-- Type: text/plain, Size: 4657 bytes --]

Most call sites of update_load_avg() already have cfs_rq_of(se)
available; pass it down instead of recomputing it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3349,9 +3349,8 @@ update_cfs_rq_load_avg(u64 now, struct c
 #define SKIP_AGE_LOAD	0x2
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int flags)
+static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
@@ -3512,9 +3511,9 @@ update_cfs_rq_load_avg(u64 now, struct c
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 
-static inline void update_load_avg(struct sched_entity *se, int not_used1)
+static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
-	cpufreq_update_util(rq_of(cfs_rq_of(se)), 0);
+	cpufreq_update_util(rq_of(cfs_rq), 0);
 }
 
 static inline void
@@ -3665,7 +3664,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
-	update_load_avg(se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, UPDATE_TG);
 	enqueue_entity_load_avg(cfs_rq, se);
 	update_cfs_shares(se);
 	account_entity_enqueue(cfs_rq, se);
@@ -3749,7 +3748,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
-	update_load_avg(se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, UPDATE_TG);
 	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
@@ -3837,7 +3836,7 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3939,7 +3938,7 @@ static void put_prev_entity(struct cfs_r
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_load_avg(cfs_rq, prev, 0);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3955,7 +3954,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, UPDATE_TG);
+	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_shares(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4873,7 +4872,7 @@ enqueue_task_fair(struct rq *rq, struct
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_shares(se);
 	}
 
@@ -4932,7 +4931,7 @@ static void dequeue_task_fair(struct rq
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_shares(se);
 	}
 
@@ -7030,7 +7029,7 @@ static void update_blocked_averages(int
 		/* Propagate pending load changes to the parent, if any: */
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
-			update_load_avg(se, 0);
+			update_load_avg(cfs_rq_of(se), se, 0);
 
 		/*
 		 * There can be a lot of idle CPU cgroups.  Don't let fully
@@ -9159,7 +9158,7 @@ static void propagate_entity_cfs_rq(stru
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 	}
 }
 #else
@@ -9171,7 +9170,7 @@ static void detach_entity_cfs_rq(struct
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_load_avg(se, 0);
+	update_load_avg(cfs_rq, se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 	propagate_entity_cfs_rq(se);
@@ -9190,7 +9189,7 @@ static void attach_entity_cfs_rq(struct
 #endif
 
 	/* Synchronize entity with its cfs_rq */
-	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
+	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 	propagate_entity_cfs_rq(se);
@@ -9474,7 +9473,7 @@ int sched_group_set_shares(struct task_g
 		rq_lock_irqsave(rq, &rf);
 		update_rq_clock(rq);
 		for_each_sched_entity(se) {
-			update_load_avg(se, UPDATE_TG);
+			update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
 			update_cfs_shares(se);
 		}
 		rq_unlock_irqrestore(rq, &rf);


* [RFC][PATCH 06/14] sched/fair: Move enqueue migrate handling
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 05/14] sched/fair: Change update_load_avg() arguments Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-29 13:41   ` Vincent Guittot
  2017-05-12 16:44 ` [RFC][PATCH 07/14] sched/fair: Rewrite cfs_rq->removed_*avg Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-pull-migrate-into-update_load_avg.patch --]
[-- Type: text/plain, Size: 4086 bytes --]

Move the entity migrate handling from enqueue_entity_load_avg() to
update_load_avg(). This has two benefits:

 - {en,de}queue_entity_load_avg() will become purely about managing
   runnable_load

 - we can avoid a double update_tg_load_avg() and reduce pressure on
   the global tg->shares cacheline

The reason we do this is so that we can change update_cfs_shares() to
change both weight and (future) runnable_weight. For this to work we
need to have the cfs_rq averages up-to-date (which means having done
the attach), but we need the cfs_rq->avg.runnable_avg to not yet
include the se's contribution (since se->on_rq == 0).
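
The resulting enqueue_entity() ordering (see the diff below) then is:

 1) update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH) -- ages the
    cfs_rq averages and, for a migrating entity
    (!se->avg.last_update_time), does the attach;
 2) enqueue_entity_load_avg(cfs_rq, se) -- now purely adds the
    runnable contribution;
 3) update_cfs_shares(se) -- sees up-to-date cfs_rq averages while
    se->on_rq is still 0;
 4) account_entity_enqueue(cfs_rq, se).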

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   70 ++++++++++++++++++++++++++--------------------------
 1 file changed, 36 insertions(+), 34 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3418,34 +3418,6 @@ update_cfs_rq_load_avg(u64 now, struct c
 	return decayed || removed_load;
 }
 
-/*
- * Optional action to be done while updating the load average
- */
-#define UPDATE_TG	0x1
-#define SKIP_AGE_LOAD	0x2
-
-/* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
-{
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	struct rq *rq = rq_of(cfs_rq);
-	int cpu = cpu_of(rq);
-	int decayed;
-
-	/*
-	 * Track task load average for carrying it to new CPU after migrated, and
-	 * track group sched_entity load average for task_h_load calc in migration
-	 */
-	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
-		__update_load_avg_se(now, cpu, cfs_rq, se);
-
-	decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
-	decayed |= propagate_entity_load_avg(se);
-
-	if (decayed && (flags & UPDATE_TG))
-		update_tg_load_avg(cfs_rq, 0);
-}
-
 /**
  * attach_entity_load_avg - attach this entity to its cfs_rq load avg
  * @cfs_rq: cfs_rq to attach to
@@ -3486,17 +3458,46 @@ static void detach_entity_load_avg(struc
 	cfs_rq_util_change(cfs_rq);
 }
 
+/*
+ * Optional action to be done while updating the load average
+ */
+#define UPDATE_TG	0x1
+#define SKIP_AGE_LOAD	0x2
+#define DO_ATTACH	0x4
+
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	u64 now = cfs_rq_clock_task(cfs_rq);
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
+	int decayed;
+
+	/*
+	 * Track task load average for carrying it to new CPU after migrated, and
+	 * track group sched_entity load average for task_h_load calc in migration
+	 */
+	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+		__update_load_avg_se(now, cpu, cfs_rq, se);
+
+	decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
+	decayed |= propagate_entity_load_avg(se);
+
+	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
+
+		attach_entity_load_avg(cfs_rq, se);
+		update_tg_load_avg(cfs_rq, 0);
+
+	} else if (decayed && (flags & UPDATE_TG))
+		update_tg_load_avg(cfs_rq, 0);
+}
+
 /* Add the load generated by se into cfs_rq's load average */
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	cfs_rq->runnable_load_avg += se->avg.load_avg;
 	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
-
-	if (!se->avg.last_update_time) {
-		attach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq, 0);
-	}
 }
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
@@ -3586,6 +3587,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
+#define DO_ATTACH	0x0
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
@@ -3740,7 +3742,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	enqueue_entity_load_avg(cfs_rq, se);
 	update_cfs_shares(se);
 	account_entity_enqueue(cfs_rq, se);


* [RFC][PATCH 07/14] sched/fair: Rewrite cfs_rq->removed_*avg
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 06/14] sched/fair: Move enqueue migrate handling Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 08/14] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-update_tg_cfs_load.patch --]
[-- Type: text/plain, Size: 5351 bytes --]

Since on wakeup migration we don't hold the rq->lock for the old CPU
we cannot update its state. Instead we add the removed 'load' to an
atomic variable and have the next update on that CPU collect and
process it.

Currently we have two atomic variables, which already have the issue
that they can be read out of sync. Also, two atomic ops on a single
cacheline are already more expensive than an uncontended lock.

Since we want to add more, convert the thing over to an explicit
cacheline with a lock in it.
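
Condensed, the new scheme (both halves taken from the diff below):

  /* producer: wakeup migration, old rq->lock not held */
  raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
  ++cfs_rq->removed.nr;
  cfs_rq->removed.util_avg += se->avg.util_avg;
  cfs_rq->removed.load_avg += se->avg.load_avg;
  raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);

  /* consumer: the next update on that CPU collects it all in one go */
  raw_spin_lock(&cfs_rq->removed.lock);
  swap(cfs_rq->removed.util_avg, removed_util);
  swap(cfs_rq->removed.load_avg, removed_load);
  cfs_rq->removed.nr = 0;
  raw_spin_unlock(&cfs_rq->removed.lock);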

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    8 ++++----
 kernel/sched/fair.c  |   51 +++++++++++++++++++++++++++++++++++----------------
 kernel/sched/sched.h |   13 +++++++++----
 3 files changed, 48 insertions(+), 24 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -519,10 +519,10 @@ void print_cfs_rq(struct seq_file *m, in
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed_load_avg",
-			atomic_long_read(&cfs_rq->removed_load_avg));
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed_util_avg",
-			atomic_long_read(&cfs_rq->removed_util_avg));
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
+			cfs_rq->removed.load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
+			cfs_rq->removed.util_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3386,36 +3386,47 @@ static inline void cfs_rq_util_change(st
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
+	unsigned long removed_load = 0, removed_util = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
-	int decayed, removed_load = 0, removed_util = 0;
+	int decayed = 0;
 
-	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
-		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
+	if (cfs_rq->removed.nr) {
+		unsigned long r;
+
+		raw_spin_lock(&cfs_rq->removed.lock);
+		swap(cfs_rq->removed.util_avg, removed_util);
+		swap(cfs_rq->removed.load_avg, removed_load);
+		cfs_rq->removed.nr = 0;
+		raw_spin_unlock(&cfs_rq->removed.lock);
+
+		/*
+		 * The LOAD_AVG_MAX for _sum is a slight over-estimate,
+		 * which is safe due to sub_positive() clipping at 0.
+		 */
+		r = removed_load;
 		sub_positive(&sa->load_avg, r);
 		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
-		removed_load = 1;
-		set_tg_cfs_propagate(cfs_rq);
-	}
 
-	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
-		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
+		r = removed_util;
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
-		removed_util = 1;
+
 		set_tg_cfs_propagate(cfs_rq);
+
+		decayed = 1;
 	}
 
-	decayed = __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);
+	decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (update_freq && (decayed || removed_util))
+	if (update_freq && decayed)
 		cfs_rq_util_change(cfs_rq);
 
-	return decayed || removed_load;
+	return decayed;
 }
 
 /**
@@ -3548,6 +3559,7 @@ void sync_entity_load_avg(struct sched_e
 void remove_entity_load_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	unsigned long flags;
 
 	/*
 	 * tasks cannot exit without having gone through wake_up_new_task() ->
@@ -3557,11 +3569,19 @@ void remove_entity_load_avg(struct sched
 	 * Similarly for groups, they will have passed through
 	 * post_init_entity_util_avg() before unregister_sched_fair_group()
 	 * calls this.
+	 *
+	 * XXX in case entity_is_task(se) && task_of(se)->on_rq == MIGRATING
+	 * we could actually get the right time, since we're called with
+	 * rq->lock held, see detach_task().
 	 */
 
 	sync_entity_load_avg(se);
-	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
-	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
+
+	raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
+	++cfs_rq->removed.nr;
+	cfs_rq->removed.util_avg	+= se->avg.util_avg;
+	cfs_rq->removed.load_avg	+= se->avg.load_avg;
+	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
@@ -9350,8 +9370,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	cfs_rq->propagate_avg = 0;
 #endif
-	atomic_long_set(&cfs_rq->removed_load_avg, 0);
-	atomic_long_set(&cfs_rq->removed_util_avg, 0);
+	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -424,14 +424,19 @@ struct cfs_rq {
 	struct sched_avg avg;
 	u64 runnable_load_sum;
 	unsigned long runnable_load_avg;
+#ifndef CONFIG_64BIT
+	u64 load_last_update_time_copy;
+#endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
 	unsigned long propagate_avg;
 #endif
-	atomic_long_t removed_load_avg, removed_util_avg;
-#ifndef CONFIG_64BIT
-	u64 load_last_update_time_copy;
-#endif
+	struct {
+		raw_spinlock_t	lock ____cacheline_aligned;
+		int		nr;
+		unsigned long	load_avg;
+		unsigned long	util_avg;
+	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*


* [RFC][PATCH 08/14] sched/fair: Rewrite PELT migration propagation
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 07/14] sched/fair: Rewrite cfs_rq->removed_*avg Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 09/14] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-update_tg_cfs_load-1.patch --]
[-- Type: text/plain, Size: 11620 bytes --]

When an entity migrates into (or out of) a runqueue, we need to add
its contribution to (or remove it from) the entire PELT hierarchy,
because even non-runnable entities are included in the load average
sums.

In order to do this we have some propagation logic that updates the
PELT tree, however the way it 'propagates' the runnable (or load)
change is (more or less):

                     tg->weight * grq->avg.load_avg
  ge->avg.load_avg = ------------------------------
                               tg->load_avg

But that is the expression for ge->weight, and per the definition of
load_avg:

  ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

That destroys the runnable_avg we wanted to propagate (by effectively
setting it to 1).

Instead directly propagate runnable_sum.
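
In differential form (equation (5) in the comment added below), with
purely illustrative numbers: a task with se->load.weight == 1024 and
se->avg.load_avg == 512 (i.e. runnable_avg == 1/2) leaving a child
runqueue contributes

  d(runnable_avg)   = -se->avg.load_avg / se->load.weight = -1/2
  ge->avg.load_avg += ge->load.weight * d(runnable_avg)

so the group entity loses half its weight's worth of load_avg, instead
of having its runnable_avg effectively overwritten to 1 as the old
expression did.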

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    2 
 kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
 kernel/sched/sched.h |    9 +-
 3 files changed, 112 insertions(+), 85 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -523,6 +523,8 @@ void print_cfs_rq(struct seq_file *m, in
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
+			cfs_rq->removed.runnable_sum);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3163,11 +3163,77 @@ void set_task_rq_fair(struct sched_entit
 	se->avg.last_update_time = n_last_update_time;
 }
 
-/* Take into account change of utilization of a child task group */
+
+/*
+ * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
+ * propagate its contribution. The key to this propagation is the invariant
+ * that for each group:
+ *
+ *   ge->avg == grq->avg						(1)
+ *
+ * _IFF_ we look at the pure running and runnable sums. Because they
+ * represent the very same entity, just at different points in the hierarchy.
+ *
+ *
+ * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
+ * simply copies the running sum over.
+ *
+ * However, update_tg_cfs_runnable() is more complex. So we have:
+ *
+ *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
+ *
+ * And since, like util, the runnable part should be directly transferable,
+ * the following would _appear_ to be the straight forward approach:
+ *
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
+ *
+ * And per (1) we have:
+ *
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
+ *
+ * Which gives:
+ *
+ *                      ge->load.weight * grq->avg.load_avg
+ *   ge->avg.load_avg = -----------------------------------		(4)
+ *                               grq->load.weight
+ *
+ * Except that is wrong!
+ *
+ * Because while for entities historical weight is not important and we
+ * really only care about our future and therefore can consider a pure
+ * runnable sum, runqueues can NOT do this.
+ *
+ * We specifically want runqueues to have a load_avg that includes
+ * historical weights. Those represent the blocked load, the load we expect
+ * to (shortly) return to us. This only works by keeping the weights as
+ * integral part of the sum. We therefore cannot decompose as per (3).
+ *
+ * OK, so what then?
+ *
+ *
+ * Another way to look at things is:
+ *
+ *   grq->avg.load_avg = \Sum se->avg.load_avg
+ *
+ * Therefore, per (2):
+ *
+ *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
+ *
+ * And the very thing we're propagating is a change in that sum (someone
+ * joined/left). So we can easily know the runnable change, which would be, per
+ * (2) the already tracked se->load_avg divided by the corresponding
+ * se->weight.
+ *
+ * Basically (4) but in differential form:
+ *
+ *   d(runnable_avg) += se->avg.load_avg / se->load.weight
+ *								   (5)
+ *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
+ */
+
 static inline void
-update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
 
 	/* Nothing to update */
@@ -3183,102 +3249,59 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
-/* Take into account change of load of a child task group */
 static inline void
-update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
-
-	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
-	 */
-	if (load) {
-		long tg_load;
+	long runnable_sum = gcfs_rq->prop_runnable_sum;
+	long load_avg;
+	s64 load_sum;
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
-	}
+	if (!runnable_sum)
+		return;
 
-	delta = load - se->avg.load_avg;
+	gcfs_rq->prop_runnable_sum = 0;
 
-	/* Nothing to update */
-	if (!delta)
-		return;
+	load_sum = (s64)se_weight(se) * runnable_sum;
+	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	/* Set new sched_entity's load */
-	se->avg.load_avg = load;
-	se->avg.load_sum = LOAD_AVG_MAX;
+	add_positive(&se->avg.load_sum, runnable_sum);
+	add_positive(&se->avg.load_avg, load_avg);
 
-	/* Update parent cfs_rq load */
-	add_positive(&cfs_rq->avg.load_avg, delta);
-	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	add_positive(&cfs_rq->avg.load_avg, load_avg);
+	add_positive(&cfs_rq->avg.load_sum, load_sum);
 
-	/*
-	 * If the sched_entity is already enqueued, we also have to update the
-	 * runnable load avg.
-	 */
 	if (se->on_rq) {
-		/* Update parent cfs_rq runnable_load_avg */
-		add_positive(&cfs_rq->runnable_load_avg, delta);
-		cfs_rq->runnable_load_sum = cfs_rq->runnable_load_avg * LOAD_AVG_MAX;
+		add_positive(&cfs_rq->runnable_load_avg, load_avg);
+		add_positive(&cfs_rq->runnable_load_sum, load_sum);
 	}
 }
 
-static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
-{
-	cfs_rq->propagate_avg = 1;
-}
-
-static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se)
+static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
 {
-	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-
-	if (!cfs_rq->propagate_avg)
-		return 0;
-
-	cfs_rq->propagate_avg = 0;
-	return 1;
+	cfs_rq->propagate = 1;
+	cfs_rq->prop_runnable_sum += runnable_sum;
 }
 
 /* Update task and its cfs_rq load average */
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *gcfs_rq;
 
 	if (entity_is_task(se))
 		return 0;
 
-	if (!test_and_clear_tg_cfs_propagate(se))
+	gcfs_rq = group_cfs_rq(se);
+	if (!gcfs_rq->propagate)
 		return 0;
 
+	gcfs_rq->propagate = 0;
+
 	cfs_rq = cfs_rq_of(se);
 
-	set_tg_cfs_propagate(cfs_rq);
+	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
-	update_tg_cfs_util(cfs_rq, se);
-	update_tg_cfs_load(cfs_rq, se);
+	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
 
 	return 1;
 }
@@ -3302,7 +3325,7 @@ static inline bool skip_blocked_update(s
 	 * If there is a pending propagation, we have to update the load and
 	 * the utilization of the sched_entity:
 	 */
-	if (gcfs_rq->propagate_avg)
+	if (gcfs_rq->propagate)
 		return false;
 
 	/*
@@ -3322,7 +3345,7 @@ static inline int propagate_entity_load_
 	return 0;
 }
 
-static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {}
+static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -3386,7 +3409,7 @@ static inline void cfs_rq_util_change(st
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
-	unsigned long removed_load = 0, removed_util = 0;
+	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3396,6 +3419,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
+		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3411,7 +3435,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 
-		set_tg_cfs_propagate(cfs_rq);
+		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
 
 		decayed = 1;
 	}
@@ -3444,7 +3468,8 @@ static void attach_entity_load_avg(struc
 	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
-	set_tg_cfs_propagate(cfs_rq);
+
+	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3464,7 +3489,8 @@ static void detach_entity_load_avg(struc
 	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
-	set_tg_cfs_propagate(cfs_rq);
+
+	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3582,6 +3608,7 @@ void remove_entity_load_avg(struct sched
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
+	cfs_rq->removed.runnable_sum	+= se->avg.load_sum; /* == runnable_sum */
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
@@ -9369,9 +9396,6 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
 #ifdef CONFIG_SMP
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	cfs_rq->propagate_avg = 0;
-#endif
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -427,18 +427,19 @@ struct cfs_rq {
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	unsigned long tg_load_avg_contrib;
-	unsigned long propagate_avg;
-#endif
 	struct {
 		raw_spinlock_t	lock ____cacheline_aligned;
 		int		nr;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
+		unsigned long	runnable_sum;
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	unsigned long tg_load_avg_contrib;
+	long propagate;
+	long prop_runnable_sum;
+
 	/*
 	 *   h_load = weight * f(tg)
 	 *


* [RFC][PATCH 09/14] sched/fair: Propagate an effective runnable_load_avg
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 08/14] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 10/14] sched/fair: more obvious Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: tj-sched-fair-avg-runnable_load.patch --]
[-- Type: text/plain, Size: 17621 bytes --]

The load balancer uses runnable_load_avg as its load indicator. For
!cgroup this is:

  runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq

That is, a direct sum over all runnable tasks on that runqueue, as
opposed to load_avg, which is a sum over all tasks on the runqueue
and therefore also includes a blocked component.

However, in the cgroup case, this comes apart since the group entities
are always runnable, even if most of their constituent entities are
blocked.

Therefore introduce a runnable_weight which for task entities is the
same as the regular weight, but for group entities is a fraction of
the entity weight and represents the runnable part of the group
runqueue.

Then propagate this load through the PELT hierarchy to arrive at an
effective runnable load average -- which we should not confuse with
the canonical runnable load average.
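
As a rough illustration (numbers made up): a group entity whose shares
resolve to 1024, on a group runqueue with load_avg = 2048 but only
runnable_load_avg = 512 (three quarters of the load is blocked), gets

  runnable_weight = 1024 * 512 / max(2048, 512) = 256

so only a quarter of the group weight contributes to the parent's
runnable load, rather than the full 1024 the always-runnable group
entity would otherwise inject.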

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    3 
 kernel/sched/debug.c  |    8 +
 kernel/sched/fair.c   |  228 ++++++++++++++++++++++++++++++--------------------
 kernel/sched/sched.h  |    3 
 4 files changed, 150 insertions(+), 92 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -314,9 +314,11 @@ struct load_weight {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
+	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
+	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 };
 
@@ -359,6 +361,7 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
+	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -396,9 +396,11 @@ static void print_cfs_group_stats(struct
 		P_SCHEDSTAT(se->statistics.wait_count);
 	}
 	P(se->load.weight);
+	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
+	P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -513,10 +515,11 @@ void print_cfs_rq(struct seq_file *m, in
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
+	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->runnable_load_avg);
+			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
@@ -947,10 +950,13 @@ void proc_sched_show_task(struct task_st
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
+	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
+	P(se.avg.runnable_load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
+	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 #endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -743,8 +743,9 @@ void init_entity_runnable_average(struct
 	 * nothing has been attached to the task group yet.
 	 */
 	if (entity_is_task(se))
-		sa->load_avg = scale_load_down(se->load.weight);
-	sa->load_sum = LOAD_AVG_MAX;
+		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
+	sa->runnable_load_sum = sa->load_sum = LOAD_AVG_MAX;
+
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
@@ -2755,38 +2756,83 @@ static long calc_cfs_shares(struct cfs_r
 } while (0)
 
 /*
- * XXX we want to get rid of this helper and use the full load resolution.
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
+/*
+ * XXX we want to get rid of these helpers and use the full load resolution.
  */
 static inline long se_weight(struct sched_entity *se)
 {
 	return scale_load_down(se->load.weight);
 }
 
+static inline long se_runnable(struct sched_entity *se)
+{
+	return scale_load_down(se->runnable_weight);
+}
+
+static inline void
+enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	cfs_rq->runnable_weight += se->runnable_weight;
+
+	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
+	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
+}
+
+static inline void
+dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	cfs_rq->runnable_weight -= se->runnable_weight;
+
+	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
+	sub_positive(&cfs_rq->avg.runnable_load_sum,
+		     se_runnable(se) * se->avg.runnable_load_sum);
+}
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight)
+			    unsigned long weight, unsigned long runnable)
 {
 	unsigned long se_load_avg = se->avg.load_avg;
 	u64 se_load_sum = se_weight(se) * se->avg.load_sum;
 	u64 new_load_sum = scale_load_down(weight) * se->avg.load_sum;
+	u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
+
 		account_entity_dequeue(cfs_rq, se);
+		dequeue_entity_load_avg(cfs_rq, se);
 	}
 
-	se->avg.load_avg = div_u64(new_load_sum,
-			LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
+	se->avg.load_avg = div_u64(new_load_sum, divider);
+	se->avg.runnable_load_avg =
+		div_u64(scale_load_down(runnable) * se->avg.runnable_load_sum, divider);
 
+	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 	if (se->on_rq) {
+		/* XXX delta accounting for these */
+
 		account_entity_enqueue(cfs_rq, se);
-		add_positive(&cfs_rq->runnable_load_avg,
-				(long)(se->avg.load_avg - se_load_avg));
-		add_positive(&cfs_rq->runnable_load_sum,
-				(s64)(new_load_sum - se_load_sum));
+		enqueue_entity_load_avg(cfs_rq, se);
 	}
 
 	add_positive(&cfs_rq->avg.load_avg,
@@ -2797,31 +2843,45 @@ static void reweight_entity(struct cfs_r
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 
-static void update_cfs_shares(struct sched_entity *se)
+/*
+ * Recomputes the group entity based on the current state of its group
+ * runqueue.
+ */
+static void update_cfs_group(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-	long shares;
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+	long shares, runnable;
 
-	if (!cfs_rq)
+	if (!gcfs_rq)
 		return;
 
-	if (throttled_hierarchy(cfs_rq))
+	if (throttled_hierarchy(gcfs_rq))
 		return;
 
 #ifndef CONFIG_SMP
-	shares = READ_ONCE(cfs_rq->tg->shares);
+	shares = READ_ONCE(gcfs_rq->tg->shares);
 
 	if (likely(se->load.weight == shares))
 		return;
 #else
-	shares = calc_cfs_shares(cfs_rq);
+	shares = calc_cfs_shares(gcfs_rq);
 #endif
+	/*
+	 * The hierarchical runnable load metric is the proportional part
+	 * of this group's runnable_load_avg / load_avg.
+	 *
+	 * Note: we need to deal with very sporadic 'runnable > load' cases
+	 * due to numerical instability.
+	 */
+	runnable = shares * gcfs_rq->avg.runnable_load_avg;
+	if (runnable)
+		runnable /= max(gcfs_rq->avg.load_avg, gcfs_rq->avg.runnable_load_avg);
 
-	reweight_entity(cfs_rq_of(se), se, shares);
+	reweight_entity(cfs_rq_of(se), se, shares, runnable);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_shares(struct sched_entity *se)
+static inline void update_cfs_group(struct sched_entity *se)
 {
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -2905,7 +2965,7 @@ static u32 __accumulate_pelt_segments(u6
  */
 static __always_inline u32
 accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
-	       unsigned long weight, int running, struct cfs_rq *cfs_rq)
+	       unsigned long load, unsigned long runnable, int running)
 {
 	unsigned long scale_freq, scale_cpu;
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
@@ -2922,10 +2982,8 @@ accumulate_sum(u64 delta, int cpu, struc
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_sum =
-				decay_load(cfs_rq->runnable_load_sum, periods);
-		}
+		sa->runnable_load_sum =
+			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -2938,11 +2996,10 @@ accumulate_sum(u64 delta, int cpu, struc
 	sa->period_contrib = delta;
 
 	contrib = cap_scale(contrib, scale_freq);
-	if (weight) {
-		sa->load_sum += weight * contrib;
-		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * contrib;
-	}
+	if (load)
+		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_load_sum += runnable * contrib;
 	if (running)
 		sa->util_sum += contrib * scale_cpu;
 
@@ -2979,7 +3036,7 @@ accumulate_sum(u64 delta, int cpu, struc
  */
 static __always_inline int
 ___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
-		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
+		  unsigned long load, unsigned long runnable, int running)
 {
 	u64 delta;
 
@@ -3010,45 +3067,60 @@ ___update_load_sum(u64 now, int cpu, str
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
+	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long weight, struct cfs_rq *cfs_rq)
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
 	/*
 	 * Step 2: update *_avg.
 	 */
-	sa->load_avg = div_u64(weight * sa->load_sum, divider);
-	if (cfs_rq) {
-		cfs_rq->runnable_load_avg =
-			div_u64(cfs_rq->runnable_load_sum, divider);
-	}
+	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
 	sa->util_avg = sa->util_sum / divider;
 }
 
 /*
  * sched_entity:
  *
+ *   task:
+ *     se_runnable() == se_weight()
+ *
+ *   group: [ see update_cfs_group() ]
+ *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *
  *   load_sum := runnable_sum
  *   load_avg = se_weight(se) * runnable_avg
  *
+ *   runnable_load_sum := runnable_sum
+ *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *
+ * XXX collapse load_sum and runnable_load_sum
+ *
  * cfq_rs:
  *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
+ *
+ *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
+ *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
 static int
 __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
 {
-	if (___update_load_sum(now, cpu, &se->avg, 0, 0, NULL)) {
-		___update_load_avg(&se->avg, se_weight(se), NULL);
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
 		return 1;
 	}
 
@@ -3058,10 +3130,13 @@ __update_load_avg_blocked_se(u64 now, in
 static int
 __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq,
-				cfs_rq->curr == se, NULL)) {
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+				cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), NULL);
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
 		return 1;
 	}
 
@@ -3073,8 +3148,10 @@ __update_load_avg_cfs_rq(u64 now, int cp
 {
 	if (___update_load_sum(now, cpu, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				cfs_rq->curr != NULL, cfs_rq)) {
-		___update_load_avg(&cfs_rq->avg, 1, cfs_rq);
+				scale_load_down(cfs_rq->runnable_weight),
+				cfs_rq->curr != NULL)) {
+
+		___update_load_avg(&cfs_rq->avg, 1, 1);
 		return 1;
 	}
 
@@ -3253,8 +3330,8 @@ static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long load_avg;
-	s64 load_sum;
+	long runnable_load_avg, load_avg;
+	s64 runnable_load_sum, load_sum;
 
 	if (!runnable_sum)
 		return;
@@ -3270,9 +3347,15 @@ update_tg_cfs_runnable(struct cfs_rq *cf
 	add_positive(&cfs_rq->avg.load_avg, load_avg);
 	add_positive(&cfs_rq->avg.load_sum, load_sum);
 
+	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
+	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+
+	add_positive(&se->avg.runnable_load_sum, runnable_sum);
+	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+
 	if (se->on_rq) {
-		add_positive(&cfs_rq->runnable_load_avg, load_avg);
-		add_positive(&cfs_rq->runnable_load_sum, load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
 	}
 }
 
@@ -3372,23 +3455,6 @@ static inline void cfs_rq_util_change(st
 	}
 }
 
-/*
- * Unsigned subtract and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define sub_positive(_ptr, _val) do {				\
-	typeof(_ptr) ptr = (_ptr);				\
-	typeof(*ptr) val = (_val);				\
-	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
-	res = var - val;					\
-	if (res > var)						\
-		res = 0;					\
-	WRITE_ONCE(*ptr, res);					\
-} while (0)
-
 /**
  * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
  * @now: current time, as per cfs_rq_clock_task()
@@ -3529,22 +3595,6 @@ static inline void update_load_avg(struc
 		update_tg_load_avg(cfs_rq, 0);
 }
 
-/* Add the load generated by se into cfs_rq's load average */
-static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_load_avg += se->avg.load_avg;
-	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
-}
-
-/* Remove the runnable load generated by se from cfs_rq's runnable load average */
-static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
-	sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
-}
-
 #ifndef CONFIG_64BIT
 static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
 {
@@ -3614,7 +3664,7 @@ void remove_entity_load_avg(struct sched
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->runnable_load_avg;
+	return cfs_rq->avg.runnable_load_avg;
 }
 
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
@@ -3790,8 +3840,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
+	update_cfs_group(se);
 	enqueue_entity_load_avg(cfs_rq, se);
-	update_cfs_shares(se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -3897,7 +3947,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_shares(se);
+	update_cfs_group(se);
 
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
@@ -4080,7 +4130,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
-	update_cfs_shares(curr);
+	update_cfs_group(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
 	/*
@@ -4998,7 +5048,7 @@ enqueue_task_fair(struct rq *rq, struct
 			break;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_shares(se);
+		update_cfs_group(se);
 	}
 
 	if (!se)
@@ -5057,7 +5107,7 @@ static void dequeue_task_fair(struct rq
 			break;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_shares(se);
+		update_cfs_group(se);
 	}
 
 	if (!se)
@@ -7122,7 +7172,7 @@ static inline bool cfs_rq_is_decayed(str
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->runnable_load_sum)
+	if (cfs_rq->avg.runnable_load_sum)
 		return false;
 
 	return true;
@@ -9595,7 +9645,7 @@ int sched_group_set_shares(struct task_g
 		update_rq_clock(rq);
 		for_each_sched_entity(se) {
 			update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
-			update_cfs_shares(se);
+			update_cfs_group(se);
 		}
 		rq_unlock_irqrestore(rq, &rf);
 	}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -396,6 +396,7 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
+	unsigned long runnable_weight;
 	unsigned int nr_running, h_nr_running;
 
 	u64 exec_clock;
@@ -422,8 +423,6 @@ struct cfs_rq {
 	 * CFS load tracking
 	 */
 	struct sched_avg avg;
-	u64 runnable_load_sum;
-	unsigned long runnable_load_avg;
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif
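
As an aside, the proportional se_runnable() relation spelled out in the
comment block above can be sketched in a few lines of purely
illustrative userspace C; the struct, the helper and the numbers are
invented, only the arithmetic mirrors update_cfs_group():

#include <stdio.h>

/* Toy group-runqueue state; field names mirror the patch, values are
 * made up. */
struct grq {
        unsigned long load_avg;          /* \Sum se->avg.load_avg          */
        unsigned long runnable_load_avg; /* \Sum se->avg.runnable_load_avg */
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
        return a > b ? a : b;
}

/* Same shape as the new code in update_cfs_group(): the group entity
 * only carries the runnable fraction of its weight, clamped against the
 * sporadic 'runnable > load' case. */
static unsigned long group_runnable(unsigned long shares, const struct grq *grq)
{
        unsigned long runnable = shares * grq->runnable_load_avg;

        if (runnable)
                runnable /= max_ul(grq->load_avg, grq->runnable_load_avg);

        return runnable;
}

int main(void)
{
        struct grq grq = { .load_avg = 2048, .runnable_load_avg = 1024 };

        /* Half of the group's load is runnable, so a group entity of
         * weight 1024 contributes only 512 of runnable weight upward. */
        printf("runnable_weight = %lu\n", group_runnable(1024, &grq));
        return 0;
}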

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC][PATCH 10/14] sched/fair: more obvious
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (8 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 09/14] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 11/14] sched/fair: Synchronous PELT detach on load-balance migrate Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-frobbing.patch --]
[-- Type: text/plain, Size: 4856 bytes --]

One multiplication worse, but more obvious code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   57 +++++++++++++++++++++++++++-------------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2786,7 +2786,7 @@ static inline long se_runnable(struct sc
 }
 
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	cfs_rq->runnable_weight += se->runnable_weight;
 
@@ -2795,7 +2795,7 @@ enqueue_entity_load_avg(struct cfs_rq *c
 }
 
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	cfs_rq->runnable_weight -= se->runnable_weight;
 
@@ -2804,12 +2804,23 @@ dequeue_entity_load_avg(struct cfs_rq *c
 		     se_runnable(se) * se->avg.runnable_load_sum);
 }
 
+static inline void
+__add_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	cfs_rq->avg.load_avg += se->avg.load_avg;
+	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+}
+
+static inline void
+__sub_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+}
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight, unsigned long runnable)
 {
-	unsigned long se_load_avg = se->avg.load_avg;
-	u64 se_load_sum = se_weight(se) * se->avg.load_sum;
-	u64 new_load_sum = scale_load_down(weight) * se->avg.load_sum;
 	u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 	if (se->on_rq) {
@@ -2818,27 +2829,22 @@ static void reweight_entity(struct cfs_r
 			update_curr(cfs_rq);
 
 		account_entity_dequeue(cfs_rq, se);
-		dequeue_entity_load_avg(cfs_rq, se);
+		dequeue_runnable_load_avg(cfs_rq, se);
 	}
-
-	se->avg.load_avg = div_u64(new_load_sum, divider);
-	se->avg.runnable_load_avg =
-		div_u64(scale_load_down(runnable) * se->avg.runnable_load_sum, divider);
+	__sub_load_avg(cfs_rq, se);
 
 	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
-	if (se->on_rq) {
-		/* XXX delta accounting for these */
+	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
+	se->avg.runnable_load_avg =
+		div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
 
+	__add_load_avg(cfs_rq, se);
+	if (se->on_rq) {
 		account_entity_enqueue(cfs_rq, se);
-		enqueue_entity_load_avg(cfs_rq, se);
+		enqueue_runnable_load_avg(cfs_rq, se);
 	}
-
-	add_positive(&cfs_rq->avg.load_avg,
-			(long)(se->avg.load_avg - se_load_avg));
-	add_positive(&cfs_rq->avg.load_sum,
-			(s64)(new_load_sum - se_load_sum));
 }
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -3523,8 +3529,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
-	cfs_rq->avg.load_avg += se->avg.load_avg;
-	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+	__add_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
 
@@ -3543,9 +3548,7 @@ static void attach_entity_load_avg(struc
  */
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-
-	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
-	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+	__sub_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
@@ -3685,9 +3688,9 @@ static inline void update_load_avg(struc
 }
 
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
 
 static inline void
@@ -3834,7 +3837,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	update_cfs_group(se);
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_runnable_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -3917,7 +3920,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_runnable_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC][PATCH 11/14] sched/fair: Synchronous PELT detach on load-balance migrate
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (9 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 10/14] sched/fair: more obvious Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 12/14] sched/fair: Cure calc_cfs_shares() vs reweight_entity() Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-syn-migrate.patch --]
[-- Type: text/plain, Size: 2682 bytes --]

Vincent wondered why his self-migrating task had a roughly 50% dip in
load_avg when landing on the new CPU. This is because we unconditionally
take the asynchronous detach_entity route, which can lead to the
attach on the new CPU still seeing the old CPU's contribution to
tg->load_avg, effectively halving the new CPU's shares.

While in general this is something we have to live with, there is the
special case of runnable migration where we can do better.
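
A minimal sketch of that effect, assuming the usual
shares = tg->shares * grq->load_avg / tg->load_avg shape from
calc_cfs_shares(); the helper and all numbers below are invented:

#include <stdio.h>

/* Rough stand-in for calc_cfs_shares(); only the formula shape is
 * borrowed, everything else is invented. */
static unsigned long calc_shares(unsigned long tg_shares,
                                 unsigned long grq_load,
                                 unsigned long tg_load)
{
        return tg_shares * grq_load / tg_load;
}

int main(void)
{
        unsigned long tg_shares = 1024, task_load = 512;

        /* Synchronous detach: the old CPU's copy of the load is gone
         * from tg->load_avg before the task attaches on the new CPU. */
        printf("sync  detach: shares = %lu\n",
               calc_shares(tg_shares, task_load, task_load));

        /* Asynchronous detach: tg->load_avg still carries the old CPU's
         * copy, so the new CPU's shares come out roughly halved. */
        printf("async detach: shares = %lu\n",
               calc_shares(tg_shares, task_load, 2 * task_load));

        return 0;
}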

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3649,10 +3649,6 @@ void remove_entity_load_avg(struct sched
 	 * Similarly for groups, they will have passed through
 	 * post_init_entity_util_avg() before unregister_sched_fair_group()
 	 * calls this.
-	 *
-	 * XXX in case entity_is_task(se) && task_of(se)->on_rq == MIGRATING
-	 * we could actually get the right time, since we're called with
-	 * rq->lock held, see detach_task().
 	 */
 
 	sync_entity_load_avg(se);
@@ -6251,6 +6247,8 @@ select_task_rq_fair(struct task_struct *
 	return new_cpu;
 }
 
+static void detach_entity_cfs_rq(struct sched_entity *se);
+
 /*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -6284,14 +6282,25 @@ static void migrate_task_rq_fair(struct
 		se->vruntime -= min_vruntime;
 	}
 
-	/*
-	 * We are supposed to update the task to "current" time, then its up to date
-	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
-	 * what current time is, so simply throw away the out-of-date time. This
-	 * will result in the wakee task is less decayed, but giving the wakee more
-	 * load sounds not bad.
-	 */
-	remove_entity_load_avg(&p->se);
+	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
+		/*
+		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
+		 * rq->lock and can modify state directly.
+		 */
+		lockdep_assert_held(&task_rq(p)->lock);
+		detach_entity_cfs_rq(&p->se);
+
+	} else {
+		/*
+		 * We are supposed to update the task to "current" time, then
+		 * its up to date and ready to go to new CPU/cfs_rq. But we
+		 * have difficulty in getting what current time is, so simply
+		 * throw away the out-of-date time. This will result in the
+		 * wakee task is less decayed, but giving the wakee more load
+		 * sounds not bad.
+		 */
+		remove_entity_load_avg(&p->se);
+	}
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC][PATCH 12/14] sched/fair: Cure calc_cfs_shares() vs reweight_entity()
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (10 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 11/14] sched/fair: Synchronous PELT detach on load-balance migrate Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 13/14] sched/fair: Align PELT windows between cfs_rq and its se Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-calc_cfs_shares-fixup.patch --]
[-- Type: text/plain, Size: 1297 bytes --]

Vincent reported that when running in a cgroup, his root
cfs_rq->avg.load_avg dropped to 0 on task idle.

This is because reweight_entity() will now immediately propagate the
weight change of the group entity to its cfs_rq, and as it happens,
our approximation (5) for calc_cfs_shares() results in 0 when the group
is idle.

Avoid this by using the correct (3) as a lower bound on (5). This way
the empty cgroup will slowly decay instead of instantly dropping to 0.
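
Roughly, with invented numbers and only the max()-as-lower-bound shape
taken from the hunk below:

#include <stdio.h>

static unsigned long max_ul(unsigned long a, unsigned long b)
{
        return a > b ? a : b;
}

/* Rough stand-in for the shares computation; invented helper. */
static unsigned long shares(unsigned long tg_shares, unsigned long load,
                            unsigned long tg_load)
{
        return tg_shares * load / tg_load;
}

int main(void)
{
        unsigned long tg_shares = 1024, tg_load = 2048;
        unsigned long weight   = 0;     /* the group just went idle      */
        unsigned long load_avg = 256;   /* but its load_avg still decays */

        printf("weight only: %lu\n", shares(tg_shares, weight, tg_load));
        printf("with max() : %lu\n",
               shares(tg_shares, max_ul(weight, load_avg), tg_load));
        return 0;
}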

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2703,11 +2703,10 @@ static long calc_cfs_shares(struct cfs_r
 	tg_shares = READ_ONCE(tg->shares);
 
 	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
+	 * Because (5) drops to 0 when the cfs_rq is idle, we need to use (3)
+	 * as a lower bound.
 	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC][PATCH 13/14] sched/fair: Align PELT windows between cfs_rq and its se
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (11 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 12/14] sched/fair: Cure calc_cfs_shares() vs reweight_entity() Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-12 16:44 ` [RFC][PATCH 14/14] sched/fair: More accurate async detach Peter Zijlstra
  2017-05-16 22:02 ` [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Tejun Heo
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-align-windows.patch --]
[-- Type: text/plain, Size: 3219 bytes --]

The PELT _sum values are a saw-tooth function, dropping on the decay
edge and then growing back up again during the window.

When these window-edges are not aligned between cfs_rq and se, we can
have the situation where, for example, on dequeue, the se decays
first.

Its _sum values will be small(er), while the cfs_rq _sum values will
still be on their way up. Because of this, the subtraction
cfs_rq->avg._sum -= se->avg._sum leaves a positive residue behind. This
will then, once the cfs_rq reaches an edge, translate into its _avg
value jumping up.

This is especially visible with the runnable_load bits, since they get
added/subtracted a lot.
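
A toy model of the mis-aligned windows; the halving 'edge' and all
numbers are gross simplifications of real PELT, kept only to show the
leftover residue:

#include <stdio.h>

struct toy_avg {
        long sum;
};

/* Crossing a window edge decays the accumulated sum; real PELT uses
 * y^32 = 1/2, here we simply halve. */
static void cross_edge(struct toy_avg *a)
{
        a->sum /= 2;
}

int main(void)
{
        struct toy_avg se     = { .sum = 1000 };
        struct toy_avg cfs_rq = { .sum = 1000 }; /* se is the only contributor */

        /* The se's window started earlier, so it hits its edge and
         * decays while the cfs_rq is still accumulating. */
        cross_edge(&se);

        /* Dequeue: subtracting the already-decayed se sum from the
         * not-yet-decayed cfs_rq sum leaves a positive residue, even
         * though the se was the only contributor. */
        cfs_rq.sum -= se.sum;
        printf("residual cfs_rq sum after dequeue: %ld\n", cfs_rq.sum);
        return 0;
}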

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -729,13 +729,8 @@ void init_entity_runnable_average(struct
 {
 	struct sched_avg *sa = &se->avg;
 
-	sa->last_update_time = 0;
-	/*
-	 * sched_avg's period_contrib should be strictly less then 1024, so
-	 * we give it 1023 to make sure it is almost a period (1024us), and
-	 * will definitely be update (after enqueue).
-	 */
-	sa->period_contrib = 1023;
+	memset(sa, 0, sizeof(*sa));
+
 	/*
 	 * Tasks are intialized with full load to be seen as heavy tasks until
 	 * they get a chance to stabilize to their real load level.
@@ -744,13 +739,9 @@ void init_entity_runnable_average(struct
 	 */
 	if (entity_is_task(se))
 		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-	sa->runnable_load_sum = sa->load_sum = LOAD_AVG_MAX;
 
-	/*
-	 * At this point, util_avg won't be used in select_task_rq_fair anyway
-	 */
-	sa->util_avg = 0;
-	sa->util_sum = 0;
+	se->runnable_weight = se->load.weight;
+
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
@@ -798,7 +789,6 @@ void post_init_entity_util_avg(struct sc
 		} else {
 			sa->util_avg = cap;
 		}
-		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
 
 	if (entity_is_task(se)) {
@@ -3527,7 +3517,34 @@ update_cfs_rq_load_avg(u64 now, struct c
  */
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
+
+	/*
+	 * When we attach the @se to the @cfs_rq, we must align the decay
+	 * window because without that, really weird and wonderful things can
+	 * happen.
+	 *
+	 * XXX illustrate
+	 */
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
+	se->avg.period_contrib = cfs_rq->avg.period_contrib;
+
+	/*
+	 * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
+	 * period_contrib. This isn't strictly correct, but since we're
+	 * entirely outside of the PELT hierarchy, nobody cares if we truncate
+	 * _sum a little.
+	 */
+	se->avg.util_sum = se->avg.util_avg * divider;
+
+	se->avg.load_sum = divider;
+	if (se_weight(se)) {
+		se->avg.load_sum =
+			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
+	}
+
+	se->avg.runnable_load_sum = se->avg.load_sum;
+
 	__add_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC][PATCH 14/14] sched/fair: More accurate async detach
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (12 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 13/14] sched/fair: Align PELT windows between cfs_rq and its se Peter Zijlstra
@ 2017-05-12 16:44 ` Peter Zijlstra
  2017-05-16 22:02 ` [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Tejun Heo
  14 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-12 16:44 UTC (permalink / raw)
  To: mingo, linux-kernel, tj
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-more-accurate-remove.patch --]
[-- Type: text/plain, Size: 1379 bytes --]

The problem with the overestimate is that it will subtract too big a
value from the load_sum, thereby pushing it down further than it ought
to go. Since runnable_load_avg is not subject to a similar 'force',
this results in the occasional 'runnable_load > load' situation.
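
For scale, a small sketch of the two subtractions; LOAD_AVG_MAX is the
kernel constant, period_contrib and the removed load_avg are invented:

#include <stdio.h>

#define LOAD_AVG_MAX 47742 /* maximum possible load avg sum */

int main(void)
{
        unsigned long period_contrib = 200; /* somewhere inside a window  */
        unsigned long removed_avg    = 300; /* load_avg of departed tasks */
        unsigned long divider = LOAD_AVG_MAX - 1024 + period_contrib;

        printf("old subtraction from load_sum: %lu\n",
               removed_avg * LOAD_AVG_MAX);
        printf("new subtraction from load_sum: %lu\n",
               removed_avg * divider);
        printf("over-estimate no longer pushed into sub_positive(): %lu\n",
               removed_avg * (LOAD_AVG_MAX - divider));
        return 0;
}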

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3469,6 +3469,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 
 	if (cfs_rq->removed.nr) {
 		unsigned long r;
+		u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
@@ -3477,17 +3478,13 @@ update_cfs_rq_load_avg(u64 now, struct c
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
-		/*
-		 * The LOAD_AVG_MAX for _sum is a slight over-estimate,
-		 * which is safe due to sub_positive() clipping at 0.
-		 */
 		r = removed_load;
 		sub_positive(&sa->load_avg, r);
-		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
+		sub_positive(&sa->load_sum, r * divider);
 
 		r = removed_util;
 		sub_positive(&sa->util_avg, r);
-		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
+		sub_positive(&sa->util_sum, r * divider);
 
 		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 01/14] sched/fair: Clean up calc_cfs_shares()
  2017-05-12 16:44 ` [RFC][PATCH 01/14] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
@ 2017-05-16  8:02   ` Vincent Guittot
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Guittot @ 2017-05-16  8:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, Dietmar Eggemann,
	Morten Rasmussen, Ben Segall, Yuyang Du

On 12 May 2017 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:
> For consistencies sake, we should have only a single reading of
> tg->shares.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again)..
  2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
                   ` (13 preceding siblings ...)
  2017-05-12 16:44 ` [RFC][PATCH 14/14] sched/fair: More accurate async detach Peter Zijlstra
@ 2017-05-16 22:02 ` Tejun Heo
  2017-05-17  6:53   ` Peter Zijlstra
  14 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2017-05-16 22:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, torvalds, vincent.guittot, efault, pjt, clm,
	dietmar.eggemann, morten.rasmussen, bsegall, yuyang.du

Hello,

On Fri, May 12, 2017 at 06:44:16PM +0200, Peter Zijlstra wrote:
> 
> Hi all,
> 
> So after staring at all that PELT stuff and working my way through it again:
> 
>   https://lkml.kernel.org/r/20170505154117.6zldxuki2fgyo53n@hirez.programming.kicks-ass.net
> 
> I started doing some patches to fix some of the identified broken.
> 
> So here are a few too many patches that do:
> 
>  - fix 'reweight_entity' to instantly propagate the change in se->load.weight.
> 
>  - rewrite/fix the propagate on migrate (attach/detach)
> 
>  - introduce the hierarchical runnable_load_avg, as proposed by Tejun.
> 
>  - synchronous detach for runnable migrates
> 
>  - aligns the PELT windows between a cfs_rq and all its se's
> 
>  - deals with random fallout from the above (some of this needs folding back
>    and reordering, but its all well past the point I should post this anyway).

The schbench results look good here.  I'll put it under more realistic
test tomorrow and see how it performs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again)..
  2017-05-16 22:02 ` [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Tejun Heo
@ 2017-05-17  6:53   ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-17  6:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mingo, linux-kernel, torvalds, vincent.guittot, efault, pjt, clm,
	dietmar.eggemann, morten.rasmussen, bsegall, yuyang.du, jbacik

On Tue, May 16, 2017 at 06:02:17PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 12, 2017 at 06:44:16PM +0200, Peter Zijlstra wrote:
> > 
> > Hi all,
> > 
> > So after staring at all that PELT stuff and working my way through it again:
> > 
> >   https://lkml.kernel.org/r/20170505154117.6zldxuki2fgyo53n@hirez.programming.kicks-ass.net
> > 
> > I started doing some patches to fix some of the identified broken.
> > 
> > So here are a few too many patches that do:
> > 
> >  - fix 'reweight_entity' to instantly propagate the change in se->load.weight.
> > 
> >  - rewrite/fix the propagate on migrate (attach/detach)
> > 
> >  - introduce the hierarchical runnable_load_avg, as proposed by Tejun.
> > 
> >  - synchronous detach for runnable migrates
> > 
> >  - aligns the PELT windows between a cfs_rq and all its se's
> > 
> >  - deals with random fallout from the above (some of this needs folding back
> >    and reordering, but its all well past the point I should post this anyway).
> 
> The schbench results look good here.  I'll put it under more realistic
> test tomorrow and see how it performs.

Josef reported some problems with these patches and narrowed it down to:

  "sched/fair: Remove se->load.weight from se->avg.load_sum"

I'm currently stuck chasing a regression in other code, but I'll try and
get back to this shortly.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum
  2017-05-12 16:44 ` [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
@ 2017-05-17  7:04   ` Vincent Guittot
  2017-05-17  9:50     ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2017-05-17  7:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, Dietmar Eggemann,
	Morten Rasmussen, Ben Segall, Yuyang Du

Hi Peter,

On 12 May 2017 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:
> Remove the load from the load_sum for sched_entities, basically
> turning load_sum into runnable_sum.  This prepares for better
> reweighting of group entities.
>
> Since we now have different rules for computing load_avg, split
> ___update_load_avg() into two parts, ___update_load_sum() and
> ___update_load_avg().
>
> So for se:
>
>   ___update_load_sum(.weight = 1)
>   ___upate_load_avg(.weight = se->load.weight)
>
> and for cfs_rq:
>
>   ___update_load_sum(.weight = cfs_rq->load.weight)
>   ___upate_load_avg(.weight = 1)
>
> Since the primary consumable is load_avg, most things will not be
> affected. Only those few sites that initialize/modify load_sum need
> attention.

I wonder if there is a problem with this new way to compute se's
load_avg and cfs_rq's load_avg when a task changes its nice prio before
migrating to another CPU.

se load_avg is now: runnable x current weight,
but cfs_rq load_avg keeps the history built with the se's previous weight.
When we detach the se, we remove an up-to-date se load_avg from a cfs_rq
which doesn't have that up-to-date load_avg in its own load_avg. So if
the se's prio decreases just before migrating, some load_avg stays in
the prev cfs_rq, and if the se's prio increases, we remove too much
load_avg and possibly make the cfs_rq load_avg null while other tasks
are still running.
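
A rough userspace illustration of what I mean; all weights and
averages below are made up:

#include <stdio.h>

int main(void)
{
        long runnable_avg = 800;  /* the task's runnable history          */
        long old_weight   = 1024; /* weight while that history was built  */
        long new_weight   = 2048; /* weight after the nice change         */
        long other_load   = 500;  /* load_avg of the other queued tasks   */

        /* cfs_rq->avg.load_avg still carries the task at its old weight */
        long cfs_rq_load_avg = other_load + old_weight * runnable_avg / 1024;

        /* ... but detach removes se->avg.load_avg, which is already
         * scaled by the new weight, so too much is taken away. */
        long se_load_avg = new_weight * runnable_avg / 1024;

        long left = cfs_rq_load_avg - se_load_avg;
        if (left < 0)   /* sub_positive() clamps at zero */
                left = 0;

        printf("cfs_rq load_avg before detach: %ld\n", cfs_rq_load_avg);
        printf("removed at detach            : %ld\n", se_load_avg);
        printf("left for the remaining tasks : %ld (should be %ld)\n",
               left, other_load);
        return 0;
}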

Thoughts?

I'm able to reproduce the problem with a simple rt-app use case (after
adding a new feature in rt-app)

Vincent

>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   91 ++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 64 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -744,7 +744,7 @@ void init_entity_runnable_average(struct
>          */
>         if (entity_is_task(se))
>                 sa->load_avg = scale_load_down(se->load.weight);
> -       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> +       sa->load_sum = LOAD_AVG_MAX;
>         /*
>          * At this point, util_avg won't be used in select_task_rq_fair anyway
>          */
> @@ -1967,7 +1967,7 @@ static u64 numa_get_avg_runtime(struct t
>                 delta = runtime - p->last_sum_exec_runtime;
>                 *period = now - p->last_task_numa_placement;
>         } else {
> -               delta = p->se.avg.load_sum / p->se.load.weight;
> +               delta = p->se.avg.load_sum;
>                 *period = LOAD_AVG_MAX;
>         }
>
> @@ -2872,8 +2872,8 @@ accumulate_sum(u64 delta, int cpu, struc
>   *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
>   */
>  static __always_inline int
> -___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> -                 unsigned long weight, int running, struct cfs_rq *cfs_rq)
> +___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
> +                  unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
>         u64 delta;
>
> @@ -2907,39 +2907,80 @@ ___update_load_avg(u64 now, int cpu, str
>         if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
>                 return 0;
>
> +       return 1;
> +}
> +
> +static __always_inline void
> +___update_load_avg(struct sched_avg *sa, unsigned long weight, struct cfs_rq *cfs_rq)
> +{
> +       u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
> +
>         /*
>          * Step 2: update *_avg.
>          */
> -       sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
> +       sa->load_avg = div_u64(weight * sa->load_sum, divider);
>         if (cfs_rq) {
>                 cfs_rq->runnable_load_avg =
> -                       div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
> +                       div_u64(cfs_rq->runnable_load_sum, divider);
>         }
> -       sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);
> +       sa->util_avg = sa->util_sum / divider;
> +}
>
> -       return 1;
> +/*
> + * XXX we want to get rid of this helper and use the full load resolution.
> + */
> +static inline long se_weight(struct sched_entity *se)
> +{
> +       return scale_load_down(se->load.weight);
>  }
>
> +/*
> + * sched_entity:
> + *
> + *   load_sum := runnable_sum
> + *   load_avg = se_weight(se) * runnable_avg
> + *
> + * cfq_rs:
> + *
> + *   load_sum = \Sum se_weight(se) * se->avg.load_sum
> + *   load_avg = \Sum se->avg.load_avg
> + */
> +
>  static int
>  __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
>  {
> -       return ___update_load_avg(now, cpu, &se->avg, 0, 0, NULL);
> +       if (___update_load_sum(now, cpu, &se->avg, 0, 0, NULL)) {
> +               ___update_load_avg(&se->avg, se_weight(se), NULL);
> +               return 1;
> +       }
> +
> +       return 0;
>  }
>
>  static int
>  __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       return ___update_load_avg(now, cpu, &se->avg,
> -                                 se->on_rq * scale_load_down(se->load.weight),
> -                                 cfs_rq->curr == se, NULL);
> +       if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq,
> +                               cfs_rq->curr == se, NULL)) {
> +
> +               ___update_load_avg(&se->avg, se_weight(se), NULL);
> +               return 1;
> +       }
> +
> +       return 0;
>  }
>
>  static int
>  __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>  {
> -       return ___update_load_avg(now, cpu, &cfs_rq->avg,
> -                       scale_load_down(cfs_rq->load.weight),
> -                       cfs_rq->curr != NULL, cfs_rq);
> +       if (___update_load_sum(now, cpu, &cfs_rq->avg,
> +                               scale_load_down(cfs_rq->load.weight),
> +                               cfs_rq->curr != NULL, cfs_rq)) {
> +               ___update_load_avg(&cfs_rq->avg, 1, cfs_rq);
> +               return 1;
> +       }
> +
> +       return 0;
>  }
>
>  /*
> @@ -3110,7 +3151,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>
>         /* Set new sched_entity's load */
>         se->avg.load_avg = load;
> -       se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
> +       se->avg.load_sum = LOAD_AVG_MAX;
>
>         /* Update parent cfs_rq load */
>         add_positive(&cfs_rq->avg.load_avg, delta);
> @@ -3340,7 +3381,7 @@ static void attach_entity_load_avg(struc
>  {
>         se->avg.last_update_time = cfs_rq->avg.last_update_time;
>         cfs_rq->avg.load_avg += se->avg.load_avg;
> -       cfs_rq->avg.load_sum += se->avg.load_sum;
> +       cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
>         cfs_rq->avg.util_avg += se->avg.util_avg;
>         cfs_rq->avg.util_sum += se->avg.util_sum;
>         set_tg_cfs_propagate(cfs_rq);
> @@ -3360,7 +3401,7 @@ static void detach_entity_load_avg(struc
>  {
>
>         sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
> -       sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
> +       sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
>         sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
>         sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
>         set_tg_cfs_propagate(cfs_rq);
> @@ -3372,12 +3413,10 @@ static void detach_entity_load_avg(struc
>  static inline void
>  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       struct sched_avg *sa = &se->avg;
> -
> -       cfs_rq->runnable_load_avg += sa->load_avg;
> -       cfs_rq->runnable_load_sum += sa->load_sum;
> +       cfs_rq->runnable_load_avg += se->avg.load_avg;
> +       cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
>
> -       if (!sa->last_update_time) {
> +       if (!se->avg.last_update_time) {
>                 attach_entity_load_avg(cfs_rq, se);
>                 update_tg_load_avg(cfs_rq, 0);
>         }
> @@ -3387,10 +3426,8 @@ enqueue_entity_load_avg(struct cfs_rq *c
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       cfs_rq->runnable_load_avg =
> -               max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> -       cfs_rq->runnable_load_sum =
> -               max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
> +       sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
> +       sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
>  }
>
>  #ifndef CONFIG_64BIT
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum
  2017-05-17  7:04   ` Vincent Guittot
@ 2017-05-17  9:50     ` Vincent Guittot
  2017-05-17 14:20       ` Peter Zijlstra
  2017-09-29 20:11       ` [tip:sched/core] sched/fair: Use reweight_entity() for set_user_nice() tip-bot for Vincent Guittot
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Guittot @ 2017-05-17  9:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, Dietmar Eggemann,
	Morten Rasmussen, Ben Segall, Yuyang Du

On Wednesday 17 May 2017 at 09:04:47 (+0200), Vincent Guittot wrote:
> Hi Peter,
> 
> On 12 May 2017 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:
> > Remove the load from the load_sum for sched_entities, basically
> > turning load_sum into runnable_sum.  This prepares for better
> > reweighting of group entities.
> >
> > Since we now have different rules for computing load_avg, split
> > ___update_load_avg() into two parts, ___update_load_sum() and
> > ___update_load_avg().
> >
> > So for se:
> >
> >   ___update_load_sum(.weight = 1)
> >   ___upate_load_avg(.weight = se->load.weight)
> >
> > and for cfs_rq:
> >
> >   ___update_load_sum(.weight = cfs_rq->load.weight)
> >   ___upate_load_avg(.weight = 1)
> >
> > Since the primary consumable is load_avg, most things will not be
> > affected. Only those few sites that initialize/modify load_sum need
> > attention.
> 
> I wonder if there is a problem with this new way to compute se's
> load_avg and cfs_rq's load_avg when a task changes its nice prio before
> migrating to another CPU.
> 
> se load_avg is now: runnable x current weight,
> but cfs_rq load_avg keeps the history built with the se's previous weight.
> When we detach the se, we remove an up-to-date se load_avg from a cfs_rq
> which doesn't have that up-to-date load_avg in its own load_avg. So if
> the se's prio decreases just before migrating, some load_avg stays in
> the prev cfs_rq, and if the se's prio increases, we remove too much
> load_avg and possibly make the cfs_rq load_avg null while other tasks
> are still running.
> 
> Thoughts?
> 
> I'm able to reproduce the problem with a simple rt-app use case (after
> adding a new feature in rt-app)
> 
> Vincent
>

The hack below fixes the problem I mentioned above. It applies to tasks what is
already done when updating a group_entity's weight.

---
 kernel/sched/core.c | 26 +++++++++++++++++++-------
 kernel/sched/fair.c | 20 ++++++++++++++++++++
 2 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 09e0996..f327ab6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -732,7 +732,9 @@ int tg_nop(struct task_group *tg, void *data)
 }
 #endif
 
-static void set_load_weight(struct task_struct *p)
+extern void reweight_task(struct task_struct *p, int prio);
+
+static void set_load_weight(struct task_struct *p, bool update_load)
 {
 	int prio = p->static_prio - MAX_RT_PRIO;
 	struct load_weight *load = &p->se.load;
@@ -746,8 +748,18 @@ static void set_load_weight(struct task_struct *p)
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
-	load->inv_weight = sched_prio_to_wmult[prio];
+	/*
+	 * SCHED_OTHER tasks have to update their load when changing their
+	 * weight
+	 */
+	if (update_load) {
+		reweight_task(p, prio);
+	} else {
+		load->weight = scale_load(sched_prio_to_weight[prio]);
+		load->inv_weight = sched_prio_to_wmult[prio];
+	}
+
+
 }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2373,7 +2385,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 			p->static_prio = NICE_TO_PRIO(0);
 
 		p->prio = p->normal_prio = __normal_prio(p);
-		set_load_weight(p);
+		set_load_weight(p, false);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -3854,7 +3866,7 @@ void set_user_nice(struct task_struct *p, long nice)
 		put_prev_task(rq, p);
 
 	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p);
+	set_load_weight(p, true);
 	old_prio = p->prio;
 	p->prio = effective_prio(p);
 	delta = p->prio - old_prio;
@@ -4051,7 +4063,7 @@ static void __setscheduler_params(struct task_struct *p,
 	 */
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
-	set_load_weight(p);
+	set_load_weight(p, fair_policy(policy));
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
@@ -6152,7 +6164,7 @@ void __init sched_init(void)
 		atomic_set(&rq->nr_iowait, 0);
 	}
 
-	set_load_weight(&init_task);
+	set_load_weight(&init_task, false);
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd2f9f5..3853e34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,6 +2825,26 @@ __sub_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
 }
 
+void reweight_task(struct task_struct *p, int prio)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct load_weight *load = &p->se.load;
+
+	u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
+
+	__sub_load_avg(cfs_rq, se);
+
+	load->weight = scale_load(sched_prio_to_weight[prio]);
+	load->inv_weight = sched_prio_to_wmult[prio];
+
+	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
+	se->avg.runnable_load_avg =
+		div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
+
+	__add_load_avg(cfs_rq, se);
+}
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight, unsigned long runnable)
 {
-- 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 05/14] sched/fair: Change update_load_avg() arguments
  2017-05-12 16:44 ` [RFC][PATCH 05/14] sched/fair: Change update_load_avg() arguments Peter Zijlstra
@ 2017-05-17 10:46   ` Vincent Guittot
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Guittot @ 2017-05-17 10:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, Dietmar Eggemann,
	Morten Rasmussen, Ben Segall, Yuyang Du

On 12 May 2017 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:
> Most call sites of update_load_avg() already have cfs_rq_of(se)
> available, pass it down instead of recomputing it.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum
  2017-05-17  9:50     ` Vincent Guittot
@ 2017-05-17 14:20       ` Peter Zijlstra
  2017-09-29 20:11       ` [tip:sched/core] sched/fair: Use reweight_entity() for set_user_nice() tip-bot for Vincent Guittot
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-05-17 14:20 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, Dietmar Eggemann,
	Morten Rasmussen, Ben Segall, Yuyang Du

On Wed, May 17, 2017 at 11:50:45AM +0200, Vincent Guittot wrote:
> On Wednesday 17 May 2017 at 09:04:47 (+0200), Vincent Guittot wrote:

> > I wonder if there is a problem with this new way to compute se's
> > load_avg and cfs_rq's load_avg when a task changes its nice prio before
> > migrating to another CPU.
> > 
> > se load_avg is now: runnable x current weight,
> > but cfs_rq load_avg keeps the history built with the se's previous weight.
> > When we detach the se, we remove an up-to-date se load_avg from a cfs_rq
> > which doesn't have that up-to-date load_avg in its own load_avg. So if
> > the se's prio decreases just before migrating, some load_avg stays in
> > the prev cfs_rq, and if the se's prio increases, we remove too much
> > load_avg and possibly make the cfs_rq load_avg null while other tasks
> > are still running.
> > 
> > Thoughts?
> > 
> > I'm able to reproduce the problem with a simple rt-app use case (after
> > adding a new feature in rt-app)
> > 
> > Vincent
> >
> 
> The hack below fixes the problem I mentioned above. It applies to tasks what is
> already done when updating a group_entity's weight.

Right. I didn't think people much used nice so I skipped it for now. Yes
your patch looks about right for that.

I'll include it, thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 06/14] sched/fair: Move enqueue migrate handling
  2017-05-12 16:44 ` [RFC][PATCH 06/14] sched/fair: Move enqueue migrate handling Peter Zijlstra
@ 2017-05-29 13:41   ` Vincent Guittot
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Guittot @ 2017-05-29 13:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, Dietmar Eggemann,
	Morten Rasmussen, Ben Segall, Yuyang Du

On 12 May 2017 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:
> Move the entity migrate handling from enqueue_entity_load_avg() to
> update_load_avg(). This has two benefits:
>
>  - {en,de}queue_entity_load_avg() will become purely about managing
>    runnable_load
>
>  - we can avoid a double update_tg_load_avg() and reduce pressure on
>    the global tg->shares cacheline
>
> The reason we do this is so that we can change update_cfs_shares() to
> change both weight and (future) runnable_weight. For this to work we
> need to have the cfs_rq averages up-to-date (which means having done
> the attach), but we need the cfs_rq->avg.runnable_avg to not yet
> include the se's contribution (since se->on_rq == 0).
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c |   70 ++++++++++++++++++++++++++--------------------------
>  1 file changed, 36 insertions(+), 34 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3418,34 +3418,6 @@ update_cfs_rq_load_avg(u64 now, struct c
>         return decayed || removed_load;
>  }
>
> -/*
> - * Optional action to be done while updating the load average
> - */
> -#define UPDATE_TG      0x1
> -#define SKIP_AGE_LOAD  0x2
> -
> -/* Update task and its cfs_rq load average */
> -static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> -{
> -       u64 now = cfs_rq_clock_task(cfs_rq);
> -       struct rq *rq = rq_of(cfs_rq);
> -       int cpu = cpu_of(rq);
> -       int decayed;
> -
> -       /*
> -        * Track task load average for carrying it to new CPU after migrated, and
> -        * track group sched_entity load average for task_h_load calc in migration
> -        */
> -       if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> -               __update_load_avg_se(now, cpu, cfs_rq, se);
> -
> -       decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
> -       decayed |= propagate_entity_load_avg(se);
> -
> -       if (decayed && (flags & UPDATE_TG))
> -               update_tg_load_avg(cfs_rq, 0);
> -}
> -
>  /**
>   * attach_entity_load_avg - attach this entity to its cfs_rq load avg
>   * @cfs_rq: cfs_rq to attach to
> @@ -3486,17 +3458,46 @@ static void detach_entity_load_avg(struc
>         cfs_rq_util_change(cfs_rq);
>  }
>
> +/*
> + * Optional action to be done while updating the load average
> + */
> +#define UPDATE_TG      0x1
> +#define SKIP_AGE_LOAD  0x2
> +#define DO_ATTACH      0x4
> +
> +/* Update task and its cfs_rq load average */
> +static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +{
> +       u64 now = cfs_rq_clock_task(cfs_rq);
> +       struct rq *rq = rq_of(cfs_rq);
> +       int cpu = cpu_of(rq);
> +       int decayed;
> +
> +       /*
> +        * Track task load average for carrying it to new CPU after migrated, and
> +        * track group sched_entity load average for task_h_load calc in migration
> +        */
> +       if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> +               __update_load_avg_se(now, cpu, cfs_rq, se);
> +
> +       decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
> +       decayed |= propagate_entity_load_avg(se);
> +
> +       if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> +
> +               attach_entity_load_avg(cfs_rq, se);
> +               update_tg_load_avg(cfs_rq, 0);
> +
> +       } else if (decayed && (flags & UPDATE_TG))
> +               update_tg_load_avg(cfs_rq, 0);
> +}
> +
>  /* Add the load generated by se into cfs_rq's load average */
>  static inline void
>  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         cfs_rq->runnable_load_avg += se->avg.load_avg;
>         cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
> -
> -       if (!se->avg.last_update_time) {
> -               attach_entity_load_avg(cfs_rq, se);
> -               update_tg_load_avg(cfs_rq, 0);
> -       }
>  }
>
>  /* Remove the runnable load generated by se from cfs_rq's runnable load average */
> @@ -3586,6 +3587,7 @@ update_cfs_rq_load_avg(u64 now, struct c
>
>  #define UPDATE_TG      0x0
>  #define SKIP_AGE_LOAD  0x0
> +#define DO_ATTACH      0x0
>
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
> @@ -3740,7 +3742,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>          *     its group cfs_rq
>          *   - Add its new weight to cfs_rq->load.weight
>          */
> -       update_load_avg(cfs_rq, se, UPDATE_TG);
> +       update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
>         enqueue_entity_load_avg(cfs_rq, se);
>         update_cfs_shares(se);
>         account_entity_enqueue(cfs_rq, se);
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread
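
Condensed from the hunks quoted above, the resulting enqueue-side flow is
roughly the following sketch (names taken from the patch, bodies trimmed): the
attach now runs inside update_load_avg(), after the cfs_rq averages are brought
up to date but while se->on_rq is still 0, exactly as the changelog requires.

	/* Flag bits accepted by update_load_avg() after this patch */
	#define UPDATE_TG	0x1	/* propagate a decayed average into tg->load_avg */
	#define SKIP_AGE_LOAD	0x2	/* caller already aged the se */
	#define DO_ATTACH	0x4	/* attach a new/migrated se once the cfs_rq is current */

	static void
	enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
	{
		/* attach happens in here, before the se's runnable contribution is added */
		update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
		enqueue_entity_load_avg(cfs_rq, se);
		update_cfs_shares(se);
		account_entity_enqueue(cfs_rq, se);
		/* ... remainder of enqueue_entity() unchanged ... */
	}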

* [tip:sched/core] sched/fair: Use reweight_entity() for set_user_nice()
  2017-05-17  9:50     ` Vincent Guittot
  2017-05-17 14:20       ` Peter Zijlstra
@ 2017-09-29 20:11       ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 24+ messages in thread
From: tip-bot for Vincent Guittot @ 2017-09-29 20:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, linux-kernel, josef, hpa, tglx, vincent.guittot, mingo, peterz

Commit-ID:  9059393e4ec1c8c6623a120b405ef2c90b968d80
Gitweb:     https://git.kernel.org/tip/9059393e4ec1c8c6623a120b405ef2c90b968d80
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Wed, 17 May 2017 11:50:45 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 29 Sep 2017 19:35:14 +0200

sched/fair: Use reweight_entity() for set_user_nice()

Now that we directly change load_avg and propagate that change into
the sums, sys_nice() and co. should do the same; otherwise it's possible
to confuse load accounting when we migrate near the weight change.

Fixes-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added changelog, fixed the call condition. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 22 ++++++++++++------
 kernel/sched/fair.c  | 63 ++++++++++++++++++++++++++++++----------------------
 kernel/sched/sched.h |  2 ++
 3 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da..2288a14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -733,7 +733,7 @@ int tg_nop(struct task_group *tg, void *data)
 }
 #endif
 
-static void set_load_weight(struct task_struct *p)
+static void set_load_weight(struct task_struct *p, bool update_load)
 {
 	int prio = p->static_prio - MAX_RT_PRIO;
 	struct load_weight *load = &p->se.load;
@@ -747,8 +747,16 @@ static void set_load_weight(struct task_struct *p)
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
-	load->inv_weight = sched_prio_to_wmult[prio];
+	/*
+	 * SCHED_OTHER tasks have to update their load when changing their
+	 * weight
+	 */
+	if (update_load && p->sched_class == &fair_sched_class) {
+		reweight_task(p, prio);
+	} else {
+		load->weight = scale_load(sched_prio_to_weight[prio]);
+		load->inv_weight = sched_prio_to_wmult[prio];
+	}
 }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2358,7 +2366,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 			p->static_prio = NICE_TO_PRIO(0);
 
 		p->prio = p->normal_prio = __normal_prio(p);
-		set_load_weight(p);
+		set_load_weight(p, false);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -3805,7 +3813,7 @@ void set_user_nice(struct task_struct *p, long nice)
 		put_prev_task(rq, p);
 
 	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p);
+	set_load_weight(p, true);
 	old_prio = p->prio;
 	p->prio = effective_prio(p);
 	delta = p->prio - old_prio;
@@ -3962,7 +3970,7 @@ static void __setscheduler_params(struct task_struct *p,
 	 */
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
-	set_load_weight(p);
+	set_load_weight(p, true);
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
@@ -5933,7 +5941,7 @@ void __init sched_init(void)
 		atomic_set(&rq->nr_iowait, 0);
 	}
 
-	set_load_weight(&init_task);
+	set_load_weight(&init_task, false);
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 750ae4d..d8c02b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2776,6 +2776,43 @@ static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
+static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			    unsigned long weight)
+{
+	if (se->on_rq) {
+		/* commit outstanding execution time */
+		if (cfs_rq->curr == se)
+			update_curr(cfs_rq);
+		account_entity_dequeue(cfs_rq, se);
+		dequeue_runnable_load_avg(cfs_rq, se);
+	}
+	dequeue_load_avg(cfs_rq, se);
+
+	update_load_set(&se->load, weight);
+
+#ifdef CONFIG_SMP
+	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum,
+				   LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
+#endif
+
+	enqueue_load_avg(cfs_rq, se);
+	if (se->on_rq) {
+		account_entity_enqueue(cfs_rq, se);
+		enqueue_runnable_load_avg(cfs_rq, se);
+	}
+}
+
+void reweight_task(struct task_struct *p, int prio)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct load_weight *load = &se->load;
+	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
+
+	reweight_entity(cfs_rq, se, weight);
+	load->inv_weight = sched_prio_to_wmult[prio];
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
 /*
@@ -2878,32 +2915,6 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq)
 }
 # endif /* CONFIG_SMP */
 
-static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight)
-{
-	if (se->on_rq) {
-		/* commit outstanding execution time */
-		if (cfs_rq->curr == se)
-			update_curr(cfs_rq);
-		account_entity_dequeue(cfs_rq, se);
-		dequeue_runnable_load_avg(cfs_rq, se);
-	}
-	dequeue_load_avg(cfs_rq, se);
-
-	update_load_set(&se->load, weight);
-
-#ifdef CONFIG_SMP
-	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum,
-				   LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
-#endif
-
-	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq) {
-		account_entity_enqueue(cfs_rq, se);
-		enqueue_runnable_load_avg(cfs_rq, se);
-	}
-}
-
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 
 static void update_cfs_shares(struct sched_entity *se)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14db76c..a5d9746 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1529,6 +1529,8 @@ extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
 
+extern void reweight_task(struct task_struct *p, int prio);
+
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread
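
Taken together, the hunks above give roughly the following nice-change path (a
condensed sketch, not the full functions; locking and the idle-policy special
case are omitted):

	void set_user_nice(struct task_struct *p, long nice)
	{
		/* ... dequeue/put_prev as in the full function ... */
		p->static_prio = NICE_TO_PRIO(nice);
		set_load_weight(p, true);	/* update_load == true on a nice change */
		/* ... requeue and preemption check ... */
	}

	static void set_load_weight(struct task_struct *p, bool update_load)
	{
		int prio = p->static_prio - MAX_RT_PRIO;

		if (update_load && p->sched_class == &fair_sched_class) {
			/* CFS task: reweight in place so the load_avg sums stay consistent */
			reweight_task(p, prio);
		} else {
			p->se.load.weight = scale_load(sched_prio_to_weight[prio]);
			p->se.load.inv_weight = sched_prio_to_wmult[prio];
		}
	}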

end of thread, other threads:[~2017-09-29 20:16 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-12 16:44 [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 01/14] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
2017-05-16  8:02   ` Vincent Guittot
2017-05-12 16:44 ` [RFC][PATCH 02/14] sched/fair: Add comment to calc_cfs_shares() Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 03/14] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
2017-05-17  7:04   ` Vincent Guittot
2017-05-17  9:50     ` Vincent Guittot
2017-05-17 14:20       ` Peter Zijlstra
2017-09-29 20:11       ` [tip:sched/core] sched/fair: Use reweight_entity() for set_user_nice() tip-bot for Vincent Guittot
2017-05-12 16:44 ` [RFC][PATCH 04/14] sched/fair: More accurate reweight_entity() Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 05/14] sched/fair: Change update_load_avg() arguments Peter Zijlstra
2017-05-17 10:46   ` Vincent Guittot
2017-05-12 16:44 ` [RFC][PATCH 06/14] sched/fair: Move enqueue migrate handling Peter Zijlstra
2017-05-29 13:41   ` Vincent Guittot
2017-05-12 16:44 ` [RFC][PATCH 07/14] sched/fair: Rewrite cfs_rq->removed_*avg Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 08/14] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 09/14] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 10/14] sched/fair: more obvious Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 11/14] sched/fair: Synchronous PELT detach on load-balance migrate Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 12/14] sched/fair: Cure calc_cfs_shares() vs reweight_entity() Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 13/14] sched/fair: Align PELT windows between cfs_rq and its se Peter Zijlstra
2017-05-12 16:44 ` [RFC][PATCH 14/14] sched/fair: More accurate async detach Peter Zijlstra
2017-05-16 22:02 ` [RFC][PATCH 00/14] sched/fair: A bit of a cgroup/PELT overhaul (again) Tejun Heo
2017-05-17  6:53   ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).