* [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
@ 2017-09-01 13:20 Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 01/18] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
                   ` (17 more replies)
  0 siblings, 18 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:20 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

Hi all,

New this time is that Josef identified and fixed a problem he had with it. He
also provided many compile fixes. Meanwhile I managed to reflow the patches so
that the whole thing now looks like a normal patch series.

Other than that, it should still very much be what it was before. From the
previous announcement:

---

So after staring at all that PELT stuff and working my way through it again:

  https://lkml.kernel.org/r/20170505154117.6zldxuki2fgyo53n@hirez.programming.kicks-ass.net

I started doing some patches to fix some of the identified breakage.

So here are a few too many patches that do:

 - fix 'reweight_entity' to instantly propagate the change in se->load.weight.

 - rewrite/fix the propagate on migrate (attach/detach)

 - introduce the hierarchical runnable_load_avg, as proposed by Tejun.

 - synchronous detach for runnable migrates

 - aligns the PELT windows between a cfs_rq and all its se's

 - deals with random fallout from the above (some of this needs folding back
   and reordering, but it's all well past the point I should post this anyway).


IIRC pjt recently mentioned the reweight_entity thing, and I have very vague
memories he once talked about the window alignment thing -- which I only
remembered after (very painfully) having discovered I really needed that.

In any case, the reason I did the reweight_entity thing first is that I feel
that is the right place to also propagate the hierarchical runnable_load,
as that is the natural place where a group's cfs_rq is coupled to its
sched_entity.

And the hierarchical runnable_load needs that coupling. TJ did it by hijacking
the attach/detach migrate code, which I didn't much like.  In any case, all
that got me looking at said attach/detach migrate code and finding pain. So I
went and fixed that too.

---

Still very many thanks to Vincent, Dietmar and Josef for poking at these patches.

If nothing untoward happens, I plan to get them merged after the imminent
merge window.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 01/18] sched/fair: Clean up calc_cfs_shares()
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares() Peter Zijlstra
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-cleanup-update_cfs_shares.patch --]
[-- Type: text/plain, Size: 2179 bytes --]

For consistency's sake, we should have only a single reading of
tg->shares.
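
Purely as an illustration of the pattern (a standalone userspace sketch, not
the kernel code; clamp_t() and MIN_SHARES below are simplified local
stand-ins and demo_calc_shares() is made up): one snapshot of the shares
value is used both to scale and to clamp, so the two steps can never disagree.

  #include <stdio.h>

  /* Simplified clamp; the real kernel macro avoids multiple evaluation. */
  #define clamp_t(type, val, lo, hi) \
          ((type)(val) < (type)(lo) ? (type)(lo) : \
           (type)(val) > (type)(hi) ? (type)(hi) : (type)(val))
  #define MIN_SHARES 2

  static long demo_calc_shares(long tg_shares, long load, long tg_weight)
  {
          /* tg_shares is read once by the caller and used for both steps. */
          long shares = tg_shares * load;

          if (tg_weight)
                  shares /= tg_weight;

          return clamp_t(long, shares, MIN_SHARES, tg_shares);
  }

  int main(void)
  {
          printf("%ld\n", demo_calc_shares(1024, 512, 2048)); /* prints 256 */
          return 0;
  }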

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2633,9 +2633,12 @@ account_entity_dequeue(struct cfs_rq *cf
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static long calc_cfs_shares(struct cfs_rq *cfs_rq)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
+	struct task_group *tg = cfs_rq->tg;
+
+	tg_shares = READ_ONCE(tg->shares);
 
 	/*
 	 * This really should be: cfs_rq->avg.load_avg, but instead we use
@@ -2650,7 +2653,7 @@ static long calc_cfs_shares(struct cfs_r
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
 	tg_weight += load;
 
-	shares = (tg->shares * load);
+	shares = (tg_shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
@@ -2666,17 +2669,7 @@ static long calc_cfs_shares(struct cfs_r
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
-}
-# else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
-{
-	return tg->shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # endif /* CONFIG_SMP */
 
@@ -2701,7 +2694,6 @@ static inline int throttled_hierarchy(st
 static void update_cfs_shares(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-	struct task_group *tg;
 	long shares;
 
 	if (!cfs_rq)
@@ -2710,13 +2702,14 @@ static void update_cfs_shares(struct sch
 	if (throttled_hierarchy(cfs_rq))
 		return;
 
-	tg = cfs_rq->tg;
-
 #ifndef CONFIG_SMP
-	if (likely(se->load.weight == tg->shares))
+	shares = READ_ONCE(cfs_rq->tg->shares);
+
+	if (likely(se->load.weight == shares))
 		return;
+#else
+	shares = calc_cfs_shares(cfs_rq);
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 01/18] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-28 10:03   ` Morten Rasmussen
  2017-09-01 13:21 ` [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity() Peter Zijlstra
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-comment-calc_cfs_shares.patch --]
[-- Type: text/plain, Size: 2838 bytes --]

Explain the magic equation in calc_cfs_shares() a bit better.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2707,6 +2707,67 @@ account_entity_dequeue(struct cfs_rq *cf
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
+/*
+ * All this does is approximate the hierarchical proportion which includes that
+ * global sum we all love to hate.
+ *
+ * That is, the weight of a group entity is the proportional share of the
+ * group weight based on the group runqueue weights. That is:
+ *
+ *                     tg->weight * grq->load.weight
+ *   ge->load.weight = -----------------------------               (1)
+ *			  \Sum grq->load.weight
+ *
+ * Now, because that sum is prohibitively expensive to compute (been
+ * there, done that) we approximate it with this average stuff. The average
+ * moves slower and therefore the approximation is cheaper and more stable.
+ *
+ * So instead of the above, we substitute:
+ *
+ *   grq->load.weight -> grq->avg.load_avg                         (2)
+ *
+ * which yields the following:
+ *
+ *                     tg->weight * grq->avg.load_avg
+ *   ge->load.weight = ------------------------------              (3)
+ *				tg->load_avg
+ *
+ * Where: tg->load_avg ~= \Sum grq->avg.load_avg
+ *
+ * That is shares_avg, and it is right (given the approximation (2)).
+ *
+ * The problem with it is that because the average is slow -- it was designed
+ * to be exactly that of course -- this leads to transients in boundary
+ * conditions. Specifically, the case where the group was idle and we start the
+ * one task. It takes time for our CPU's grq->avg.load_avg to build up,
+ * yielding bad latency etc..
+ *
+ * Now, in that special case (1) reduces to:
+ *
+ *                     tg->weight * grq->load.weight
+ *   ge->load.weight = ----------------------------- = tg->weight  (4)
+ *			    grq->load.weight
+ *
+ * That is, the sum collapses because all other CPUs are idle; the UP scenario.
+ *
+ * So what we do is modify our approximation (3) to approach (4) in the (near)
+ * UP case, like:
+ *
+ *   ge->load.weight =
+ *
+ *              tg->weight * grq->load.weight
+ *     ---------------------------------------------------         (5)
+ *     tg->load_avg - grq->avg.load_avg + grq->load.weight
+ *
+ *
+ * And that is shares_weight and is icky. In the (near) UP case it approaches
+ * (4) while in the normal case it approaches (3). It consistently
+ * overestimates the ge->load.weight and therefore:
+ *
+ *   \Sum ge->load.weight >= tg->weight
+ *
+ * hence icky!
+ */
 static long calc_cfs_shares(struct cfs_rq *cfs_rq)
 {
 	long tg_weight, tg_shares, load, shares;
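
A quick worked example of why (5) behaves better than (3) when a group wakes
up from idle (all numbers below are made up purely for illustration):

  tg->weight        = 1024
  grq->load.weight  = 1024   (one task just woke up on this CPU)
  grq->avg.load_avg =  128   (its average has not ramped up yet)
  tg->load_avg      = 1024   (128 local + 896 stale blocked load elsewhere)

  (3): 1024 * 128  /  1024                ~= 128
  (5): 1024 * 1024 / (1024 - 128 + 1024)  ~= 546

So (5) immediately hands the group entity a reasonable share of the weight,
and once the remote blocked load has decayed it collapses to (4), i.e. the
full tg->weight.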

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity()
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 01/18] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares() Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-29  9:04   ` Morten Rasmussen
  2017-09-01 13:21 ` [PATCH -v2 04/18] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-calc_cfs_shares-fixup.patch --]
[-- Type: text/plain, Size: 1297 bytes --]

Vincent reported that when running in a cgroup, his root
cfs_rq->avg.load_avg dropped to 0 on task idle.

This is because reweight_entity() will now immediately propagate the
weight change of the group entity to its cfs_rq, and as it happens,
our approximation (5) for calc_cfs_shares() results in 0 when the group
is idle.

Avoid this by using the correct (3) as a lower bound on (5). This way
the empty cgroup will slowly decay instead of instantly drop to 0.
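
Concretely: when the last task dequeues, grq->load.weight goes to 0, so the
numerator of (5) is 0 and reweight_entity() would immediately strip the group
entity's contribution from its parent. With

  load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

the just-blocked load_avg (say 512, made up for illustration) is still used,
so the result follows (3) and decays gradually along with the average.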

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2703,11 +2703,10 @@ static long calc_cfs_shares(struct cfs_r
 	tg_shares = READ_ONCE(tg->shares);
 
 	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
+	 * Because (5) drops to 0 when the cfs_rq is idle, we need to use (3)
+	 * as a lower bound.
 	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 04/18] sched/fair: Remove se->load.weight from se->avg.load_sum
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity() Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-29 15:26   ` Morten Rasmussen
  2017-09-01 13:21 ` [PATCH -v2 05/18] sched/fair: Change update_load_avg() arguments Peter Zijlstra
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-unweight-entity.patch --]
[-- Type: text/plain, Size: 6482 bytes --]

Remove the load from the load_sum for sched_entities, basically
turning load_sum into runnable_sum.  This prepares for better
reweighting of group entities.

Since we now have different rules for computing load_avg, split
___update_load_avg() into two parts, ___update_load_sum() and
___update_load_avg().

So for se:

  ___update_load_sum(.weight = 1)
  ___update_load_avg(.weight = se->load.weight)

and for cfs_rq:

  ___update_load_sum(.weight = cfs_rq->load.weight)
  ___update_load_avg(.weight = 1)

Since the primary consumable is load_avg, most things will not be
affected. Only those few sites that initialize/modify load_sum need
attention.
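
To illustrate with rough, made-up numbers: a task with weight 2048 that is
runnable about half the time used to end up with

  load_sum ~= 0.5 * 2048 * LOAD_AVG_MAX,   load_avg ~= 1024

whereas after this patch it ends up with

  load_sum ~= 0.5 * LOAD_AVG_MAX,          load_avg ~= 2048 * 0.5 = 1024

i.e. load_avg comes out the same, only the intermediate load_sum no longer
has the weight folded in.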

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   91 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 64 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -744,7 +744,7 @@ void init_entity_runnable_average(struct
 	 */
 	if (entity_is_task(se))
 		sa->load_avg = scale_load_down(se->load.weight);
-	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+	sa->load_sum = LOAD_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
@@ -1967,7 +1967,7 @@ static u64 numa_get_avg_runtime(struct t
 		delta = runtime - p->last_sum_exec_runtime;
 		*period = now - p->last_task_numa_placement;
 	} else {
-		delta = p->se.avg.load_sum / p->se.load.weight;
+		delta = p->se.avg.load_sum;
 		*period = LOAD_AVG_MAX;
 	}
 
@@ -2872,8 +2872,8 @@ accumulate_sum(u64 delta, int cpu, struc
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
+___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta;
 
@@ -2907,39 +2907,80 @@ ___update_load_avg(u64 now, int cpu, str
 	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
 		return 0;
 
+	return 1;
+}
+
+static __always_inline void
+___update_load_avg(struct sched_avg *sa, unsigned long weight, struct cfs_rq *cfs_rq)
+{
+	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
+
 	/*
 	 * Step 2: update *_avg.
 	 */
-	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
+	sa->load_avg = div_u64(weight * sa->load_sum, divider);
 	if (cfs_rq) {
 		cfs_rq->runnable_load_avg =
-			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
+			div_u64(cfs_rq->runnable_load_sum, divider);
 	}
-	sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);
+	sa->util_avg = sa->util_sum / divider;
+}
 
-	return 1;
+/*
+ * XXX we want to get rid of this helper and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+	return scale_load_down(se->load.weight);
 }
 
+/*
+ * sched_entity:
+ *
+ *   load_sum := runnable_sum
+ *   load_avg = se_weight(se) * runnable_avg
+ *
+ * cfs_rq:
+ *
+ *   load_sum = \Sum se_weight(se) * se->avg.load_sum
+ *   load_avg = \Sum se->avg.load_avg
+ */
+
 static int
 __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
 {
-	return ___update_load_avg(now, cpu, &se->avg, 0, 0, NULL);
+	if (___update_load_sum(now, cpu, &se->avg, 0, 0, NULL)) {
+		___update_load_avg(&se->avg, se_weight(se), NULL);
+		return 1;
+	}
+
+	return 0;
 }
 
 static int
 __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return ___update_load_avg(now, cpu, &se->avg,
-				  se->on_rq * scale_load_down(se->load.weight),
-				  cfs_rq->curr == se, NULL);
+	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq,
+				cfs_rq->curr == se, NULL)) {
+
+		___update_load_avg(&se->avg, se_weight(se), NULL);
+		return 1;
+	}
+
+	return 0;
 }
 
 static int
 __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 {
-	return ___update_load_avg(now, cpu, &cfs_rq->avg,
-			scale_load_down(cfs_rq->load.weight),
-			cfs_rq->curr != NULL, cfs_rq);
+	if (___update_load_sum(now, cpu, &cfs_rq->avg,
+				scale_load_down(cfs_rq->load.weight),
+				cfs_rq->curr != NULL, cfs_rq)) {
+		___update_load_avg(&cfs_rq->avg, 1, cfs_rq);
+		return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -3110,7 +3151,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
 
 	/* Set new sched_entity's load */
 	se->avg.load_avg = load;
-	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
+	se->avg.load_sum = LOAD_AVG_MAX;
 
 	/* Update parent cfs_rq load */
 	add_positive(&cfs_rq->avg.load_avg, delta);
@@ -3340,7 +3381,7 @@ static void attach_entity_load_avg(struc
 {
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
-	cfs_rq->avg.load_sum += se->avg.load_sum;
+	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
 	set_tg_cfs_propagate(cfs_rq);
@@ -3360,7 +3401,7 @@ static void detach_entity_load_avg(struc
 {
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
-	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 	set_tg_cfs_propagate(cfs_rq);
@@ -3372,12 +3413,10 @@ static void detach_entity_load_avg(struc
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct sched_avg *sa = &se->avg;
-
-	cfs_rq->runnable_load_avg += sa->load_avg;
-	cfs_rq->runnable_load_sum += sa->load_sum;
+	cfs_rq->runnable_load_avg += se->avg.load_avg;
+	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
 
-	if (!sa->last_update_time) {
+	if (!se->avg.last_update_time) {
 		attach_entity_load_avg(cfs_rq, se);
 		update_tg_load_avg(cfs_rq, 0);
 	}
@@ -3387,10 +3426,8 @@ enqueue_entity_load_avg(struct cfs_rq *c
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_load_avg =
-		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
-	cfs_rq->runnable_load_sum =
-		max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
+	sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
 }
 
 #ifndef CONFIG_64BIT

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 05/18] sched/fair: Change update_load_avg() arguments
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 04/18] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 06/18] sched/fair: Move enqueue migrate handling Peter Zijlstra
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-update_load_avg-args.patch --]
[-- Type: text/plain, Size: 4635 bytes --]

Most call sites of update_load_avg() already have cfs_rq_of(se)
available, pass it down instead of recomputing it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3498,9 +3498,8 @@ update_cfs_rq_load_avg(u64 now, struct c
 #define SKIP_AGE_LOAD	0x2
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int flags)
+static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
@@ -3661,9 +3660,9 @@ update_cfs_rq_load_avg(u64 now, struct c
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 
-static inline void update_load_avg(struct sched_entity *se, int not_used1)
+static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
-	cfs_rq_util_change(cfs_rq_of(se));
+	cfs_rq_util_change(cfs_rq);
 }
 
 static inline void
@@ -3814,7 +3813,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
-	update_load_avg(se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, UPDATE_TG);
 	enqueue_entity_load_avg(cfs_rq, se);
 	update_cfs_shares(se);
 	account_entity_enqueue(cfs_rq, se);
@@ -3898,7 +3897,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
-	update_load_avg(se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, UPDATE_TG);
 	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
@@ -3986,7 +3985,7 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -4088,7 +4087,7 @@ static void put_prev_entity(struct cfs_r
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_load_avg(cfs_rq, prev, 0);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -4104,7 +4103,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, UPDATE_TG);
+	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_shares(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -5022,7 +5021,7 @@ enqueue_task_fair(struct rq *rq, struct
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_shares(se);
 	}
 
@@ -5081,7 +5080,7 @@ static void dequeue_task_fair(struct rq
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_shares(se);
 	}
 
@@ -7044,7 +7043,7 @@ static void update_blocked_averages(int
 		/* Propagate pending load changes to the parent, if any: */
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
-			update_load_avg(se, 0);
+			update_load_avg(cfs_rq_of(se), se, 0);
 
 		/*
 		 * There can be a lot of idle CPU cgroups.  Don't let fully
@@ -9185,7 +9184,7 @@ static void propagate_entity_cfs_rq(stru
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, UPDATE_TG);
+		update_load_avg(cfs_rq, se, UPDATE_TG);
 	}
 }
 #else
@@ -9197,7 +9196,7 @@ static void detach_entity_cfs_rq(struct
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_load_avg(se, 0);
+	update_load_avg(cfs_rq, se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 	propagate_entity_cfs_rq(se);
@@ -9216,7 +9215,7 @@ static void attach_entity_cfs_rq(struct
 #endif
 
 	/* Synchronize entity with its cfs_rq */
-	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
+	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 	propagate_entity_cfs_rq(se);
@@ -9500,7 +9499,7 @@ int sched_group_set_shares(struct task_g
 		rq_lock_irqsave(rq, &rf);
 		update_rq_clock(rq);
 		for_each_sched_entity(se) {
-			update_load_avg(se, UPDATE_TG);
+			update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
 			update_cfs_shares(se);
 		}
 		rq_unlock_irqrestore(rq, &rf);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 06/18] sched/fair: Move enqueue migrate handling
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 05/18] sched/fair: Change update_load_avg() arguments Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 07/18] sched/fair: Rename {en,de}queue_entity_load_avg() Peter Zijlstra
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-pull-migrate-into-update_load_avg.patch --]
[-- Type: text/plain, Size: 4074 bytes --]

Move the entity migrate handling from enqueue_entity_load_avg() to
update_load_avg(). This has two benefits:

 - {en,de}queue_entity_load_avg() will become purely about managing
   runnable_load

 - we can avoid a double update_tg_load_avg() and reduce pressure on
   the global tg->shares cacheline

The reason we do this is so that we can change update_cfs_shares() to
change both weight and (future) runnable_weight. For this to work we
need to have the cfs_rq averages up-to-date (which means having done
the attach), but we need the cfs_rq->avg.runnable_avg to not yet
include the se's contribution (since se->on_rq == 0).
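
The resulting enqueue_entity() sequence (taken from the hunk below, with
comments added) then does the attach, when needed, before the runnable load
is added:

  update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
          /* age cfs_rq->avg; if se has no last_update_time (new or just
             migrated in) attach it here and do update_tg_load_avg() once */
  enqueue_entity_load_avg(cfs_rq, se);
          /* now purely about adding se's runnable load */
  update_cfs_shares(se);
  account_entity_enqueue(cfs_rq, se);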

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   70 ++++++++++++++++++++++++++--------------------------
 1 file changed, 36 insertions(+), 34 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3491,34 +3491,6 @@ update_cfs_rq_load_avg(u64 now, struct c
 	return decayed || removed_load;
 }
 
-/*
- * Optional action to be done while updating the load average
- */
-#define UPDATE_TG	0x1
-#define SKIP_AGE_LOAD	0x2
-
-/* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
-{
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	struct rq *rq = rq_of(cfs_rq);
-	int cpu = cpu_of(rq);
-	int decayed;
-
-	/*
-	 * Track task load average for carrying it to new CPU after migrated, and
-	 * track group sched_entity load average for task_h_load calc in migration
-	 */
-	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
-		__update_load_avg_se(now, cpu, cfs_rq, se);
-
-	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
-	decayed |= propagate_entity_load_avg(se);
-
-	if (decayed && (flags & UPDATE_TG))
-		update_tg_load_avg(cfs_rq, 0);
-}
-
 /**
  * attach_entity_load_avg - attach this entity to its cfs_rq load avg
  * @cfs_rq: cfs_rq to attach to
@@ -3559,17 +3531,46 @@ static void detach_entity_load_avg(struc
 	cfs_rq_util_change(cfs_rq);
 }
 
+/*
+ * Optional action to be done while updating the load average
+ */
+#define UPDATE_TG	0x1
+#define SKIP_AGE_LOAD	0x2
+#define DO_ATTACH	0x4
+
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	u64 now = cfs_rq_clock_task(cfs_rq);
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
+	int decayed;
+
+	/*
+	 * Track task load average for carrying it to new CPU after migrated, and
+	 * track group sched_entity load average for task_h_load calc in migration
+	 */
+	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+		__update_load_avg_se(now, cpu, cfs_rq, se);
+
+	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
+	decayed |= propagate_entity_load_avg(se);
+
+	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
+
+		attach_entity_load_avg(cfs_rq, se);
+		update_tg_load_avg(cfs_rq, 0);
+
+	} else if (decayed && (flags & UPDATE_TG))
+		update_tg_load_avg(cfs_rq, 0);
+}
+
 /* Add the load generated by se into cfs_rq's load average */
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	cfs_rq->runnable_load_avg += se->avg.load_avg;
 	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
-
-	if (!se->avg.last_update_time) {
-		attach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq, 0);
-	}
 }
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
@@ -3659,6 +3660,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
+#define DO_ATTACH	0x0
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
@@ -3813,7 +3815,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	enqueue_entity_load_avg(cfs_rq, se);
 	update_cfs_shares(se);
 	account_entity_enqueue(cfs_rq, se);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 07/18] sched/fair: Rename {en,de}queue_entity_load_avg()
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 06/18] sched/fair: Move enqueue migrate handling Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 08/18] sched/fair: Introduce {en,de}queue_load_avg() Peter Zijlstra
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-rename-enqueue_entity_load_avg.patch --]
[-- Type: text/plain, Size: 2237 bytes --]

Since they're now purely about runnable_load, rename them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3561,7 +3561,7 @@ static inline void update_load_avg(struc
 
 /* Add the load generated by se into cfs_rq's load average */
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	cfs_rq->runnable_load_avg += se->avg.load_avg;
 	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
@@ -3569,7 +3569,7 @@ enqueue_entity_load_avg(struct cfs_rq *c
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
@@ -3662,9 +3662,9 @@ static inline void update_load_avg(struc
 }
 
 static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
 
 static inline void
@@ -3810,7 +3810,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_runnable_load_avg(cfs_rq, se);
 	update_cfs_shares(se);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -3894,7 +3894,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_runnable_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 08/18] sched/fair: Introduce {en,de}queue_load_avg()
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 07/18] sched/fair: Rename {en,de}queue_entity_load_avg() Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 09/18] sched/fair: More accurate reweight_entity() Peter Zijlstra
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-add-enqueue_load_avg.patch --]
[-- Type: text/plain, Size: 7383 bytes --]

Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
for {en,de}queue_load_avg(). More users will follow.

Includes some code movement to avoid fwd declarations.
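
As a worked note on the sub_positive() helper these build on (moved in the
hunk below): with unsigned fields, removing more than is present -- say
load_avg = 100 and a removal of 150 -- would wrap around; "res = var - val"
followed by the "res > var" check catches that, and the single WRITE_ONCE()
publishes 0, so lockless readers never observe the wrapped intermediate.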

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |  156 ++++++++++++++++++++++++++++------------------------
 1 file changed, 86 insertions(+), 70 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2705,6 +2705,90 @@ account_entity_dequeue(struct cfs_rq *cf
 	cfs_rq->nr_running--;
 }
 
+/*
+ * Signed add and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define add_positive(_ptr, _val) do {                           \
+	typeof(_ptr) ptr = (_ptr);                              \
+	typeof(_val) val = (_val);                              \
+	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
+								\
+	res = var + val;                                        \
+								\
+	if (val < 0 && res > var)                               \
+		res = 0;                                        \
+								\
+	WRITE_ONCE(*ptr, res);                                  \
+} while (0)
+
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
+#ifdef CONFIG_SMP
+/*
+ * XXX we want to get rid of this helper and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+	return scale_load_down(se->load.weight);
+}
+
+static inline void
+enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	cfs_rq->runnable_load_avg += se->avg.load_avg;
+	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
+}
+
+static inline void
+dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
+}
+
+static inline void
+enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	cfs_rq->avg.load_avg += se->avg.load_avg;
+	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+}
+
+static inline void
+dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+}
+#else
+static inline void
+enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+static inline void
+dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+static inline void
+enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+static inline void
+dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
 /*
@@ -3097,14 +3181,6 @@ ___update_load_avg(struct sched_avg *sa,
 }
 
 /*
- * XXX we want to get rid of this helper and use the full load resolution.
- */
-static inline long se_weight(struct sched_entity *se)
-{
-	return scale_load_down(se->load.weight);
-}
-
-/*
  * sched_entity:
  *
  *   load_sum := runnable_sum
@@ -3153,26 +3229,6 @@ __update_load_avg_cfs_rq(u64 now, int cp
 	return 0;
 }
 
-/*
- * Signed add and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define add_positive(_ptr, _val) do {                           \
-	typeof(_ptr) ptr = (_ptr);                              \
-	typeof(_val) val = (_val);                              \
-	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
-								\
-	res = var + val;                                        \
-								\
-	if (val < 0 && res > var)                               \
-		res = 0;                                        \
-								\
-	WRITE_ONCE(*ptr, res);                                  \
-} while (0)
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /**
  * update_tg_load_avg - update the tg's load avg
@@ -3417,23 +3473,6 @@ static inline void set_tg_cfs_propagate(
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/*
- * Unsigned subtract and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define sub_positive(_ptr, _val) do {				\
-	typeof(_ptr) ptr = (_ptr);				\
-	typeof(*ptr) val = (_val);				\
-	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
-	res = var - val;					\
-	if (res > var)						\
-		res = 0;					\
-	WRITE_ONCE(*ptr, res);					\
-} while (0)
-
 /**
  * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
  * @now: current time, as per cfs_rq_clock_task()
@@ -3496,8 +3535,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
-	cfs_rq->avg.load_avg += se->avg.load_avg;
-	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
 	set_tg_cfs_propagate(cfs_rq);
@@ -3515,9 +3553,7 @@ static void attach_entity_load_avg(struc
  */
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-
-	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
-	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+	dequeue_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 	set_tg_cfs_propagate(cfs_rq);
@@ -3559,22 +3595,6 @@ static inline void update_load_avg(struc
 		update_tg_load_avg(cfs_rq, 0);
 }
 
-/* Add the load generated by se into cfs_rq's load average */
-static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_load_avg += se->avg.load_avg;
-	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
-}
-
-/* Remove the runnable load generated by se from cfs_rq's runnable load average */
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
-	sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
-}
-
 #ifndef CONFIG_64BIT
 static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
 {
@@ -3661,10 +3681,6 @@ static inline void update_load_avg(struc
 	cfs_rq_util_change(cfs_rq);
 }
 
-static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
 
 static inline void

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 09/18] sched/fair: More accurate reweight_entity()
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 08/18] sched/fair: Introduce {en,de}queue_load_avg() Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 10/18] sched/fair: Use reweight_entity() for set_user_nice() Peter Zijlstra
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fix-reweight_entity.patch --]
[-- Type: text/plain, Size: 1464 bytes --]

When a (group) entity changes its weight we should instantly change
its load_avg and propagate that change into the sums it is part of,
because we use these values to predict future behaviour and are not
interested in their historical value.

Without this change, the change in load would need to propagate
through the average, by which time it could again have changed etc..
always chasing itself.

With this change, the cfs_rq load_avg sum will more accurately reflect
the current runnable and expected return of blocked load.
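
Rough example (made-up numbers): a group entity that is runnable about half
the time with weight 1024 carries load_avg ~= 512. If its weight changes to
2048, the reweight now recomputes

  se->avg.load_avg ~= se_weight(se) * se->avg.load_sum / divider ~= 1024

on the spot and feeds the +512 delta straight into the cfs_rq sums, instead
of letting the averages drift towards the new value over the following
periods (by which time the weight may have changed again).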

[josef: compile fix !SMP || !FAIR_GROUP]
Reported-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2900,12 +2900,22 @@ static void reweight_entity(struct cfs_r
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
 		account_entity_dequeue(cfs_rq, se);
+		dequeue_runnable_load_avg(cfs_rq, se);
 	}
+	dequeue_load_avg(cfs_rq, se);
 
 	update_load_set(&se->load, weight);
 
-	if (se->on_rq)
+#ifdef CONFIG_SMP
+	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum,
+				   LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
+#endif
+
+	enqueue_load_avg(cfs_rq, se);
+	if (se->on_rq) {
 		account_entity_enqueue(cfs_rq, se);
+		enqueue_runnable_load_avg(cfs_rq, se);
+	}
 }
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 10/18] sched/fair: Use reweight_entity() for set_user_nice()
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (8 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 09/18] sched/fair: More accurate reweight_entity() Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 11/18] sched/fair: Rewrite cfs_rq->removed_*avg Peter Zijlstra
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: vincent_guittot-sched_fair-remove_se-_load_weight_from_se-_avg_load_sum.patch --]
[-- Type: text/plain, Size: 5076 bytes --]

From: Vincent Guittot <vincent.guittot@linaro.org>

Now that we directly change load_avg and propagate that change into
the sums, sys_nice() and co should do the same, otherwise it's possible
to confuse load accounting when we migrate near the weight change.
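
The resulting call chain (as per the hunks below) becomes:

  set_user_nice() / __setscheduler_params()
      -> set_load_weight(p, true)
           -> reweight_task(p, prio)   /* fair_sched_class tasks only */
                -> reweight_entity(cfs_rq_of(&p->se), &p->se,
                                   scale_load(sched_prio_to_weight[prio]))

while sched_fork() and sched_init() keep the plain assignment via
set_load_weight(p, false).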

[peterz: Changelog, call condition]
[josef: fixed runnable and !SMP compilation]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
---
 kernel/sched/core.c  |   22 ++++++++++++-----
 kernel/sched/fair.c  |   63 +++++++++++++++++++++++++++++----------------------
 kernel/sched/sched.h |    2 +
 3 files changed, 54 insertions(+), 33 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -733,7 +733,7 @@ int tg_nop(struct task_group *tg, void *
 }
 #endif
 
-static void set_load_weight(struct task_struct *p)
+static void set_load_weight(struct task_struct *p, bool update_load)
 {
 	int prio = p->static_prio - MAX_RT_PRIO;
 	struct load_weight *load = &p->se.load;
@@ -747,8 +747,16 @@ static void set_load_weight(struct task_
 		return;
 	}
 
-	load->weight = scale_load(sched_prio_to_weight[prio]);
-	load->inv_weight = sched_prio_to_wmult[prio];
+	/*
+	 * SCHED_OTHER tasks have to update their load when changing their
+	 * weight
+	 */
+	if (update_load && p->sched_class == &fair_sched_class) {
+		reweight_task(p, prio);
+	} else {
+		load->weight = scale_load(sched_prio_to_weight[prio]);
+		load->inv_weight = sched_prio_to_wmult[prio];
+	}
 }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2356,7 +2364,7 @@ int sched_fork(unsigned long clone_flags
 			p->static_prio = NICE_TO_PRIO(0);
 
 		p->prio = p->normal_prio = __normal_prio(p);
-		set_load_weight(p);
+		set_load_weight(p, false);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -3803,7 +3811,7 @@ void set_user_nice(struct task_struct *p
 		put_prev_task(rq, p);
 
 	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p);
+	set_load_weight(p, true);
 	old_prio = p->prio;
 	p->prio = effective_prio(p);
 	delta = p->prio - old_prio;
@@ -3960,7 +3968,7 @@ static void __setscheduler_params(struct
 	 */
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
-	set_load_weight(p);
+	set_load_weight(p, true);
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
@@ -5910,7 +5918,7 @@ void __init sched_init(void)
 		atomic_set(&rq->nr_iowait, 0);
 	}
 
-	set_load_weight(&init_task);
+	set_load_weight(&init_task, false);
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2789,6 +2789,43 @@ static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
+static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			    unsigned long weight)
+{
+	if (se->on_rq) {
+		/* commit outstanding execution time */
+		if (cfs_rq->curr == se)
+			update_curr(cfs_rq);
+		account_entity_dequeue(cfs_rq, se);
+		dequeue_runnable_load_avg(cfs_rq, se);
+	}
+	dequeue_load_avg(cfs_rq, se);
+
+	update_load_set(&se->load, weight);
+
+#ifdef CONFIG_SMP
+	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum,
+				   LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
+#endif
+
+	enqueue_load_avg(cfs_rq, se);
+	if (se->on_rq) {
+		account_entity_enqueue(cfs_rq, se);
+		enqueue_runnable_load_avg(cfs_rq, se);
+	}
+}
+
+void reweight_task(struct task_struct *p, int prio)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct load_weight *load = &se->load;
+	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
+
+	reweight_entity(cfs_rq, se, weight);
+	load->inv_weight = sched_prio_to_wmult[prio];
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
 /*
@@ -2892,32 +2929,6 @@ static long calc_cfs_shares(struct cfs_r
 }
 # endif /* CONFIG_SMP */
 
-static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight)
-{
-	if (se->on_rq) {
-		/* commit outstanding execution time */
-		if (cfs_rq->curr == se)
-			update_curr(cfs_rq);
-		account_entity_dequeue(cfs_rq, se);
-		dequeue_runnable_load_avg(cfs_rq, se);
-	}
-	dequeue_load_avg(cfs_rq, se);
-
-	update_load_set(&se->load, weight);
-
-#ifdef CONFIG_SMP
-	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum,
-				   LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
-#endif
-
-	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq) {
-		account_entity_enqueue(cfs_rq, se);
-		enqueue_runnable_load_avg(cfs_rq, se);
-	}
-}
-
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 
 static void update_cfs_shares(struct sched_entity *se)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1532,6 +1532,8 @@ extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
 
+extern void reweight_task(struct task_struct *p, int prio);
+
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 11/18] sched/fair: Rewrite cfs_rq->removed_*avg
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (9 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 10/18] sched/fair: Use reweight_entity() for set_user_nice() Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-update_tg_cfs_load.patch --]
[-- Type: text/plain, Size: 5301 bytes --]

Since on wakeup migration we don't hold the rq->lock for the old CPU
we cannot update its state. Instead we add the removed 'load' to an
atomic variable and have the next update on that CPU collect and
process it.

Currently we have two atomic variables, which already have the issue
that they can be read out of sync. Also, two atomic ops on a single
cacheline are already more expensive than an uncontended lock.

Since we want to add more, convert the thing over to an explicit
cacheline with a lock in.
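
The consumer side is the usual "swap out everything accumulated so far under
the lock, process it outside" pattern, the point being that load and util are
always read together and so can no longer be observed out of sync. A
standalone userspace sketch of that idea (pthread mutex instead of a
raw_spinlock; the struct and function names here are made up for the example):

  #include <pthread.h>
  #include <stdio.h>

  /* Minimal stand-in for the new cfs_rq->removed block. */
  struct removed {
          pthread_mutex_t lock;
          int nr;
          unsigned long load_avg;
          unsigned long util_avg;
  };

  /* Producer side: a departing entity adds its contribution. */
  static void remove_entity(struct removed *r, unsigned long load,
                            unsigned long util)
  {
          pthread_mutex_lock(&r->lock);
          r->nr++;
          r->load_avg += load;
          r->util_avg += util;
          pthread_mutex_unlock(&r->lock);
  }

  /* Consumer side: collect both values in one critical section. */
  static void collect(struct removed *r, unsigned long *load,
                      unsigned long *util)
  {
          pthread_mutex_lock(&r->lock);
          *load = r->load_avg;  r->load_avg = 0;
          *util = r->util_avg;  r->util_avg = 0;
          r->nr = 0;
          pthread_mutex_unlock(&r->lock);
  }

  int main(void)
  {
          struct removed r = { PTHREAD_MUTEX_INITIALIZER, 0, 0, 0 };
          unsigned long load, util;

          remove_entity(&r, 512, 128);
          remove_entity(&r, 256, 64);
          collect(&r, &load, &util);
          printf("removed: load=%lu util=%lu\n", load, util); /* 768 192 */
          return 0;
  }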

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    8 ++++----
 kernel/sched/fair.c  |   51 +++++++++++++++++++++++++++++++++++----------------
 kernel/sched/sched.h |   13 +++++++++----
 3 files changed, 48 insertions(+), 24 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -519,10 +519,10 @@ void print_cfs_rq(struct seq_file *m, in
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed_load_avg",
-			atomic_long_read(&cfs_rq->removed_load_avg));
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed_util_avg",
-			atomic_long_read(&cfs_rq->removed_util_avg));
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
+			cfs_rq->removed.load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
+			cfs_rq->removed.util_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3459,36 +3459,47 @@ static inline void set_tg_cfs_propagate(
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
+	unsigned long removed_load = 0, removed_util = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
-	int decayed, removed_load = 0, removed_util = 0;
+	int decayed = 0;
 
-	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
-		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
+	if (cfs_rq->removed.nr) {
+		unsigned long r;
+
+		raw_spin_lock(&cfs_rq->removed.lock);
+		swap(cfs_rq->removed.util_avg, removed_util);
+		swap(cfs_rq->removed.load_avg, removed_load);
+		cfs_rq->removed.nr = 0;
+		raw_spin_unlock(&cfs_rq->removed.lock);
+
+		/*
+		 * The LOAD_AVG_MAX for _sum is a slight over-estimate,
+		 * which is safe due to sub_positive() clipping at 0.
+		 */
+		r = removed_load;
 		sub_positive(&sa->load_avg, r);
 		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
-		removed_load = 1;
-		set_tg_cfs_propagate(cfs_rq);
-	}
 
-	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
-		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
+		r = removed_util;
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
-		removed_util = 1;
+
 		set_tg_cfs_propagate(cfs_rq);
+
+		decayed = 1;
 	}
 
-	decayed = __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);
+	decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (decayed || removed_util)
+	if (decayed)
 		cfs_rq_util_change(cfs_rq);
 
-	return decayed || removed_load;
+	return decayed;
 }
 
 /**
@@ -3622,6 +3633,7 @@ void sync_entity_load_avg(struct sched_e
 void remove_entity_load_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	unsigned long flags;
 
 	/*
 	 * tasks cannot exit without having gone through wake_up_new_task() ->
@@ -3631,11 +3643,19 @@ void remove_entity_load_avg(struct sched
 	 * Similarly for groups, they will have passed through
 	 * post_init_entity_util_avg() before unregister_sched_fair_group()
 	 * calls this.
+	 *
+	 * XXX in case entity_is_task(se) && task_of(se)->on_rq == MIGRATING
+	 * we could actually get the right time, since we're called with
+	 * rq->lock held, see detach_task().
 	 */
 
 	sync_entity_load_avg(se);
-	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
-	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
+
+	raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
+	++cfs_rq->removed.nr;
+	cfs_rq->removed.util_avg	+= se->avg.util_avg;
+	cfs_rq->removed.load_avg	+= se->avg.load_avg;
+	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
@@ -9302,8 +9322,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	cfs_rq->propagate_avg = 0;
 #endif
-	atomic_long_set(&cfs_rq->removed_load_avg, 0);
-	atomic_long_set(&cfs_rq->removed_util_avg, 0);
+	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -446,14 +446,19 @@ struct cfs_rq {
 	struct sched_avg avg;
 	u64 runnable_load_sum;
 	unsigned long runnable_load_avg;
+#ifndef CONFIG_64BIT
+	u64 load_last_update_time_copy;
+#endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
 	unsigned long propagate_avg;
 #endif
-	atomic_long_t removed_load_avg, removed_util_avg;
-#ifndef CONFIG_64BIT
-	u64 load_last_update_time_copy;
-#endif
+	struct {
+		raw_spinlock_t	lock ____cacheline_aligned;
+		int		nr;
+		unsigned long	load_avg;
+		unsigned long	util_avg;
+	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (10 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 11/18] sched/fair: Rewrite cfs_rq->removed_*avg Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-10-09  8:08   ` Morten Rasmussen
  2017-10-09 15:03   ` Vincent Guittot
  2017-09-01 13:21 ` [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
                   ` (5 subsequent siblings)
  17 siblings, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-update_tg_cfs_load-1.patch --]
[-- Type: text/plain, Size: 11533 bytes --]

When an entity migrates into (or out of) a runqueue, we need to add its
contribution to (or remove it from) the entire PELT hierarchy, because even
non-runnable entities are included in the load average sums.

In order to do this we have some propagation logic that updates the
PELT tree, however the way it 'propagates' the runnable (or load)
change is (more or less):

                     tg->weight * grq->avg.load_avg
  ge->avg.load_avg = ------------------------------
                               tg->load_avg

But that is the expression for ge->weight, and per the definition of
load_avg:

  ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

That destroys the runnable_avg we wanted to propagate, because it
effectively sets it to 1.

Instead directly propagate runnable_sum.
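
To make the differential form concrete, here is a small stand-alone
user-space sketch (simplified: plain longs, a constant divider, no clamping,
on_rq handling or locking; propagate_runnable_sum(), struct sa and the
numbers are made up for illustration) of how a propagated runnable_sum
updates the group entity and its parent cfs_rq:

#include <stdio.h>

#define LOAD_AVG_MAX 47742      /* same order as the kernel constant */

struct sa { long load_sum, load_avg; };

static void propagate_runnable_sum(struct sa *ge, struct sa *parent,
                                   long ge_weight, long runnable_sum)
{
        /* the propagated runnable_sum, scaled by the entity weight */
        long load_sum = ge_weight * runnable_sum;
        long load_avg = load_sum / LOAD_AVG_MAX;

        /* the group entity's load_sum is the unweighted runnable_sum */
        ge->load_sum += runnable_sum;
        ge->load_avg += load_avg;

        /* the parent runqueue accumulates the weighted sum */
        parent->load_sum += load_sum;
        parent->load_avg += load_avg;
}

int main(void)
{
        struct sa ge = { 0, 0 }, parent = { 0, 0 };

        /* the group runqueue gained an entity with ~half-runnable history */
        propagate_runnable_sum(&ge, &parent, 1024, LOAD_AVG_MAX / 2);
        printf("ge.load_avg=%ld parent.load_avg=%ld\n",
               ge.load_avg, parent.load_avg);
        return 0;
}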

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    2 
 kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
 kernel/sched/sched.h |    9 +-
 3 files changed, 112 insertions(+), 85 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
+			cfs_rq->removed.runnable_sum);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
 	se->avg.last_update_time = n_last_update_time;
 }
 
-/* Take into account change of utilization of a child task group */
+
+/*
+ * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
+ * propagate its contribution. The key to this propagation is the invariant
+ * that for each group:
+ *
+ *   ge->avg == grq->avg						(1)
+ *
+ * _IFF_ we look at the pure running and runnable sums. Because they
+ * represent the very same entity, just at different points in the hierarchy.
+ *
+ *
+ * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
+ * simply copies the running sum over.
+ *
+ * However, update_tg_cfs_runnable() is more complex. So we have:
+ *
+ *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
+ *
+ * And since, like util, the runnable part should be directly transferable,
+ * the following would _appear_ to be the straight forward approach:
+ *
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *
+ * And per (1) we have:
+ *
+ *   ge->avg.running_avg == grq->avg.running_avg
+ *
+ * Which gives:
+ *
+ *                      ge->load.weight * grq->avg.load_avg
+ *   ge->avg.load_avg = -----------------------------------		(4)
+ *                               grq->load.weight
+ *
+ * Except that is wrong!
+ *
+ * Because while for entities historical weight is not important and we
+ * really only care about our future and therefore can consider a pure
+ * runnable sum, runqueues can NOT do this.
+ *
+ * We specifically want runqueues to have a load_avg that includes
+ * historical weights. Those represent the blocked load, the load we expect
+ * to (shortly) return to us. This only works by keeping the weights as
+ * integral part of the sum. We therefore cannot decompose as per (3).
+ *
+ * OK, so what then?
+ *
+ *
+ * Another way to look at things is:
+ *
+ *   grq->avg.load_avg = \Sum se->avg.load_avg
+ *
+ * Therefore, per (2):
+ *
+ *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
+ *
+ * And the very thing we're propagating is a change in that sum (someone
+ * joined/left). So we can easily know the runnable change, which would be, per
+ * (2) the already tracked se->load_avg divided by the corresponding
+ * se->weight.
+ *
+ * Basically (4) but in differential form:
+ *
+ *   d(runnable_avg) += se->avg.load_avg / se->load.weight
+ *								   (5)
+ *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
+ */
+
 static inline void
-update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
 
 	/* Nothing to update */
@@ -3350,102 +3416,59 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
-/* Take into account change of load of a child task group */
 static inline void
-update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
-
-	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
-	 */
-	if (load) {
-		long tg_load;
+	long runnable_sum = gcfs_rq->prop_runnable_sum;
+	long load_avg;
+	s64 load_sum;
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
-	}
+	if (!runnable_sum)
+		return;
 
-	delta = load - se->avg.load_avg;
+	gcfs_rq->prop_runnable_sum = 0;
 
-	/* Nothing to update */
-	if (!delta)
-		return;
+	load_sum = (s64)se_weight(se) * runnable_sum;
+	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	/* Set new sched_entity's load */
-	se->avg.load_avg = load;
-	se->avg.load_sum = LOAD_AVG_MAX;
+	add_positive(&se->avg.load_sum, runnable_sum);
+	add_positive(&se->avg.load_avg, load_avg);
 
-	/* Update parent cfs_rq load */
-	add_positive(&cfs_rq->avg.load_avg, delta);
-	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	add_positive(&cfs_rq->avg.load_avg, load_avg);
+	add_positive(&cfs_rq->avg.load_sum, load_sum);
 
-	/*
-	 * If the sched_entity is already enqueued, we also have to update the
-	 * runnable load avg.
-	 */
 	if (se->on_rq) {
-		/* Update parent cfs_rq runnable_load_avg */
-		add_positive(&cfs_rq->runnable_load_avg, delta);
-		cfs_rq->runnable_load_sum = cfs_rq->runnable_load_avg * LOAD_AVG_MAX;
+		add_positive(&cfs_rq->runnable_load_avg, load_avg);
+		add_positive(&cfs_rq->runnable_load_sum, load_sum);
 	}
 }
 
-static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
-{
-	cfs_rq->propagate_avg = 1;
-}
-
-static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se)
+static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
 {
-	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-
-	if (!cfs_rq->propagate_avg)
-		return 0;
-
-	cfs_rq->propagate_avg = 0;
-	return 1;
+	cfs_rq->propagate = 1;
+	cfs_rq->prop_runnable_sum += runnable_sum;
 }
 
 /* Update task and its cfs_rq load average */
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *gcfs_rq;
 
 	if (entity_is_task(se))
 		return 0;
 
-	if (!test_and_clear_tg_cfs_propagate(se))
+	gcfs_rq = group_cfs_rq(se);
+	if (!gcfs_rq->propagate)
 		return 0;
 
+	gcfs_rq->propagate = 0;
+
 	cfs_rq = cfs_rq_of(se);
 
-	set_tg_cfs_propagate(cfs_rq);
+	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
-	update_tg_cfs_util(cfs_rq, se);
-	update_tg_cfs_load(cfs_rq, se);
+	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
 
 	return 1;
 }
@@ -3469,7 +3492,7 @@ static inline bool skip_blocked_update(s
 	 * If there is a pending propagation, we have to update the load and
 	 * the utilization of the sched_entity:
 	 */
-	if (gcfs_rq->propagate_avg)
+	if (gcfs_rq->propagate)
 		return false;
 
 	/*
@@ -3489,7 +3512,7 @@ static inline int propagate_entity_load_
 	return 0;
 }
 
-static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {}
+static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -3512,7 +3535,7 @@ static inline void set_tg_cfs_propagate(
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0;
+	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3522,6 +3545,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
+		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3537,7 +3561,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 
-		set_tg_cfs_propagate(cfs_rq);
+		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
 
 		decayed = 1;
 	}
@@ -3569,7 +3593,8 @@ static void attach_entity_load_avg(struc
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
-	set_tg_cfs_propagate(cfs_rq);
+
+	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3587,7 +3612,8 @@ static void detach_entity_load_avg(struc
 	dequeue_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
-	set_tg_cfs_propagate(cfs_rq);
+
+	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3689,6 +3715,7 @@ void remove_entity_load_avg(struct sched
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
+	cfs_rq->removed.runnable_sum	+= se->avg.load_sum; /* == runnable_sum */
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
@@ -9468,9 +9495,6 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
 #ifdef CONFIG_SMP
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	cfs_rq->propagate_avg = 0;
-#endif
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -449,18 +449,19 @@ struct cfs_rq {
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	unsigned long tg_load_avg_contrib;
-	unsigned long propagate_avg;
-#endif
 	struct {
 		raw_spinlock_t	lock ____cacheline_aligned;
 		int		nr;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
+		unsigned long	runnable_sum;
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	unsigned long tg_load_avg_contrib;
+	long propagate;
+	long prop_runnable_sum;
+
 	/*
 	 *   h_load = weight * f(tg)
 	 *

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (11 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-10-02 17:46   ` Dietmar Eggemann
  2017-10-03 12:26   ` Dietmar Eggemann
  2017-09-01 13:21 ` [PATCH -v2 14/18] sched/fair: Synchronous PELT detach on load-balance migrate Peter Zijlstra
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: tj-sched-fair-avg-runnable_load.patch --]
[-- Type: text/plain, Size: 16170 bytes --]

The load balancer uses runnable_load_avg as its load indicator. For
!cgroup this is:

  runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq

That is, a direct sum of all runnable tasks on that runqueue. As
opposed to load_avg, which is a sum of all tasks on the runqueue,
which includes a blocked component.

However, in the cgroup case, this comes apart since the group entities
are always runnable, even if most of their constituent entities are
blocked.

Therefore introduce a runnable_weight which for task entities is the
same as the regular weight, but for group entities is a fraction of
the entity weight and represents the runnable part of the group
runqueue.

Then propagate this load through the PELT hierarchy to arrive at an
effective runnable load average -- which we should not confuse with
the canonical runnable load average.
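
As a rough stand-alone illustration (plain longs, no scale_load or clamping,
hypothetical numbers; see update_cfs_group() in the patch for the real
thing), the runnable part of a group entity's weight follows the group
runqueue's runnable_load_avg / load_avg ratio:

#include <stdio.h>

static long max_l(long a, long b) { return a > b ? a : b; }

static long group_runnable_weight(long shares, long grq_runnable_load_avg,
                                  long grq_load_avg)
{
        long runnable = shares * grq_runnable_load_avg;

        /* guard against the sporadic runnable > load case */
        if (runnable)
                runnable /= max_l(grq_load_avg, grq_runnable_load_avg);

        return runnable;
}

int main(void)
{
        /* mostly blocked group: 1/4 of its load is currently runnable */
        printf("%ld\n", group_runnable_weight(1024, 256, 1024));   /* 256 */
        /* fully runnable group: runnable weight equals the shares */
        printf("%ld\n", group_runnable_weight(1024, 1024, 1024));  /* 1024 */
        return 0;
}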

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    3 
 kernel/sched/debug.c  |    8 ++
 kernel/sched/fair.c   |  173 ++++++++++++++++++++++++++++++++------------------
 kernel/sched/sched.h  |    3 
 4 files changed, 125 insertions(+), 62 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -332,9 +332,11 @@ struct load_weight {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
+	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
+	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 };
 
@@ -377,6 +379,7 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
+	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -436,9 +436,11 @@ static void print_cfs_group_stats(struct
 		P_SCHEDSTAT(se->statistics.wait_count);
 	}
 	P(se->load.weight);
+	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
+	P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -555,10 +557,11 @@ void print_cfs_rq(struct seq_file *m, in
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
+	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->runnable_load_avg);
+			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
@@ -1003,10 +1006,13 @@ void proc_sched_show_task(struct task_st
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
+	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
+	P(se.avg.runnable_load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
+	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 #endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -743,8 +743,9 @@ void init_entity_runnable_average(struct
 	 * nothing has been attached to the task group yet.
 	 */
 	if (entity_is_task(se))
-		sa->load_avg = scale_load_down(se->load.weight);
-	sa->load_sum = LOAD_AVG_MAX;
+		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
+	sa->runnable_load_sum = sa->load_sum = LOAD_AVG_MAX;
+
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
@@ -2744,25 +2745,35 @@ account_entity_dequeue(struct cfs_rq *cf
 
 #ifdef CONFIG_SMP
 /*
- * XXX we want to get rid of this helper and use the full load resolution.
+ * XXX we want to get rid of these helpers and use the full load resolution.
  */
 static inline long se_weight(struct sched_entity *se)
 {
 	return scale_load_down(se->load.weight);
 }
 
+static inline long se_runnable(struct sched_entity *se)
+{
+	return scale_load_down(se->runnable_weight);
+}
+
 static inline void
 enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_load_avg += se->avg.load_avg;
-	cfs_rq->runnable_load_sum += se_weight(se) * se->avg.load_sum;
+	cfs_rq->runnable_weight += se->runnable_weight;
+
+	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
+	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
 }
 
 static inline void
 dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
-	sub_positive(&cfs_rq->runnable_load_sum, se_weight(se) * se->avg.load_sum);
+	cfs_rq->runnable_weight -= se->runnable_weight;
+
+	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
+	sub_positive(&cfs_rq->avg.runnable_load_sum,
+		     se_runnable(se) * se->avg.runnable_load_sum);
 }
 
 static inline void
@@ -2790,7 +2802,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
 #endif
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight)
+			    unsigned long weight, unsigned long runnable)
 {
 	if (se->on_rq) {
 		/* commit outstanding execution time */
@@ -2801,11 +2813,17 @@ static void reweight_entity(struct cfs_r
 	}
 	dequeue_load_avg(cfs_rq, se);
 
+	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
-	se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum,
-				   LOAD_AVG_MAX - 1024 + se->avg.period_contrib);
+	do {
+		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
+
+		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
+		se->avg.runnable_load_avg =
+			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
+	} while (0);
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
@@ -2822,7 +2840,7 @@ void reweight_task(struct task_struct *p
 	struct load_weight *load = &se->load;
 	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
 
-	reweight_entity(cfs_rq, se, weight);
+	reweight_entity(cfs_rq, se, weight, weight);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -2931,31 +2949,45 @@ static long calc_cfs_shares(struct cfs_r
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 
-static void update_cfs_shares(struct sched_entity *se)
+/*
+ * Recomputes the group entity based on the current state of its group
+ * runqueue.
+ */
+static void update_cfs_group(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-	long shares;
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+	long shares, runnable;
 
-	if (!cfs_rq)
+	if (!gcfs_rq)
 		return;
 
-	if (throttled_hierarchy(cfs_rq))
+	if (throttled_hierarchy(gcfs_rq))
 		return;
 
 #ifndef CONFIG_SMP
-	shares = READ_ONCE(cfs_rq->tg->shares);
+	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
 
 	if (likely(se->load.weight == shares))
 		return;
 #else
-	shares = calc_cfs_shares(cfs_rq);
+	shares = calc_cfs_shares(gcfs_rq);
+	/*
+	 * The hierarchical runnable load metric is the proportional part
+	 * of this group's runnable_load_avg / load_avg.
+	 *
+	 * Note: we need to deal with very sporadic 'runnable > load' cases
+	 * due to numerical instability.
+	 */
+	runnable = shares * gcfs_rq->avg.runnable_load_avg;
+	if (runnable)
+		runnable /= max(gcfs_rq->avg.load_avg, gcfs_rq->avg.runnable_load_avg);
 #endif
 
-	reweight_entity(cfs_rq_of(se), se, shares);
+	reweight_entity(cfs_rq_of(se), se, shares, runnable);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_shares(struct sched_entity *se)
+static inline void update_cfs_group(struct sched_entity *se)
 {
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -3062,7 +3094,7 @@ static u32 __accumulate_pelt_segments(u6
  */
 static __always_inline u32
 accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
-	       unsigned long weight, int running, struct cfs_rq *cfs_rq)
+	       unsigned long load, unsigned long runnable, int running)
 {
 	unsigned long scale_freq, scale_cpu;
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
@@ -3079,10 +3111,8 @@ accumulate_sum(u64 delta, int cpu, struc
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_sum =
-				decay_load(cfs_rq->runnable_load_sum, periods);
-		}
+		sa->runnable_load_sum =
+			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -3095,11 +3125,10 @@ accumulate_sum(u64 delta, int cpu, struc
 	sa->period_contrib = delta;
 
 	contrib = cap_scale(contrib, scale_freq);
-	if (weight) {
-		sa->load_sum += weight * contrib;
-		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * contrib;
-	}
+	if (load)
+		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_load_sum += runnable * contrib;
 	if (running)
 		sa->util_sum += contrib * scale_cpu;
 
@@ -3136,7 +3165,7 @@ accumulate_sum(u64 delta, int cpu, struc
  */
 static __always_inline int
 ___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
-		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
+		  unsigned long load, unsigned long runnable, int running)
 {
 	u64 delta;
 
@@ -3169,8 +3198,8 @@ ___update_load_sum(u64 now, int cpu, str
 	 * this happens during idle_balance() which calls
 	 * update_blocked_averages()
 	 */
-	if (!weight)
-		running = 0;
+	if (!load)
+		runnable = running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -3179,45 +3208,60 @@ ___update_load_sum(u64 now, int cpu, str
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
+	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long weight, struct cfs_rq *cfs_rq)
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
 	/*
 	 * Step 2: update *_avg.
 	 */
-	sa->load_avg = div_u64(weight * sa->load_sum, divider);
-	if (cfs_rq) {
-		cfs_rq->runnable_load_avg =
-			div_u64(cfs_rq->runnable_load_sum, divider);
-	}
+	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
 	sa->util_avg = sa->util_sum / divider;
 }
 
 /*
  * sched_entity:
  *
+ *   task:
+ *     se_runnable() == se_weight()
+ *
+ *   group: [ see update_cfs_group() ]
+ *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *
  *   load_sum := runnable_sum
  *   load_avg = se_weight(se) * runnable_avg
  *
+ *   runnable_load_sum := runnable_sum
+ *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *
+ * XXX collapse load_sum and runnable_load_sum
+ *
  * cfq_rs:
  *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
+ *
+ *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
+ *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
 static int
 __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
 {
-	if (___update_load_sum(now, cpu, &se->avg, 0, 0, NULL)) {
-		___update_load_avg(&se->avg, se_weight(se), NULL);
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
 		return 1;
 	}
 
@@ -3227,10 +3271,13 @@ __update_load_avg_blocked_se(u64 now, in
 static int
 __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq,
-				cfs_rq->curr == se, NULL)) {
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+				cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), NULL);
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
 		return 1;
 	}
 
@@ -3242,8 +3289,10 @@ __update_load_avg_cfs_rq(u64 now, int cp
 {
 	if (___update_load_sum(now, cpu, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				cfs_rq->curr != NULL, cfs_rq)) {
-		___update_load_avg(&cfs_rq->avg, 1, cfs_rq);
+				scale_load_down(cfs_rq->runnable_weight),
+				cfs_rq->curr != NULL)) {
+
+		___update_load_avg(&cfs_rq->avg, 1, 1);
 		return 1;
 	}
 
@@ -3421,8 +3470,8 @@ static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long load_avg;
-	s64 load_sum;
+	long runnable_load_avg, load_avg;
+	s64 runnable_load_sum, load_sum;
 
 	if (!runnable_sum)
 		return;
@@ -3438,9 +3487,15 @@ update_tg_cfs_runnable(struct cfs_rq *cf
 	add_positive(&cfs_rq->avg.load_avg, load_avg);
 	add_positive(&cfs_rq->avg.load_sum, load_sum);
 
+	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
+	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+
+	add_positive(&se->avg.runnable_load_sum, runnable_sum);
+	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+
 	if (se->on_rq) {
-		add_positive(&cfs_rq->runnable_load_avg, load_avg);
-		add_positive(&cfs_rq->runnable_load_sum, load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
 	}
 }
 
@@ -3722,7 +3777,7 @@ void remove_entity_load_avg(struct sched
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->runnable_load_avg;
+	return cfs_rq->avg.runnable_load_avg;
 }
 
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
@@ -3892,8 +3947,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
+	update_cfs_group(se);
 	enqueue_runnable_load_avg(cfs_rq, se);
-	update_cfs_shares(se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -3999,7 +4054,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_shares(se);
+	update_cfs_group(se);
 
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
@@ -4182,7 +4237,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
-	update_cfs_shares(curr);
+	update_cfs_group(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
 	/*
@@ -5100,7 +5155,7 @@ enqueue_task_fair(struct rq *rq, struct
 			break;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_shares(se);
+		update_cfs_group(se);
 	}
 
 	if (!se)
@@ -5159,7 +5214,7 @@ static void dequeue_task_fair(struct rq
 			break;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_shares(se);
+		update_cfs_group(se);
 	}
 
 	if (!se)
@@ -7184,7 +7239,7 @@ static inline bool cfs_rq_is_decayed(str
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->runnable_load_sum)
+	if (cfs_rq->avg.runnable_load_sum)
 		return false;
 
 	return true;
@@ -9689,7 +9744,7 @@ int sched_group_set_shares(struct task_g
 		update_rq_clock(rq);
 		for_each_sched_entity(se) {
 			update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
-			update_cfs_shares(se);
+			update_cfs_group(se);
 		}
 		rq_unlock_irqrestore(rq, &rf);
 	}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -418,6 +418,7 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
+	unsigned long runnable_weight;
 	unsigned int nr_running, h_nr_running;
 
 	u64 exec_clock;
@@ -444,8 +445,6 @@ struct cfs_rq {
 	 * CFS load tracking
 	 */
 	struct sched_avg avg;
-	u64 runnable_load_sum;
-	unsigned long runnable_load_avg;
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 14/18] sched/fair: Synchronous PELT detach on load-balance migrate
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (12 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se Peter Zijlstra
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-syn-migrate.patch --]
[-- Type: text/plain, Size: 2682 bytes --]

Vincent wondered why his self-migrating task had a roughly 50% dip in
load_avg when landing on the new CPU. This is because we unconditionally
take the asynchronous detach_entity route, which can lead to the
attach on the new CPU still seeing the old CPU's contribution to
tg->load_avg, effectively halving the new CPU's shares.

While in general this is something we have to live with, there is the
special case of runnable migration where we can do better.
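
A back-of-the-envelope sketch of that halving, using the shares
approximation from earlier in the series with made-up numbers (one task,
one group, everything else idle):

#include <stdio.h>

int main(void)
{
        long tg_weight = 1024, task_load = 1024;

        /* async detach: the old CPU's contribution is still in tg->load_avg */
        long tg_load_avg_stale = task_load + task_load;
        /* sync detach: the old contribution is already gone */
        long tg_load_avg_sync = task_load;

        printf("stale: %ld\n", tg_weight * task_load / tg_load_avg_stale); /* 512 */
        printf("sync:  %ld\n", tg_weight * task_load / tg_load_avg_sync);  /* 1024 */
        return 0;
}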

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3649,10 +3649,6 @@ void remove_entity_load_avg(struct sched
 	 * Similarly for groups, they will have passed through
 	 * post_init_entity_util_avg() before unregister_sched_fair_group()
 	 * calls this.
-	 *
-	 * XXX in case entity_is_task(se) && task_of(se)->on_rq == MIGRATING
-	 * we could actually get the right time, since we're called with
-	 * rq->lock held, see detach_task().
 	 */
 
 	sync_entity_load_avg(se);
@@ -6251,6 +6247,8 @@ select_task_rq_fair(struct task_struct *
 	return new_cpu;
 }
 
+static void detach_entity_cfs_rq(struct sched_entity *se);
+
 /*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -6284,14 +6282,25 @@ static void migrate_task_rq_fair(struct
 		se->vruntime -= min_vruntime;
 	}
 
-	/*
-	 * We are supposed to update the task to "current" time, then its up to date
-	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
-	 * what current time is, so simply throw away the out-of-date time. This
-	 * will result in the wakee task is less decayed, but giving the wakee more
-	 * load sounds not bad.
-	 */
-	remove_entity_load_avg(&p->se);
+	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
+		/*
+		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
+		 * rq->lock and can modify state directly.
+		 */
+		lockdep_assert_held(&task_rq(p)->lock);
+		detach_entity_cfs_rq(&p->se);
+
+	} else {
+		/*
+		 * We are supposed to update the task to "current" time, then
+		 * its up to date and ready to go to new CPU/cfs_rq. But we
+		 * have difficulty in getting what current time is, so simply
+		 * throw away the out-of-date time. This will result in the
+		 * wakee task is less decayed, but giving the wakee more load
+		 * sounds not bad.
+		 */
+		remove_entity_load_avg(&p->se);
+	}
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (13 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 14/18] sched/fair: Synchronous PELT detach on load-balance migrate Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-10-04 19:27   ` Dietmar Eggemann
  2017-09-01 13:21 ` [PATCH -v2 16/18] sched/fair: More accurate async detach Peter Zijlstra
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-align-windows.patch --]
[-- Type: text/plain, Size: 3221 bytes --]

The PELT _sum values are a saw-tooth function, dropping on the decay
edge and then growing back up again during the window.

When these window-edges are not aligned between cfs_rq and se, we can
have the situation where, for example, on dequeue, the se decays
first.

Its _sum values will be small(er), while the cfs_rq _sum values will
still be on their way up. Because of this, the subtraction:
cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
will then, once the cfs_rq reaches an edge, translate into its _avg
value jumping up.

This is especially visible with the runnable_load bits, since they get
added/subtracted a lot.
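
A toy illustration of that residue (made-up numbers, not the real PELT decay
series): the se has just crossed its window edge and decayed, the cfs_rq has
not, so the dequeue subtraction leaves something behind even though the
cfs_rq only ever contained this one se:

#include <stdio.h>

int main(void)
{
        long sum = 10000;                /* both tracked the same single se */
        long se_sum = sum * 1000 / 1024; /* se already decayed at its edge */
        long rq_sum = sum;               /* cfs_rq edge not reached yet */

        rq_sum -= se_sum;                /* dequeue subtraction */
        printf("residue left on the cfs_rq: %ld\n", rq_sum);   /* > 0 */
        return 0;
}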

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -729,13 +729,8 @@ void init_entity_runnable_average(struct
 {
 	struct sched_avg *sa = &se->avg;
 
-	sa->last_update_time = 0;
-	/*
-	 * sched_avg's period_contrib should be strictly less then 1024, so
-	 * we give it 1023 to make sure it is almost a period (1024us), and
-	 * will definitely be update (after enqueue).
-	 */
-	sa->period_contrib = 1023;
+	memset(sa, 0, sizeof(*sa));
+
 	/*
 	 * Tasks are intialized with full load to be seen as heavy tasks until
 	 * they get a chance to stabilize to their real load level.
@@ -744,13 +739,9 @@ void init_entity_runnable_average(struct
 	 */
 	if (entity_is_task(se))
 		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-	sa->runnable_load_sum = sa->load_sum = LOAD_AVG_MAX;
 
-	/*
-	 * At this point, util_avg won't be used in select_task_rq_fair anyway
-	 */
-	sa->util_avg = 0;
-	sa->util_sum = 0;
+	se->runnable_weight = se->load.weight;
+
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
@@ -798,7 +789,6 @@ void post_init_entity_util_avg(struct sc
 		} else {
 			sa->util_avg = cap;
 		}
-		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
 
 	if (entity_is_task(se)) {
@@ -3644,7 +3634,34 @@ update_cfs_rq_load_avg(u64 now, struct c
  */
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
+
+	/*
+	 * When we attach the @se to the @cfs_rq, we must align the decay
+	 * window because without that, really weird and wonderful things can
+	 * happen.
+	 *
+	 * XXX illustrate
+	 */
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
+	se->avg.period_contrib = cfs_rq->avg.period_contrib;
+
+	/*
+	 * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
+	 * period_contrib. This isn't strictly correct, but since we're
+	 * entirely outside of the PELT hierarchy, nobody cares if we truncate
+	 * _sum a little.
+	 */
+	se->avg.util_sum = se->avg.util_avg * divider;
+
+	se->avg.load_sum = divider;
+	if (se_weight(se)) {
+		se->avg.load_sum =
+			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
+	}
+
+	se->avg.runnable_load_sum = se->avg.load_sum;
+
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 16/18] sched/fair: More accurate async detach
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (14 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 17/18] sched/fair: Calculate runnable_weight slightly differently Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 18/18] sched/fair: Update calc_group_*() comments Peter Zijlstra
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-more-accurate-remove.patch --]
[-- Type: text/plain, Size: 1379 bytes --]

When folding the removed *_avg back into *_sum we used LOAD_AVG_MAX as the
divider, which is a slight over-estimate. The problem with that overestimate
is that it will subtract too big a value from the load_sum, thereby pushing
it down further than it ought
to go. Since runnable_load_avg is not subject to a similar 'force',
this results in the occasional 'runnable_load > load' situation.
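
A quick stand-alone comparison of the old and new dividers (LOAD_AVG_MAX as
in the kernel, the period_contrib and removed load values are hypothetical):

#include <stdio.h>

#define LOAD_AVG_MAX 47742

int main(void)
{
        unsigned long removed_load_avg = 512;
        unsigned long period_contrib = 200;     /* anything in 0..1023 */
        unsigned long divider = LOAD_AVG_MAX - 1024 + period_contrib;

        /* the old form can subtract up to 1024 * avg too much from _sum */
        printf("old: %lu\n", removed_load_avg * LOAD_AVG_MAX);
        printf("new: %lu\n", removed_load_avg * divider);
        return 0;
}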

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3469,6 +3469,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 
 	if (cfs_rq->removed.nr) {
 		unsigned long r;
+		u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
@@ -3477,17 +3478,13 @@ update_cfs_rq_load_avg(u64 now, struct c
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
-		/*
-		 * The LOAD_AVG_MAX for _sum is a slight over-estimate,
-		 * which is safe due to sub_positive() clipping at 0.
-		 */
 		r = removed_load;
 		sub_positive(&sa->load_avg, r);
-		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
+		sub_positive(&sa->load_sum, r * divider);
 
 		r = removed_util;
 		sub_positive(&sa->util_avg, r);
-		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
+		sub_positive(&sa->util_sum, r * divider);
 
 		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 17/18] sched/fair: Calculate runnable_weight slightly differently
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (15 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 16/18] sched/fair: More accurate async detach Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  2017-09-01 13:21 ` [PATCH -v2 18/18] sched/fair: Update calc_group_*() comments Peter Zijlstra
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz, mingo, riel,
	kernel-team

[-- Attachment #1: josef_bacik-sched_fair-calculate_runnable_weight_slightly_differently.patch --]
[-- Type: text/plain, Size: 3740 bytes --]

From: Josef Bacik <jbacik@fb.com>

Our runnable_weight currently looks like this:

runnable_weight = shares * runnable_load_avg / load_avg

The goal is to scale the runnable weight for the group based on its runnable to
load_avg ratio.  The problem with this is that it biases us towards tasks that never
go to sleep.  Tasks that go to sleep are going to have their runnable_load_avg
decayed pretty hard, which will drastically reduce the runnable weight of groups
with interactive tasks.  To solve this imbalance we tweak this slightly, so in
the ideal case it is still the above, but in the interactive case it is

runnable_weight = shares * runnable_weight / load_weight

which will make the weight distribution fairer between interactive and
non-interactive groups.
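
A stand-alone sketch with made-up numbers contrasting the purely avg-based
form with the tweaked one, for a group whose interactive task just woke up
after a long sleep:

#include <stdio.h>

static long max_l(long a, long b) { return a > b ? a : b; }

int main(void)
{
        long shares = 1024;
        long grq_runnable_load_avg = 128;       /* decayed while the task slept */
        long grq_load_avg = 900;
        long grq_runnable_weight = 1024;        /* the task is enqueued right now */
        long grq_load_weight = 1024;

        long avg_only = shares * grq_runnable_load_avg / grq_load_avg;
        long tweaked = shares * max_l(grq_runnable_load_avg, grq_runnable_weight)
                        / max_l(grq_load_avg, grq_load_weight);

        printf("avg only: %ld, tweaked: %ld\n", avg_only, tweaked); /* 145 vs 1024 */
        return 0;
}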

Cc: mingo@redhat.com
Cc: riel@redhat.com
Cc: kernel-team@fb.com
Cc: tj@kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
---
 kernel/sched/fair.c |   45 +++++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2897,7 +2897,7 @@ void reweight_task(struct task_struct *p
  *
  * hence icky!
  */
-static long calc_cfs_shares(struct cfs_rq *cfs_rq)
+static long calc_group_shares(struct cfs_rq *cfs_rq)
 {
 	long tg_weight, tg_shares, load, shares;
 	struct task_group *tg = cfs_rq->tg;
@@ -2934,6 +2934,36 @@ static long calc_cfs_shares(struct cfs_r
 	 */
 	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
+
+/*
+ * The runnable shares of this group are calculated as such
+ *
+ *          max(cfs_rq->avg.runnable_load_avg, cfs_rq->runnable_weight)
+ * shares * ------------------------------------------------------------
+ *               max(cfs_rq->avg.load_avg, cfs_rq->load.weight)
+ *
+ * We do this to keep the shares in line with expected load on the cfs_rq.
+ * Consider a cfs_rq that has several tasks wake up on this cfs_rq for the first
+ * time, it's runnable_load_avg is not going to be representative of the actual
+ * load this cfs_rq will now experience, which will bias us agaisnt this cfs_rq.
+ * The weight on the cfs_rq is the immediate effect of having new tasks
+ * enqueue'd onto it which should be used to calculate the new runnable shares.
+ * At the same time we need the actual load_avg to be the lower bounds for the
+ * calculation, to handle when our weight drops quickly from having entities
+ * dequeued.
+ */
+static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
+{
+	long load_avg = max(cfs_rq->avg.load_avg,
+			    scale_load_down(cfs_rq->load.weight));
+	long runnable = max(cfs_rq->avg.runnable_load_avg,
+			    scale_load_down(cfs_rq->runnable_weight));
+
+	runnable *= shares;
+	if (load_avg)
+		runnable /= load_avg;
+	return clamp_t(long, runnable, MIN_SHARES, shares);
+}
 # endif /* CONFIG_SMP */
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -2959,17 +2989,8 @@ static void update_cfs_group(struct sche
 	if (likely(se->load.weight == shares))
 		return;
 #else
-	shares = calc_cfs_shares(gcfs_rq);
-	/*
-	 * The hierarchical runnable load metric is the proportional part
-	 * of this group's runnable_load_avg / load_avg.
-	 *
-	 * Note: we need to deal with very sporadic 'runnable > load' cases
-	 * due to numerical instability.
-	 */
-	runnable = shares * gcfs_rq->avg.runnable_load_avg;
-	if (runnable)
-		runnable /= max(gcfs_rq->avg.load_avg, gcfs_rq->avg.runnable_load_avg);
+	shares   = calc_group_shares(gcfs_rq);
+	runnable = calc_group_runnable(gcfs_rq, shares);
 #endif
 
 	reweight_entity(cfs_rq_of(se), se, shares, runnable);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH -v2 18/18] sched/fair: Update calc_group_*() comments
  2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
                   ` (16 preceding siblings ...)
  2017-09-01 13:21 ` [PATCH -v2 17/18] sched/fair: Calculate runnable_weight slightly differently Peter Zijlstra
@ 2017-09-01 13:21 ` Peter Zijlstra
  17 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-01 13:21 UTC (permalink / raw)
  To: mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, dietmar.eggemann,
	morten.rasmussen, bsegall, yuyang.du, peterz

[-- Attachment #1: peterz-sched-fair-comment-calc_group_runnable.patch --]
[-- Type: text/plain, Size: 4617 bytes --]

I had a wee bit of trouble recalling how the calc_group_runnable()
stuff worked.. add hopefully better comments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   68 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 23 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2874,7 +2874,7 @@ void reweight_task(struct task_struct *p
  * Now, in that special case (1) reduces to:
  *
  *                     tg->weight * grq->load.weight
- *   ge->load.weight = ----------------------------- = tg>weight   (4)
+ *   ge->load.weight = ----------------------------- = tg->weight   (4)
  *			    grp->load.weight
  *
  * That is, the sum collapses because all other CPUs are idle; the UP scenario.
@@ -2888,6 +2888,18 @@ void reweight_task(struct task_struct *p
  *     ---------------------------------------------------         (5)
  *     tg->load_avg - grq->avg.load_avg + grq->load.weight
  *
+ * But because grq->load.weight can drop to 0, resulting in a divide by zero,
+ * we need to use grq->avg.load_avg as its lower bound, which then gives:
+ *
+ *
+ *                     tg->weight * grq->load.weight
+ *   ge->load.weight = -----------------------------		   (6)
+ *				tg_load_avg'
+ *
+ * Where:
+ *
+ *   tg_load_avg' = tg->load_avg - grq->avg.load_avg +
+ *                  max(grq->load.weight, grq->avg.load_avg)
  *
  * And that is shares_weight and is icky. In the (near) UP case it approaches
  * (4) while in the normal case it approaches (3). It consistently
@@ -2904,10 +2916,6 @@ static long calc_group_shares(struct cfs
 
 	tg_shares = READ_ONCE(tg->shares);
 
-	/*
-	 * Because (5) drops to 0 when the cfs_rq is idle, we need to use (3)
-	 * as a lower bound.
-	 */
 	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
 	tg_weight = atomic_long_read(&tg->load_avg);
@@ -2936,32 +2944,46 @@ static long calc_group_shares(struct cfs
 }
 
 /*
- * The runnable shares of this group are calculated as such
+ * This calculates the effective runnable weight for a group entity based on
+ * the group entity weight calculated above.
+ *
+ * Because of the above approximation (2), our group entity weight is
+ * an load_avg based ratio (3). This means that it includes blocked load and
+ * does not represent the runnable weight.
  *
- *          max(cfs_rq->avg.runnable_load_avg, cfs_rq->runnable_weight)
- * shares * ------------------------------------------------------------
- *               max(cfs_rq->avg.load_avg, cfs_rq->load.weight)
- *
- * We do this to keep the shares in line with expected load on the cfs_rq.
- * Consider a cfs_rq that has several tasks wake up on this cfs_rq for the first
- * time, it's runnable_load_avg is not going to be representative of the actual
- * load this cfs_rq will now experience, which will bias us agaisnt this cfs_rq.
- * The weight on the cfs_rq is the immediate effect of having new tasks
- * enqueue'd onto it which should be used to calculate the new runnable shares.
- * At the same time we need the actual load_avg to be the lower bounds for the
- * calculation, to handle when our weight drops quickly from having entities
- * dequeued.
+ * Approximate the group entity's runnable weight per ratio from the group
+ * runqueue:
+ *
+ *					     grq->avg.runnable_load_avg
+ *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
+ *						 grq->avg.load_avg
+ *
+ * However, analogous to above, since the avg numbers are slow, this leads to
+ * transients in the from-idle case. Instead we use:
+ *
+ *   ge->runnable_weight = ge->load.weight *
+ *
+ *		max(grq->avg.runnable_load_avg, grq->runnable_weight)
+ *		-----------------------------------------------------	(8)
+ *		      max(grq->avg.load_avg, grq->load.weight)
+ *
+ * Where these max() serve both to use the 'instant' values to fix the slow
+ * from-idle and avoid the /0 on to-idle, similar to (6).
  */
 static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
 {
-	long load_avg = max(cfs_rq->avg.load_avg,
-			    scale_load_down(cfs_rq->load.weight));
-	long runnable = max(cfs_rq->avg.runnable_load_avg,
-			    scale_load_down(cfs_rq->runnable_weight));
+	long runnable, load_avg;
+
+	load_avg = max(cfs_rq->avg.load_avg,
+		       scale_load_down(cfs_rq->load.weight));
+
+	runnable = max(cfs_rq->avg.runnable_load_avg,
+		       scale_load_down(cfs_rq->runnable_weight));
 
 	runnable *= shares;
 	if (load_avg)
 		runnable /= load_avg;
+
 	return clamp_t(long, runnable, MIN_SHARES, shares);
 }
 # endif /* CONFIG_SMP */

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()
  2017-09-01 13:21 ` [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares() Peter Zijlstra
@ 2017-09-28 10:03   ` Morten Rasmussen
  2017-09-29 11:35     ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Rasmussen @ 2017-09-28 10:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 01, 2017 at 03:21:01PM +0200, Peter Zijlstra wrote:
> Explain the magic equation in calc_cfs_shares() a bit better.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2707,6 +2707,67 @@ account_entity_dequeue(struct cfs_rq *cf
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  # ifdef CONFIG_SMP
> +/*
> + * All this does is approximate the hierarchical proportion which includes that
> + * global sum we all love to hate.
> + *
> + * That is, the weight of a group entity, is the proportional share of the
> + * group weight based on the group runqueue weights. That is:
> + *
> + *                     tg->weight * grq->load.weight
> + *   ge->load.weight = -----------------------------               (1)
> + *			  \Sum grq->load.weight
> + *
> + * Now, because computing that sum is prohibitively expensive to compute (been
> + * there, done that) we approximate it with this average stuff. The average
> + * moves slower and therefore the approximation is cheaper and more stable.
> + *
> + * So instead of the above, we substitute:
> + *
> + *   grq->load.weight -> grq->avg.load_avg                         (2)
> + *
> + * which yields the following:
> + *
> + *                     tg->weight * grq->avg.load_avg
> + *   ge->load.weight = ------------------------------              (3)
> + *				tg->load_avg
> + *
> + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> + *
> + * That is shares_avg, and it is right (given the approximation (2)).
> + *
> + * The problem with it is that because the average is slow -- it was designed
> + * to be exactly that of course -- this leads to transients in boundary
> + * conditions. In specific, the case where the group was idle and we start the
> + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> + * yielding bad latency etc..
> + *
> + * Now, in that special case (1) reduces to:
> + *
> + *                     tg->weight * grq->load.weight
> + *   ge->load.weight = ----------------------------- = tg>weight   (4)
> + *			    grp->load.weight

Should it be "grq->load.weight" in the denominator of (4)?
And "tg->weight" at the end?

> + *
> + * That is, the sum collapses because all other CPUs are idle; the UP scenario.

Shouldn't (3) collapse in the same way too in this special case? In
theory it should reduce to:

                     tg->weight * grq->avg.load_avg
   ge->load.weight = ------------------------------
			grq->avg.load_avg


But I can see many reasons why it won't happen in practice if things
aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
(3) aren't in sync, or there are stale contributions to tg->load_avg
from other cpus then (3) can return anything between 0 and tg->weight.

> + *
> + * So what we do is modify our approximation (3) to approach (4) in the (near)
> + * UP case, like:
> + *
> + *   ge->load.weight =
> + *
> + *              tg->weight * grq->load.weight
> + *     ---------------------------------------------------         (5)
> + *     tg->load_avg - grq->avg.load_avg + grq->load.weight
> + *
> + *
> + * And that is shares_weight and is icky. In the (near) UP case it approaches
> + * (4) while in the normal case it approaches (3). It consistently
> + * overestimates the ge->load.weight and therefore:
> + *
> + *   \Sum ge->load.weight >= tg->weight
> + *
> + * hence icky!

IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
tasks, you can end up underestimating the ge->load.weight for some
of the group entities, leading to \Sum ge->load.weight < tg->weight.

Let's take a simple example:

Two cpus, one task group with three tasks in it: An always-running task
on both cpus, and an additional periodic task currently blocked on cpu 0
(contributing 512 to grq->avg.load_avg on cpu 0).

tg->weight		= 1024
tg->load_avg		= 2560
\Sum grq->load.weight	= 2048

cpu			0	1	\Sum
grq->avg.load_avg	1536	1024
grq->load.weight	1024	1024
ge->load_weight (1)	512	512	1024 >= tg->weight
ge->load_weight (3)	614	410	1024 >= tg->weight
ge->load_weight (5)	512	410	922 < tg->weight

So with (5) we are missing 102 worth of ge->load.weight.

If (1), the instantaneous ge->load.weight, is what we want, then
ge->load.weight of cpu 1 is underestimated, if (3), shares_avg, is the
goal, then ge->load.weight of cpu 0 is underestimated.

The "missing" ge->load.weight can get much larger if the blocked task
had higher priority.

Another thing is that we are losing a bit of the nice stability that
(3) provides if you have periodic tasks.

I'm not sure if we can do better than (5), I'm just trying to understand
how the approximation will behave and make sure we understand the
implications.
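
For reference, a small stand-alone program (not from the thread) that
reproduces the table, with division rounded to nearest as in the numbers
above:

#include <stdio.h>

static long rdiv(long n, long d) { return (n + d / 2) / d; }

int main(void)
{
        long tg_weight = 1024, tg_load_avg = 2560, sum_grq_weight = 2048;
        long grq_load_avg[2] = { 1536, 1024 };
        long grq_weight[2]   = { 1024, 1024 };
        long s1 = 0, s3 = 0, s5 = 0;

        for (int cpu = 0; cpu < 2; cpu++) {
                long w1 = rdiv(tg_weight * grq_weight[cpu], sum_grq_weight);
                long w3 = rdiv(tg_weight * grq_load_avg[cpu], tg_load_avg);
                long w5 = rdiv(tg_weight * grq_weight[cpu],
                               tg_load_avg - grq_load_avg[cpu] + grq_weight[cpu]);

                printf("cpu%d: (1)=%ld (3)=%ld (5)=%ld\n", cpu, w1, w3, w5);
                s1 += w1; s3 += w3; s5 += w5;
        }
        printf("sums: (1)=%ld (3)=%ld (5)=%ld\n", s1, s3, s5); /* 1024 1024 922 */
        return 0;
}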

Morten

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity()
  2017-09-01 13:21 ` [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity() Peter Zijlstra
@ 2017-09-29  9:04   ` Morten Rasmussen
  2017-09-29 11:38     ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Rasmussen @ 2017-09-29  9:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 01, 2017 at 03:21:02PM +0200, Peter Zijlstra wrote:
> Vincent reported that when running in a cgroup, his root
> cfs_rq->avg.load_avg dropped to 0 on task idle.
> 
> This is because reweight_entity() will now immediately propagate the
> weight change of the group entity to its cfs_rq, and as it happens,
> our approxmation (5) for calc_cfs_shares() results in 0 when the group
> is idle.
> 
> Avoid this by using the correct (3) as a lower bound on (5). This way
> the empty cgroup will slowly decay instead of instantly drop to 0.
> 
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |    7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2703,11 +2703,10 @@ static long calc_cfs_shares(struct cfs_r
>  	tg_shares = READ_ONCE(tg->shares);
>  
>  	/*
> -	 * This really should be: cfs_rq->avg.load_avg, but instead we use
> -	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
> -	 * the shares for small weight interactive tasks.
> +	 * Because (5) drops to 0 when the cfs_rq is idle, we need to use (3)
> +	 * as a lower bound.
>  	 */
> -	load = scale_load_down(cfs_rq->load.weight);
> +	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

We use cfs_rq->tg_load_avg_contrib (the filtered version of
cfs_rq->avg.load_avg) instead of cfs_rq->avg.load_avg further down, so I
think we should use it here too, for consistency.

+	load = max(scale_load_down(cfs_rq->load.weight),
+		   cfs_rq->tg_load_avg_contrib);

With this change (5) almost becomes (3):

   ge->load.weight =

                 tg->weight * max(grq->load.weight, grq->avg.load_avg)
     ---------------------------------------------------------------------------
     tg->load_avg - grq->avg.load_avg + max(grq->load.weight, grq->avg.load_avg)

The difference is that we boost ge->load.weight if the grq has
runnable tasks with se->avg.load_avg < se->load.weight, i.e. tasks that
occasionally block. This means that the underestimate scenario I have in
my reply for patch #2 is no longer possible. AFAICT, we are now
guaranteed to over-estimate ge->load.weight. It is still quite sensitive
to periodic high priority tasks though.

tg->weight              = 1024
tg->load_avg            = 2560
\Sum grq->load.weight   = 2048

cpu                     0       1       \Sum
grq->avg.load_avg       1536    1024
grq->load.weight        1024    1024
load (max)		1536	1024
ge->load_weight (1)     512     512     1024 >= tg->weight
ge->load_weight (3)     614     410     1024 >= tg->weight
ge->load_weight (5)     512     410     922 < tg->weight
ge->load_weight (5*)    614     410     1024 >= tg->weight
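
Spelling out how the (5*) row follows from (5) with the max() plugged in
(values rounded as in the tables):

  cpu0: 1024 * max(1024, 1536) / (2560 - 1536 + 1536) = 1024 * 1536 / 2560 ~ 614
  cpu1: 1024 * max(1024, 1024) / (2560 - 1024 + 1024) = 1024 * 1024 / 2560 ~ 410

  614 + 410 = 1024 >= tg->weight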

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()
  2017-09-28 10:03   ` Morten Rasmussen
@ 2017-09-29 11:35     ` Peter Zijlstra
  2017-09-29 13:03       ` Morten Rasmussen
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-29 11:35 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote:

> > +/*
> > + * All this does is approximate the hierarchical proportion which includes that
> > + * global sum we all love to hate.
> > + *
> > + * That is, the weight of a group entity, is the proportional share of the
> > + * group weight based on the group runqueue weights. That is:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = -----------------------------               (1)
> > + *			  \Sum grq->load.weight
> > + *
> > + * Now, because computing that sum is prohibitively expensive to compute (been
> > + * there, done that) we approximate it with this average stuff. The average
> > + * moves slower and therefore the approximation is cheaper and more stable.
> > + *
> > + * So instead of the above, we substitute:
> > + *
> > + *   grq->load.weight -> grq->avg.load_avg                         (2)
> > + *
> > + * which yields the following:
> > + *
> > + *                     tg->weight * grq->avg.load_avg
> > + *   ge->load.weight = ------------------------------              (3)
> > + *				tg->load_avg
> > + *
> > + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> > + *
> > + * That is shares_avg, and it is right (given the approximation (2)).
> > + *
> > + * The problem with it is that because the average is slow -- it was designed
> > + * to be exactly that of course -- this leads to transients in boundary
> > + * conditions. In specific, the case where the group was idle and we start the
> > + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> > + * yielding bad latency etc..
> > + *
> > + * Now, in that special case (1) reduces to:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = ----------------------------- = tg>weight   (4)
> > + *			    grp->load.weight
> 
> Should it be "grq->load.weight" in the denominator of (4)?
> And "tg->weight" at the end?

Yes, otherwise it all doesn't really make sense :-) Typing is hard, it
seems.

> > + *
> > + * That is, the sum collapses because all other CPUs are idle; the UP scenario.
> 
> Shouldn't (3) collapse in the same way too in this special case?

That's more difficult to see and (1) is the canonical form.

> In theory it should reduce to:
> 
>                      tg->weight * grq->avg.load_avg
>    ge->load.weight = ------------------------------
> 			grq->avg.load_avg

Yes, agreed.

> But I can see many reasons why it won't happen in practice if things
> aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
> (3) aren't in sync, or there are stale contributions to tg->load_avg
> from other cpus then (3) can return anything between 0 and tg->weight.

Just so.

> > + *
> > + * So what we do is modify our approximation (3) to approach (4) in the (near)
> > + * UP case, like:
> > + *
> > + *   ge->load.weight =
> > + *
> > + *              tg->weight * grq->load.weight
> > + *     ---------------------------------------------------         (5)
> > + *     tg->load_avg - grq->avg.load_avg + grq->load.weight
> > + *
> > + *
> > + * And that is shares_weight and is icky. In the (near) UP case it approaches
> > + * (4) while in the normal case it approaches (3). It consistently
> > + * overestimates the ge->load.weight and therefore:
> > + *
> > + *   \Sum ge->load.weight >= tg->weight
> > + *
> > + * hence icky!
> 
> IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
> tasks, you can end up with underestimating the ge->load.weight for some
> of the group entities lead to \Sum ge->load.weight < tg->weight.

Ah yes, you're right. However, if you look at the end of the series we
actually end up with using:

	max(grq->load.weight, grq->avg.load_avg)

Which I suppose makes it true again.
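
Pulling the hunk from patch #3 and the tg_load_avg_contrib discussion
elsewhere in this thread together, the shares computation ends up looking
roughly like the sketch below (reconstructed for illustration, not a
verbatim copy of the final kernel code):

	static long calc_cfs_shares(struct cfs_rq *cfs_rq)
	{
		long tg_weight, tg_shares, load, shares;
		struct task_group *tg = cfs_rq->tg;

		tg_shares = READ_ONCE(tg->shares);

		/* (5) with (3) as lower bound: max(grq->load.weight, grq->avg.load_avg) */
		load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

		tg_weight = atomic_long_read(&tg->load_avg);

		/*
		 * tg->load_avg accumulates cfs_rq->tg_load_avg_contrib; swap our
		 * (possibly stale) contribution for the up-to-date 'load' above.
		 */
		tg_weight -= cfs_rq->tg_load_avg_contrib;
		tg_weight += load;

		shares = tg_shares * load;
		if (tg_weight)
			shares /= tg_weight;

		return clamp_t(long, shares, MIN_SHARES, tg_shares);
	}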

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity()
  2017-09-29  9:04   ` Morten Rasmussen
@ 2017-09-29 11:38     ` Peter Zijlstra
  2017-09-29 13:00       ` Morten Rasmussen
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-29 11:38 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 29, 2017 at 10:04:34AM +0100, Morten Rasmussen wrote:

> > -	load = scale_load_down(cfs_rq->load.weight);
> > +	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> 
> We use cfs_rq->tg_load_avg_contrib (the filtered version of
> cfs_rq->avg.load_avg) instead of cfs_rq->avg.load_avg further down, so I
> think we should here too for consistency.
> 
> +	load = max(scale_load_down(cfs_rq->load.weight),
> +		   cfs_rq->tg_load_avg_contrib);
> 

No; we must use tg_load_avg_contrib because that is what's included in
tg_weight, but we want to add the most up-to-date value back in.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity()
  2017-09-29 11:38     ` Peter Zijlstra
@ 2017-09-29 13:00       ` Morten Rasmussen
  0 siblings, 0 replies; 57+ messages in thread
From: Morten Rasmussen @ 2017-09-29 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 29, 2017 at 01:38:53PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 29, 2017 at 10:04:34AM +0100, Morten Rasmussen wrote:
> 
> > > -	load = scale_load_down(cfs_rq->load.weight);
> > > +	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> > 
> > We use cfs_rq->tg_load_avg_contrib (the filtered version of
> > cfs_rq->avg.load_avg) instead of cfs_rq->avg.load_avg further down, so I
> > think we should here too for consistency.
> > 
> > +	load = max(scale_load_down(cfs_rq->load.weight),
> > +		   cfs_rq->tg_load_avg_contrib);
> > 
> 
> No; we must use tg_load_avg_contrib because that is what's inclded in
> tg_weight, but we want to add the most up-to-date value back in.

Agreed. Looking at it again it makes sense.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()
  2017-09-29 11:35     ` Peter Zijlstra
@ 2017-09-29 13:03       ` Morten Rasmussen
  0 siblings, 0 replies; 57+ messages in thread
From: Morten Rasmussen @ 2017-09-29 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 29, 2017 at 01:35:00PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote:
> > IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
> > tasks, you can end up with underestimating the ge->load.weight for some
> > of the group entities lead to \Sum ge->load.weight < tg->weight.
> 
> Ah yes, you're right. However, if you look at the end of the series we
> actually end up with using:
> 
> 	max(grq->load.weight, grq->avg.load_avg)
> 
> Which I suppose makes it true again.

Yes, with the next patch in the series, underestimation is no longer
possible.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 04/18] sched/fair: Remove se->load.weight from se->avg.load_sum
  2017-09-01 13:21 ` [PATCH -v2 04/18] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
@ 2017-09-29 15:26   ` Morten Rasmussen
  2017-09-29 16:39     ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Rasmussen @ 2017-09-29 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 01, 2017 at 03:21:03PM +0200, Peter Zijlstra wrote:
> +/*
> + * sched_entity:
> + *
> + *   load_sum := runnable_sum
> + *   load_avg = se_weight(se) * runnable_avg
> + *
> + * cfq_rs:

I think this should be "cfs_rq" instead.

> + *
> + *   load_sum = \Sum se_weight(se) * se->avg.load_sum
> + *   load_avg = \Sum se->avg.load_avg
> + */

I find it a bit confusing that load_sum and load_avg have different
definitions, but I guess I will discover why dropping weight from
se->avg.load_sum helps a bit later. We can't do the same for cfs_rq as
it is a \Sum of sums where we add/remove contributions when tasks
migrate.

Have we defined the relation between runnable_sum and runnable_avg in a
comment somewhere already? Otherwise it might be helpful to add. It is
in the code of course :-)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 04/18] sched/fair: Remove se->load.weight from se->avg.load_sum
  2017-09-29 15:26   ` Morten Rasmussen
@ 2017-09-29 16:39     ` Peter Zijlstra
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-09-29 16:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 29, 2017 at 04:26:10PM +0100, Morten Rasmussen wrote:
> On Fri, Sep 01, 2017 at 03:21:03PM +0200, Peter Zijlstra wrote:
> > +/*
> > + * sched_entity:
> > + *
> > + *   load_sum := runnable_sum
> > + *   load_avg = se_weight(se) * runnable_avg
> > + *
> > + * cfq_rs:
> 
> I think this should be "cfs_rq" instead.

Quite.

> > + *
> > + *   load_sum = \Sum se_weight(se) * se->avg.load_sum
> > + *   load_avg = \Sum se->avg.load_avg
> > + */
> 
> I find it a bit confusing that load_sum and load_avg have different
> definitions, but I guess I will discover why dropping weight from
> se->avg.load_sum helps a bit later.

It makes changing the weight cheaper; we don't have to divide the old
weight out and then multiply by the new weight.

Up until this patch set, we mixed the old and new weight in the sum,
with the result that a weight change takes a while to propagate.

But we want to change weight instantly. See:

  sched/fair: More accurate reweight_entity()
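
To make that concrete, a toy (non-kernel) illustration of the difference;
the types and names are made up, and 'divider' stands for the usual PELT
divisor (LOAD_AVG_MAX - 1024 + period_contrib):

	/*
	 * With the weight folded into load_sum, an instant reweight would
	 * have to divide the old weight back out of an already-decayed sum:
	 *
	 *	sa->load_sum = sa->load_sum / old_weight * new_weight;
	 *
	 * With load_sum := runnable_sum (weight free) the sum is untouched
	 * and only the average needs rescaling:
	 */
	struct toy_avg {
		unsigned long load_sum;		/* := runnable_sum */
		unsigned long load_avg;		/* = weight * runnable_avg */
	};

	static void toy_reweight(struct toy_avg *sa, unsigned long divider,
				 unsigned long new_weight)
	{
		sa->load_avg = new_weight * sa->load_sum / divider;
	}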

> We can't do the same for cfs_rq as it is a \Sum of sums where we
> add/remove contributions when tasks migrate.

Just so. The patch:

  sched/fair: Rewrite PELT migration propagation

Has text on this:

+ * Because while for entities historical weight is not important and we
+ * really only care about our future and therefore can consider a pure
+ * runnable sum, runqueues can NOT do this.
+ *
+ * We specifically want runqueues to have a load_avg that includes
+ * historical weights. Those represent the blocked load, the load we expect
+ * to (shortly) return to us. This only works by keeping the weights as
+ * integral part of the sum. We therefore cannot decompose as per (3).

> Have we defined the relation between runnable_sum and runnable_avg in a
> comment somewhere already? Otherwise it might be helpful to add. It is
> in the code of course :-)

I think this comment here is about it... But who knows. Please keep this
as a note while you read the rest of the series. If you find it's still
insufficiently addressed at the end, I'm sure we can do another patch
with moar comments still ;-)
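
For the record, the relation asked about above is essentially the usual
PELT divisor (the same 'divider' that shows up in attach_entity_load_avg()
later in the thread), modulo fixed-point scaling:

	runnable_avg ~= runnable_sum / (LOAD_AVG_MAX - 1024 + period_contrib)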

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg
  2017-09-01 13:21 ` [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
@ 2017-10-02 17:46   ` Dietmar Eggemann
  2017-10-03  8:50     ` Peter Zijlstra
  2017-10-03 12:26   ` Dietmar Eggemann
  1 sibling, 1 reply; 57+ messages in thread
From: Dietmar Eggemann @ 2017-10-02 17:46 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, morten.rasmussen,
	bsegall, yuyang.du

On 01/09/17 14:21, Peter Zijlstra wrote:
> The load balancer uses runnable_load_avg as load indicator. For
> !cgroup this is:
> 
>   runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
> 
> That is, a direct sum of all runnable tasks on that runqueue. As
> opposed to load_avg, which is a sum of all tasks on the runqueue,
> which includes a blocked component.
> 
> However, in the cgroup case, this comes apart since the group entities
> are always runnable, even if most of their constituent entities are
> blocked.
> 
> Therefore introduce a runnable_weight which for task entities is the
> same as the regular weight, but for group entities is a fraction of
> the entity weight and represents the runnable part of the group
> runqueue.
> 
> Then propagate this load through the PELT hierarchy to arrive at an
> effective runnable load average -- which we should not confuse with
> the canonical runnable load average.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h |    3 
>  kernel/sched/debug.c  |    8 ++
>  kernel/sched/fair.c   |  173 ++++++++++++++++++++++++++++++++------------------
>  kernel/sched/sched.h  |    3 
>  4 files changed, 125 insertions(+), 62 deletions(-)

[...]

> @@ -2931,31 +2949,45 @@ static long calc_cfs_shares(struct cfs_r
>  
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>  
> -static void update_cfs_shares(struct sched_entity *se)
> +/*
> + * Recomputes the group entity based on the current state of its group
> + * runqueue.
> + */
> +static void update_cfs_group(struct sched_entity *se)

update_cfs_share(s)() is still mentioned in the function header of
update_tg_load_avg() and update_cfs_rq_load_avg().

Should we rename those comments with this patch?

IMHO, the comment for update_tg_load_avg() is still true whereas the one
for update_cfs_rq_load_avg() mentions cfs_rq->avg as
cfs_rq->avg.load_avg (or cfs_rq_load_avg()) and update_cfs_group()
doesn't use it anymore. It's now used in calc_group_runnable() and
calc_group_shares() instead.

[...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg
  2017-10-02 17:46   ` Dietmar Eggemann
@ 2017-10-03  8:50     ` Peter Zijlstra
  2017-10-03  9:29       ` Dietmar Eggemann
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-03  8:50 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, morten.rasmussen, bsegall, yuyang.du

On Mon, Oct 02, 2017 at 06:46:32PM +0100, Dietmar Eggemann wrote:

> > +/*
> > + * Recomputes the group entity based on the current state of its group
> > + * runqueue.
> > + */
> > +static void update_cfs_group(struct sched_entity *se)
> 
> update_cfs_share(s)() is still mentioned in the function header of
> update_tg_load_avg() and update_cfs_rq_load_avg().
> 
> Should we rename those comments with this patch?
> 
> IMHO, the comment for update_tg_load_avg() is still true whereas the one
> for update_cfs_rq_load_avg() mentions cfs_rq->avg as
> cfs_rq->avg.load_avg (or cfs_rq_load_avg()) and update_cfs_group()
> doesn't use it anymore. It's now used in calc_group_runnable() and
> calc_group_shares() instead.
> 
> [...]

Right, so something like the below? Thinking that update_cfs_group()
immediately leads to calc_group_*() so no need to spell those out.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec01523..fee2e34812da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,7 +3333,7 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
  * differential update where we store the last value we propagated. This in
  * turn allows skipping updates if the differential is 'small'.
  *
- * Updating tg's load_avg is necessary before update_cfs_share().
+ * Updating tg's load_avg is necessary before update_cfs_group().
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
@@ -3601,7 +3601,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
  * avg. The immediate corollary is that all (fair) tasks must be attached, see
  * post_init_entity_util_avg().
  *
- * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
+ * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
  *
  * Returns true if the load decayed or we removed load.
  *

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg
  2017-10-03  8:50     ` Peter Zijlstra
@ 2017-10-03  9:29       ` Dietmar Eggemann
  0 siblings, 0 replies; 57+ messages in thread
From: Dietmar Eggemann @ 2017-10-03  9:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, morten.rasmussen, bsegall, yuyang.du

On 03/10/17 09:50, Peter Zijlstra wrote:
> On Mon, Oct 02, 2017 at 06:46:32PM +0100, Dietmar Eggemann wrote:
> 
>>> +/*
>>> + * Recomputes the group entity based on the current state of its group
>>> + * runqueue.
>>> + */
>>> +static void update_cfs_group(struct sched_entity *se)
>>
>> update_cfs_share(s)() is still mentioned in the function header of
>> update_tg_load_avg() and update_cfs_rq_load_avg().
>>
>> Should we rename those comments with this patch?
>>
>> IMHO, the comment for update_tg_load_avg() is still true whereas the one
>> for update_cfs_rq_load_avg() mentions cfs_rq->avg as
>> cfs_rq->avg.load_avg (or cfs_rq_load_avg()) and update_cfs_group()
>> doesn't use it anymore. It's now used in calc_group_runnable() and
>> calc_group_shares() instead.
>>
>> [...]
> 
> Right, so something like the below? Thinking that update_cfs_group()
> immediately leads to calc_group_*() so no need to spell those out.

Looks good to me, agreed.

> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 350dbec01523..fee2e34812da 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3333,7 +3333,7 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>   * differential update where we store the last value we propagated. This in
>   * turn allows skipping updates if the differential is 'small'.
>   *
> - * Updating tg's load_avg is necessary before update_cfs_share().
> + * Updating tg's load_avg is necessary before update_cfs_group().
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
> @@ -3601,7 +3601,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>   * avg. The immediate corollary is that all (fair) tasks must be attached, see
>   * post_init_entity_util_avg().
>   *
> - * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
> + * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
>   *
>   * Returns true if the load decayed or we removed load.
>   *
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg
  2017-09-01 13:21 ` [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
  2017-10-02 17:46   ` Dietmar Eggemann
@ 2017-10-03 12:26   ` Dietmar Eggemann
  1 sibling, 0 replies; 57+ messages in thread
From: Dietmar Eggemann @ 2017-10-03 12:26 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, morten.rasmussen,
	bsegall, yuyang.du

On 01/09/17 14:21, Peter Zijlstra wrote:

[...]

> @@ -3169,8 +3198,8 @@ ___update_load_sum(u64 now, int cpu, str
>  	 * this happens during idle_balance() which calls
>  	 * update_blocked_averages()
>  	 */

another nit-pick:

s/weight/load in comment above.

> -	if (!weight)
> -		running = 0;
> +	if (!load)
> +		runnable = running = 0;

[...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se
  2017-09-01 13:21 ` [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se Peter Zijlstra
@ 2017-10-04 19:27   ` Dietmar Eggemann
  2017-10-06 13:02     ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Dietmar Eggemann @ 2017-10-04 19:27 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, linux-kernel, tj, josef
  Cc: torvalds, vincent.guittot, efault, pjt, clm, morten.rasmussen,
	bsegall, yuyang.du

On 01/09/17 14:21, Peter Zijlstra wrote:
> The PELT _sum values are a saw-tooth function, dropping on the decay
> edge and then growing back up again during the window.
> 
> When these window-edges are not aligned between cfs_rq and se, we can
> have the situation where, for example, on dequeue, the se decays
> first.
> 
> Its _sum values will be small(er), while the cfs_rq _sum values will
> still be on their way up. Because of this, the subtraction:
> cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
> will then, once the cfs_rq reaches an edge, translate into its _avg
> value jumping up.
> 
> This is especially visible with the runnable_load bits, since they get
> added/subtracted a lot.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   45 +++++++++++++++++++++++++++++++--------------
>  1 file changed, 31 insertions(+), 14 deletions(-)

[...]

> @@ -3644,7 +3634,34 @@ update_cfs_rq_load_avg(u64 now, struct c
>   */
>  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
> +
> +	/*
> +	 * When we attach the @se to the @cfs_rq, we must align the decay
> +	 * window because without that, really weird and wonderful things can
> +	 * happen.
> +	 *
> +	 * XXX illustrate
> +	 */
>  	se->avg.last_update_time = cfs_rq->avg.last_update_time;
> +	se->avg.period_contrib = cfs_rq->avg.period_contrib;
> +
> +	/*
> +	 * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
> +	 * period_contrib. This isn't strictly correct, but since we're
> +	 * entirely outside of the PELT hierarchy, nobody cares if we truncate
> +	 * _sum a little.
> +	 */
> +	se->avg.util_sum = se->avg.util_avg * divider;
> +
> +	se->avg.load_sum = divider;
> +	if (se_weight(se)) {
> +		se->avg.load_sum =
> +			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
> +	}

Can scale_load_down(se->load.weight) ever become 0 here?

[...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se
  2017-10-04 19:27   ` Dietmar Eggemann
@ 2017-10-06 13:02     ` Peter Zijlstra
  2017-10-09 12:15       ` Dietmar Eggemann
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-06 13:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, morten.rasmussen, bsegall, yuyang.du

On Wed, Oct 04, 2017 at 08:27:01PM +0100, Dietmar Eggemann wrote:
> On 01/09/17 14:21, Peter Zijlstra wrote:
> > The PELT _sum values are a saw-tooth function, dropping on the decay
> > edge and then growing back up again during the window.
> > 
> > When these window-edges are not aligned between cfs_rq and se, we can
> > have the situation where, for example, on dequeue, the se decays
> > first.
> > 
> > Its _sum values will be small(er), while the cfs_rq _sum values will
> > still be on their way up. Because of this, the subtraction:
> > cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
> > will then, once the cfs_rq reaches an edge, translate into its _avg
> > value jumping up.
> > 
> > This is especially visible with the runnable_load bits, since they get
> > added/subtracted a lot.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/fair.c |   45 +++++++++++++++++++++++++++++++--------------
> >  1 file changed, 31 insertions(+), 14 deletions(-)
> 
> [...]
> 
> > @@ -3644,7 +3634,34 @@ update_cfs_rq_load_avg(u64 now, struct c
> >   */
> >  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
> > +
> > +	/*
> > +	 * When we attach the @se to the @cfs_rq, we must align the decay
> > +	 * window because without that, really weird and wonderful things can
> > +	 * happen.
> > +	 *
> > +	 * XXX illustrate
> > +	 */
> >  	se->avg.last_update_time = cfs_rq->avg.last_update_time;
> > +	se->avg.period_contrib = cfs_rq->avg.period_contrib;
> > +
> > +	/*
> > +	 * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
> > +	 * period_contrib. This isn't strictly correct, but since we're
> > +	 * entirely outside of the PELT hierarchy, nobody cares if we truncate
> > +	 * _sum a little.
> > +	 */
> > +	se->avg.util_sum = se->avg.util_avg * divider;
> > +
> > +	se->avg.load_sum = divider;
> > +	if (se_weight(se)) {
> > +		se->avg.load_sum =
> > +			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
> > +	}
> 
> Can scale_load_down(se->load.weight) ever become 0 here?

Yeah, don't see why not.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-09-01 13:21 ` [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
@ 2017-10-09  8:08   ` Morten Rasmussen
  2017-10-09  9:45     ` Peter Zijlstra
  2017-10-09 15:03   ` Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Morten Rasmussen @ 2017-10-09  8:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Fri, Sep 01, 2017 at 03:21:11PM +0200, Peter Zijlstra wrote:
> When an entity migrates in (or out) of a runqueue, we need to add (or
> remove) its contribution from the entire PELT hierarchy, because even
> non-runnable entities are included in the load average sums.
> 
> In order to do this we have some propagation logic that updates the
> PELT tree, however the way it 'propagates' the runnable (or load)
> change is (more or less):
> 
>                      tg->weight * grq->avg.load_avg
>   ge->avg.load_avg = ------------------------------
>                                tg->load_avg
> 
> But that is the expression for ge->weight, and per the definition of
> load_avg:
> 
>   ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

You may want to replace "ge->weight" by "ge->load.weight" to be consistent
with the code comments introduced by the patch.

> That destroys the runnable_avg (by setting it to 1) we wanted to
> propagate.
> 
> Instead directly propagate runnable_sum.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/debug.c |    2 
>  kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
>  kernel/sched/sched.h |    9 +-
>  3 files changed, 112 insertions(+), 85 deletions(-)
> 
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
>  			cfs_rq->removed.load_avg);
>  	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
>  			cfs_rq->removed.util_avg);
> +	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> +			cfs_rq->removed.runnable_sum);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
>  			cfs_rq->tg_load_avg_contrib);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
>  	se->avg.last_update_time = n_last_update_time;
>  }
>  
> -/* Take into account change of utilization of a child task group */
> +
> +/*
> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> + * propagate its contribution. The key to this propagation is the invariant
> + * that for each group:
> + *
> + *   ge->avg == grq->avg						(1)
> + *
> + * _IFF_ we look at the pure running and runnable sums. Because they
> + * represent the very same entity, just at different points in the hierarchy.
> + *
> + *
> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> + * simply copies the running sum over.
> + *
> + * However, update_tg_cfs_runnable() is more complex. So we have:
> + *
> + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
> + *
> + * And since, like util, the runnable part should be directly transferable,
> + * the following would _appear_ to be the straight forward approach:
> + *
> + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)

Should it be grq->avg.runnable_avg instead of running_avg?

cfs_rq->avg.load_avg has been defined previous (in patch 2 I think) to
be:

	load_avg = \Sum se->avg.load_avg
		 = \Sum se->load.weight * se->avg.runnable_avg

That sum will increase when ge is runnable regardless of whether it is
running or not. So, I think it has to be runnable_avg to make sense?

> + *
> + * And per (1) we have:
> + *
> + *   ge->avg.running_avg == grq->avg.running_avg

You just said further up that (1) only applies to running and runnable
sums? These are averages, so I think this is invalid use of (1). But
maybe that is part of your point about (4) being wrong?

I'm still trying to get my head around the remaining bits, but it sort
of depends if I understood the above bits correctly :)

Morten

> + *
> + * Which gives:
> + *
> + *                      ge->load.weight * grq->avg.load_avg
> + *   ge->avg.load_avg = -----------------------------------		(4)
> + *                               grq->load.weight
> + *
> + * Except that is wrong!
> + *
> + * Because while for entities historical weight is not important and we
> + * really only care about our future and therefore can consider a pure
> + * runnable sum, runqueues can NOT do this.
> + *
> + * We specifically want runqueues to have a load_avg that includes
> + * historical weights. Those represent the blocked load, the load we expect
> + * to (shortly) return to us. This only works by keeping the weights as
> + * integral part of the sum. We therefore cannot decompose as per (3).
> + *
> + * OK, so what then?
> + *
> + *
> + * Another way to look at things is:
> + *
> + *   grq->avg.load_avg = \Sum se->avg.load_avg
> + *
> + * Therefore, per (2):
> + *
> + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + *
> + * And the very thing we're propagating is a change in that sum (someone
> + * joined/left). So we can easily know the runnable change, which would be, per
> + * (2) the already tracked se->load_avg divided by the corresponding
> + * se->weight.
> + *
> + * Basically (4) but in differential form:
> + *
> + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> + *								   (5)
> + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> + */
> +

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-09  8:08   ` Morten Rasmussen
@ 2017-10-09  9:45     ` Peter Zijlstra
  2017-10-18 12:45       ` Morten Rasmussen
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-09  9:45 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Mon, Oct 09, 2017 at 09:08:57AM +0100, Morten Rasmussen wrote:
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
> >  			cfs_rq->removed.load_avg);
> >  	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
> >  			cfs_rq->removed.util_avg);
> > +	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> > +			cfs_rq->removed.runnable_sum);
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
> >  			cfs_rq->tg_load_avg_contrib);
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
> >  	se->avg.last_update_time = n_last_update_time;
> >  }
> >  
> > -/* Take into account change of utilization of a child task group */
> > +
> > +/*
> > + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> > + * propagate its contribution. The key to this propagation is the invariant
> > + * that for each group:
> > + *
> > + *   ge->avg == grq->avg						(1)
> > + *
> > + * _IFF_ we look at the pure running and runnable sums. Because they
> > + * represent the very same entity, just at different points in the hierarchy.
> > + *
> > + *
> > + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> > + * simply copies the running sum over.
> > + *
> > + * However, update_tg_cfs_runnable() is more complex. So we have:
> > + *
> > + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
> > + *
> > + * And since, like util, the runnable part should be directly transferable,
> > + * the following would _appear_ to be the straight forward approach:
> > + *
> > + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
> 
> Should it be grq->avg.runnable_avg instead of running_avg?

Yes, very much so. Typing is hard. Otherwise (3) would not follow from (2)
either.

> cfs_rq->avg.load_avg has been defined previous (in patch 2 I think) to
> be:
> 
> 	load_avg = \Sum se->avg.load_avg
> 		 = \Sum se->load.weight * se->avg.runnable_avg
> 
> That sum will increase when ge is runnable regardless of whether it is
> running or not. So, I think it has to be runnable_avg to make sense?

Ack.

> > + *
> > + * And per (1) we have:
> > + *
> > + *   ge->avg.running_avg == grq->avg.running_avg
> 
> You just said further up that (1) only applies to running and runnable
> sums? These are averages, so I think this is invalid use of (1). But
> maybe that is part of your point about (4) being wrong?
> 
> I'm still trying to get my head around the remaining bits, but it sort
> of depends if I understood the above bits correctly :)

So while true, the thing we're looking for is indeed runnable_avg.

> > + *
> > + * Which gives:
> > + *
> > + *                      ge->load.weight * grq->avg.load_avg
> > + *   ge->avg.load_avg = -----------------------------------		(4)
> > + *                               grq->load.weight
> > + *
> > + * Except that is wrong!
> > + *
> > + * Because while for entities historical weight is not important and we
> > + * really only care about our future and therefore can consider a pure
> > + * runnable sum, runqueues can NOT do this.
> > + *
> > + * We specifically want runqueues to have a load_avg that includes
> > + * historical weights. Those represent the blocked load, the load we expect
> > + * to (shortly) return to us. This only works by keeping the weights as
> > + * integral part of the sum. We therefore cannot decompose as per (3).
> > + *
> > + * OK, so what then?

And as the text above suggests, we cannot decompose because
grq->avg.load_avg contains the blocked weight, which is not included in
grq->load.weight, and thus things come apart.
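
Concretely, take for instance a runqueue with grq->load.weight = 1024 whose
average still carries 512 worth of blocked contribution, so
grq->avg.load_avg = 1536. Then (4) would yield:

	                   ge->load.weight * 1536
	ge->avg.load_avg = ----------------------- = 1.5 * ge->load.weight
	                            1024

which by (2) implies a runnable fraction of 1.5 for the group entity,
something a single entity can never reach.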

> > + * Another way to look at things is:
> > + *
> > + *   grq->avg.load_avg = \Sum se->avg.load_avg
> > + *
> > + * Therefore, per (2):
> > + *
> > + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> > + *
> > + * And the very thing we're propagating is a change in that sum (someone
> > + * joined/left). So we can easily know the runnable change, which would be, per
> > + * (2) the already tracked se->load_avg divided by the corresponding
> > + * se->weight.
> > + *
> > + * Basically (4) but in differential form:
> > + *
> > + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> > + *								   (5)
> > + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)

And this all has runnable again, and so should make sense.

Combined with an earlier bit, noted by Dietmar, I now have the below
delta.


---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9e520b6923e..ba879c42bddd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,7 +3333,7 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
  * differential update where we store the last value we propagated. This in
  * turn allows skipping updates if the differential is 'small'.
  *
- * Updating tg's load_avg is necessary before update_cfs_share().
+ * Updating tg's load_avg is necessary before update_cfs_group().
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
@@ -3422,11 +3422,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * And since, like util, the runnable part should be directly transferable,
  * the following would _appear_ to be the straight forward approach:
  *
- *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
  *
  * And per (1) we have:
  *
- *   ge->avg.running_avg == grq->avg.running_avg
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
  *
  * Which gives:
  *
@@ -3601,7 +3601,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
  * avg. The immediate corollary is that all (fair) tasks must be attached, see
  * post_init_entity_util_avg().
  *
- * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
+ * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
  *
  * Returns true if the load decayed or we removed load.
  *

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se
  2017-10-06 13:02     ` Peter Zijlstra
@ 2017-10-09 12:15       ` Dietmar Eggemann
  2017-10-09 12:19         ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Dietmar Eggemann @ 2017-10-09 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, morten.rasmussen, bsegall, yuyang.du

On 06/10/17 14:02, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 08:27:01PM +0100, Dietmar Eggemann wrote:
>> On 01/09/17 14:21, Peter Zijlstra wrote:
>>> The PELT _sum values are a saw-tooth function, dropping on the decay
>>> edge and then growing back up again during the window.
>>>
>>> When these window-edges are not aligned between cfs_rq and se, we can
>>> have the situation where, for example, on dequeue, the se decays
>>> first.
>>>
>>> Its _sum values will be small(er), while the cfs_rq _sum values will
>>> still be on their way up. Because of this, the subtraction:
>>> cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
>>> will then, once the cfs_rq reaches an edge, translate into its _avg
>>> value jumping up.
>>>
>>> This is especially visible with the runnable_load bits, since they get
>>> added/subtracted a lot.
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> ---
>>>  kernel/sched/fair.c |   45 +++++++++++++++++++++++++++++++--------------
>>>  1 file changed, 31 insertions(+), 14 deletions(-)
>>
>> [...]
>>
>>> @@ -3644,7 +3634,34 @@ update_cfs_rq_load_avg(u64 now, struct c
>>>   */
>>>  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>  {
>>> +	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
>>> +
>>> +	/*
>>> +	 * When we attach the @se to the @cfs_rq, we must align the decay
>>> +	 * window because without that, really weird and wonderful things can
>>> +	 * happen.
>>> +	 *
>>> +	 * XXX illustrate
>>> +	 */
>>>  	se->avg.last_update_time = cfs_rq->avg.last_update_time;
>>> +	se->avg.period_contrib = cfs_rq->avg.period_contrib;
>>> +
>>> +	/*
>>> +	 * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
>>> +	 * period_contrib. This isn't strictly correct, but since we're
>>> +	 * entirely outside of the PELT hierarchy, nobody cares if we truncate
>>> +	 * _sum a little.
>>> +	 */
>>> +	se->avg.util_sum = se->avg.util_avg * divider;
>>> +
>>> +	se->avg.load_sum = divider;
>>> +	if (se_weight(se)) {
>>> +		se->avg.load_sum =
>>> +			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
>>> +	}
>>
>> Can scale_load_down(se->load.weight) ever become 0 here?
> 
> Yeah, don't see why not.

Tasks should be safe since for them scale_load_down(se->load.weight)
should be >= 15 when they get attached.

task groups get attached via attach_entity_cfs_rq()

# mkdir /sys/fs/cgroup/foo

[   63.885333] [<ffff00000811a558>] attach_entity_load_avg+0x138/0x558
[   63.885345] [<ffff0000081208e0>] attach_entity_cfs_rq+0x298/0x998
[   63.885357] [<ffff00000812d660>] online_fair_sched_group+0x70/0xb0
[   63.885369] [<ffff0000081182d4>] sched_online_group+0x94/0xb0
[   63.885381] [<ffff000008118318>] cpu_cgroup_css_online+0x28/0x38
[   63.885393] [<ffff0000081a5e80>] online_css+0x38/0xd0
[   63.885406] [<ffff0000081acae0>] cgroup_apply_control_enable+0x260
[   63.885418] [<ffff0000081afc54>] cgroup_mkdir+0x314/0x4e8

mkdir-2501  [004]    63.689455: bprint:
attach_entity_load_avg: cpu=1 se=0xffff800072fffc00 load.weight=1048576

If I apply the smallest shares possible to that task group, it is
already attached.

# echo 2 > /sys/fs/cgroup/foo/cpu.shares

create_cpu_cgro-2500  [003]    64.591878: bprint:
update_cfs_group: cpu=1 se=0xffff800072fffc00 tg->shares=2048 shares=0
load=0 tg_weight=0
 create_cpu_cgro-2500  [003]    64.591904: bprint:
reweight_entity: cpu=1 se=0xffff800072fffc00 se->load.weight=2

I can't see right now how a task group can get attached with
se->load.weight < 1024 (32 bit)/1048576 (64 bit)? Am I missing something?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se
  2017-10-09 12:15       ` Dietmar Eggemann
@ 2017-10-09 12:19         ` Peter Zijlstra
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-09 12:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, morten.rasmussen, bsegall, yuyang.du

On Mon, Oct 09, 2017 at 01:15:04PM +0100, Dietmar Eggemann wrote:
> I can't see right now how a task group can get attached with
> se->load.weight < 1024 (32 bit)/1048576 (64 bit)? Do I miss something?

Ah, I think I made a think-o, I think I considered migration for group
entities, but those don't migrate.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-09-01 13:21 ` [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
  2017-10-09  8:08   ` Morten Rasmussen
@ 2017-10-09 15:03   ` Vincent Guittot
  2017-10-09 15:29     ` Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-10-09 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

Hi Peter,

On 1 September 2017 at 15:21, Peter Zijlstra <peterz@infradead.org> wrote:
> When an entity migrates in (or out) of a runqueue, we need to add (or
> remove) its contribution from the entire PELT hierarchy, because even
> non-runnable entities are included in the load average sums.
>
> In order to do this we have some propagation logic that updates the
> PELT tree, however the way it 'propagates' the runnable (or load)
> change is (more or less):
>
>                      tg->weight * grq->avg.load_avg
>   ge->avg.load_avg = ------------------------------
>                                tg->load_avg
>
> But that is the expression for ge->weight, and per the definition of
> load_avg:
>
>   ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
>
> That destroys the runnable_avg (by setting it to 1) we wanted to
> propagate.
>
> Instead directly propagate runnable_sum.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/debug.c |    2
>  kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
>  kernel/sched/sched.h |    9 +-
>  3 files changed, 112 insertions(+), 85 deletions(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
>                         cfs_rq->removed.load_avg);
>         SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
>                         cfs_rq->removed.util_avg);
> +       SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> +                       cfs_rq->removed.runnable_sum);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
>                         cfs_rq->tg_load_avg_contrib);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
>         se->avg.last_update_time = n_last_update_time;
>  }
>
> -/* Take into account change of utilization of a child task group */
> +
> +/*
> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> + * propagate its contribution. The key to this propagation is the invariant
> + * that for each group:
> + *
> + *   ge->avg == grq->avg                                               (1)
> + *
> + * _IFF_ we look at the pure running and runnable sums. Because they
> + * represent the very same entity, just at different points in the hierarchy.

I agree for the running part because only one entity can be running
but I'm not sure for the pure runnable sum because we can have several
runnable tasks in a cfs_rq but only one runnable group entity to
reflect them
or I misunderstand (1)

As an example, we have 2 always running tasks TA and TB so their
load_sum is LOAD_AVG_MAX for each task
The grq->avg.load_sum = \Sum se->avg.load_sum = 2*LOAD_AVG_MAX
But
the ge->avg.load_sum will be only LOAD_AVG_MAX

So if we apply directly the d(TB->avg.load_sum) on the group hierarchy
and on ge->avg.load_sum in particular, the latter decreases to 0
whereas it should decrease only by half

I have been able to see this wrong behavior with a rt-app json file

so I think that we should instead remove only

delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum

We don't have grq->avg.load_sum but we can have a rough estimate with
grq->avg.load_avg/grq->weight



> + *
> + *
> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> + * simply copies the running sum over.
> + *
> + * However, update_tg_cfs_runnable() is more complex. So we have:
> + *
> + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg         (2)
> + *
> + * And since, like util, the runnable part should be directly transferable,
> + * the following would _appear_ to be the straight forward approach:
> + *
> + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg       (3)
> + *
> + * And per (1) we have:
> + *
> + *   ge->avg.running_avg == grq->avg.running_avg
> + *
> + * Which gives:
> + *
> + *                      ge->load.weight * grq->avg.load_avg
> + *   ge->avg.load_avg = -----------------------------------            (4)
> + *                               grq->load.weight
> + *
> + * Except that is wrong!
> + *
> + * Because while for entities historical weight is not important and we
> + * really only care about our future and therefore can consider a pure
> + * runnable sum, runqueues can NOT do this.
> + *
> + * We specifically want runqueues to have a load_avg that includes
> + * historical weights. Those represent the blocked load, the load we expect
> + * to (shortly) return to us. This only works by keeping the weights as
> + * integral part of the sum. We therefore cannot decompose as per (3).
> + *
> + * OK, so what then?
> + *
> + *
> + * Another way to look at things is:
> + *
> + *   grq->avg.load_avg = \Sum se->avg.load_avg
> + *
> + * Therefore, per (2):
> + *
> + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + *
> + * And the very thing we're propagating is a change in that sum (someone
> + * joined/left). So we can easily know the runnable change, which would be, per
> + * (2) the already tracked se->load_avg divided by the corresponding
> + * se->weight.
> + *
> + * Basically (4) but in differential form:
> + *
> + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> + *                                                                (5)
> + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> + */
> +

[snip]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-09 15:03   ` Vincent Guittot
@ 2017-10-09 15:29     ` Vincent Guittot
  2017-10-10  7:29       ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-10-09 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> Hi Peter,
>
> On 1 September 2017 at 15:21, Peter Zijlstra <peterz@infradead.org> wrote:
>> When an entity migrates in (or out) of a runqueue, we need to add (or
>> remove) its contribution from the entire PELT hierarchy, because even
>> non-runnable entities are included in the load average sums.
>>
>> In order to do this we have some propagation logic that updates the
>> PELT tree, however the way it 'propagates' the runnable (or load)
>> change is (more or less):
>>
>>                      tg->weight * grq->avg.load_avg
>>   ge->avg.load_avg = ------------------------------
>>                                tg->load_avg
>>
>> But that is the expression for ge->weight, and per the definition of
>> load_avg:
>>
>>   ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
>>
>> That destroys the runnable_avg (by setting it to 1) we wanted to
>> propagate.
>>
>> Instead directly propagate runnable_sum.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>  kernel/sched/debug.c |    2
>>  kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
>>  kernel/sched/sched.h |    9 +-
>>  3 files changed, 112 insertions(+), 85 deletions(-)
>>
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
>>                         cfs_rq->removed.load_avg);
>>         SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
>>                         cfs_rq->removed.util_avg);
>> +       SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
>> +                       cfs_rq->removed.runnable_sum);
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>         SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
>>                         cfs_rq->tg_load_avg_contrib);
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
>>         se->avg.last_update_time = n_last_update_time;
>>  }
>>
>> -/* Take into account change of utilization of a child task group */
>> +
>> +/*
>> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
>> + * propagate its contribution. The key to this propagation is the invariant
>> + * that for each group:
>> + *
>> + *   ge->avg == grq->avg                                               (1)
>> + *
>> + * _IFF_ we look at the pure running and runnable sums. Because they
>> + * represent the very same entity, just at different points in the hierarchy.
>
> I agree for the running part because only one entity can be running
> but I'm not sure for the pure runnable sum because we can have several
> runnable tasks in a cfs_rq but only one runnable group entity to
> reflect them
> or I misunderstand (1)
>
> As an example, we have 2 always running tasks TA and TB so their
> load_sum is LOAD_AVG_MAX for each task
> The grq->avg.load_sum = \Sum se->avg.load_sum = 2*LOAD_AVG_MAX
> But
> the ge->avg.load_sum will be only LOAD_AVG_MAX
>
> So if we apply directly the d(TB->avg.load_sum) on the group hierarchy
> and on ge->avg.load_sum in particular, the latter decreases to 0
> whereas it should decrease only by half
>
> I have been able to see this wrong behavior with a rt-app json file
>
> so I think that we should instead remove only
>
> delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum

delta = se->avg.load_sum / (grq->avg.load_sum+se->avg.load_sum) *
ge->avg.load_sum

as the se has already been detached
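
A quick back-of-the-envelope with the TA/TB numbers above, once TB has been
detached from the grq (so grq->avg.load_sum = LOAD_AVG_MAX):

  straight:     ge->avg.load_sum -= LOAD_AVG_MAX
                -> ge->avg.load_sum drops from LOAD_AVG_MAX to 0

  proportional: delta = LOAD_AVG_MAX / (LOAD_AVG_MAX + LOAD_AVG_MAX) * LOAD_AVG_MAX
                      = LOAD_AVG_MAX / 2
                -> ge->avg.load_sum drops from LOAD_AVG_MAX to LOAD_AVG_MAX / 2

i.e. the "decrease only by half" behaviour argued for above.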

> We don't have grq->avg.load_sum but we can have a rough estimate with
> grq->avg.load_avg/grq->weight
>
>
>
>> + *
>> + *
>> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
>> + * simply copies the running sum over.
>> + *
>> + * However, update_tg_cfs_runnable() is more complex. So we have:
>> + *
>> + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg         (2)
>> + *
>> + * And since, like util, the runnable part should be directly transferable,
>> + * the following would _appear_ to be the straight forward approach:
>> + *
>> + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg       (3)
>> + *
>> + * And per (1) we have:
>> + *
>> + *   ge->avg.running_avg == grq->avg.running_avg
>> + *
>> + * Which gives:
>> + *
>> + *                      ge->load.weight * grq->avg.load_avg
>> + *   ge->avg.load_avg = -----------------------------------            (4)
>> + *                               grq->load.weight
>> + *
>> + * Except that is wrong!
>> + *
>> + * Because while for entities historical weight is not important and we
>> + * really only care about our future and therefore can consider a pure
>> + * runnable sum, runqueues can NOT do this.
>> + *
>> + * We specifically want runqueues to have a load_avg that includes
>> + * historical weights. Those represent the blocked load, the load we expect
>> + * to (shortly) return to us. This only works by keeping the weights as
>> + * integral part of the sum. We therefore cannot decompose as per (3).
>> + *
>> + * OK, so what then?
>> + *
>> + *
>> + * Another way to look at things is:
>> + *
>> + *   grq->avg.load_avg = \Sum se->avg.load_avg
>> + *
>> + * Therefore, per (2):
>> + *
>> + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
>> + *
>> + * And the very thing we're propagating is a change in that sum (someone
>> + * joined/left). So we can easily know the runnable change, which would be, per
>> + * (2) the already tracked se->load_avg divided by the corresponding
>> + * se->weight.
>> + *
>> + * Basically (4) but in differential form:
>> + *
>> + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
>> + *                                                                (5)
>> + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
>> + */
>> +
>
> [snip]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-09 15:29     ` Vincent Guittot
@ 2017-10-10  7:29       ` Peter Zijlstra
  2017-10-10  7:44         ` Vincent Guittot
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-10  7:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On Mon, Oct 09, 2017 at 05:29:04PM +0200, Vincent Guittot wrote:
> On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > On 1 September 2017 at 15:21, Peter Zijlstra <peterz@infradead.org> wrote:

> >> +/*
> >> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> >> + * propagate its contribution. The key to this propagation is the invariant
> >> + * that for each group:
> >> + *
> >> + *   ge->avg == grq->avg                                               (1)
> >> + *
> >> + * _IFF_ we look at the pure running and runnable sums. Because they
> >> + * represent the very same entity, just at different points in the hierarchy.
> >
> > I agree for the running part because only one entity can be running
> > but i'm not sure for the pure runnable sum because we can have
> > several runnable task in a cfs_rq but only one runnable group entity
> > to reflect them or I misunderstand (1)

The idea is that they (ge and grq) are the _same_ entity, just at
different levels in the hierarchy. If the grq is runnable, it is through
the ge.

As a whole, they don't care how many runnable tasks there are.
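
(A small worked illustration of that invariant, ignoring throttling and the
PELT window alignment: say the grq has task A runnable over [0, 2] and task B
runnable over [1, 3]. The grq is non-empty -- and hence the ge is enqueued on
its parent cfs_rq -- over exactly [0, 3], so ge and grq accrue runnable time
over the very same periods, whether one or two tasks contribute to it.)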

> > As an example, we have 2 always running task TA and TB so their
> > load_sum is LOAD_AVG_MAX for each task The grq->avg.load_sum = \Sum
> > se->avg.load_sum = 2*LOAD_AVG_MAX But the ge->avg.load_sum will be
> > only LOAD_AVG_MAX
> >
> > So If we apply directly the d(TB->avg.load_sum) on the group hierachy
> > and on ge->avg.load_sum in particular, the latter decreases to 0
> > whereas it should decrease only by half
> >
> > I have been able to see this wrong behavior with a rt-app json file
> >
> > so I think that we should instead remove only
> >
> > delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
> 
> delta = se->avg.load_sum / (grq->avg.load_sum+se->avg.load_sum) *
> ge->avg.load_sum
> 
> as the se has already been detached
> 
> > We don't have grq->avg.load_sum but we can have a rough estimate with
> > grq->avg.load_avg/grq->weight

Hurm, I think I see what you're saying, let me ponder this more.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-10  7:29       ` Peter Zijlstra
@ 2017-10-10  7:44         ` Vincent Guittot
  2017-10-13 15:22           ` Vincent Guittot
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-10-10  7:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On 10 October 2017 at 09:29, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 09, 2017 at 05:29:04PM +0200, Vincent Guittot wrote:
>> On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> > On 1 September 2017 at 15:21, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> >> +/*
>> >> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
>> >> + * propagate its contribution. The key to this propagation is the invariant
>> >> + * that for each group:
>> >> + *
>> >> + *   ge->avg == grq->avg                                               (1)
>> >> + *
>> >> + * _IFF_ we look at the pure running and runnable sums. Because they
>> >> + * represent the very same entity, just at different points in the hierarchy.
>> >
>> > I agree for the running part because only one entity can be running
>> > but i'm not sure for the pure runnable sum because we can have
>> > several runnable task in a cfs_rq but only one runnable group entity
>> > to reflect them or I misunderstand (1)
>
> The idea is that they (ge and grq) are the _same_ entity, just at
> different levels in the hierarchy. If the grq is runnable, it is through
> the ge.
>
> As a whole, they don't care how many runnable tasks there are.

Ok, so I agree with this point that both grq and ge runnable follow the
same trend.

My point is that even if the same changes apply to both grq and ge, we
can't directly apply changes of grq's runnable to ge's runnable.

>
>> > As an example, we have 2 always running task TA and TB so their
>> > load_sum is LOAD_AVG_MAX for each task The grq->avg.load_sum = \Sum
>> > se->avg.load_sum = 2*LOAD_AVG_MAX But the ge->avg.load_sum will be
>> > only LOAD_AVG_MAX
>> >
>> > So If we apply directly the d(TB->avg.load_sum) on the group hierachy
>> > and on ge->avg.load_sum in particular, the latter decreases to 0
>> > whereas it should decrease only by half
>> >
>> > I have been able to see this wrong behavior with a rt-app json file
>> >
>> > so I think that we should instead remove only
>> >
>> > delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
>>
>> delta = se->avg.load_sum / (grq->avg.load_sum+se->avg.load_sum) *
>> ge->avg.load_sum
>>
>> as the se has already been detached
>>
>> > We don't have grq->avg.load_sum but we can have a rough estimate with
>> > grq->avg.load_avg/grq->weight
>
> Hurm, I think I see what you're saying, let me ponder this more.

The formula above was an example for detaching, but it doesn't
apply to all use cases. We need a more generic way to propagate
runnable changes.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-10  7:44         ` Vincent Guittot
@ 2017-10-13 15:22           ` Vincent Guittot
  2017-10-13 20:41             ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-10-13 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

Hi Peter,

On Tuesday 10 Oct 2017 at 09:44:53 (+0200), Vincent Guittot wrote:
> On 10 October 2017 at 09:29, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Oct 09, 2017 at 05:29:04PM +0200, Vincent Guittot wrote:
> >> On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >> > On 1 September 2017 at 15:21, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> >> +/*
> >> >> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> >> >> + * propagate its contribution. The key to this propagation is the invariant
> >> >> + * that for each group:
> >> >> + *
> >> >> + *   ge->avg == grq->avg                                               (1)
> >> >> + *
> >> >> + * _IFF_ we look at the pure running and runnable sums. Because they
> >> >> + * represent the very same entity, just at different points in the hierarchy.
> >> >
> >> > I agree for the running part because only one entity can be running
> >> > but i'm not sure for the pure runnable sum because we can have
> >> > several runnable task in a cfs_rq but only one runnable group entity
> >> > to reflect them or I misunderstand (1)
> >
> > The idea is that they (ge and grq) are the _same_ entity, just at
> > different levels in the hierarchy. If the grq is runnable, it is through
> > the ge.
> >
> > As a whole, they don't care how many runnable tasks there are.
> 
> Ok so I agree with this point that both grp and ge runnable follow the
> same trend
> 
> my point is that even if same changes apply on both grp and ge, we
> can't directly apply changes of grp's runnable on ge's runnable
> 
> >
> >> > As an example, we have 2 always running task TA and TB so their
> >> > load_sum is LOAD_AVG_MAX for each task The grq->avg.load_sum = \Sum
> >> > se->avg.load_sum = 2*LOAD_AVG_MAX But the ge->avg.load_sum will be
> >> > only LOAD_AVG_MAX
> >> >
> >> > So If we apply directly the d(TB->avg.load_sum) on the group hierachy
> >> > and on ge->avg.load_sum in particular, the latter decreases to 0
> >> > whereas it should decrease only by half
> >> >
> >> > I have been able to see this wrong behavior with a rt-app json file
> >> >
> >> > so I think that we should instead remove only
> >> >
> >> > delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
> >>
> >> delta = se->avg.load_sum / (grq->avg.load_sum+se->avg.load_sum) *
> >> ge->avg.load_sum
> >>
> >> as the se has already been detached
> >>
> >> > We don't have grq->avg.load_sum but we can have a rough estimate with
> >> > grq->avg.load_avg/grq->weight
> >
> > Hurm, I think I see what you're saying, let me ponder this more.
> 
> The formula above works was an example for detaching but it doesn't
> apply for all use case. we need a more generic way to propagate
> runnable changes

I have studied a bit more how to improve the propagation formula and the
changes below do the job for the UCs (use cases) that I have tested.

Unlike running, we can't directly propagate the runnable through the hierarchy
when we migrate a task. Instead, we must ensure that we will not
over/underestimate the impact of the migration thanks to several rules:
-ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
-ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once scaled to
the same range)
-we can't directly propagate a negative delta of runnable_sum because part of
this runnable time can be "shared" with other sched_entities and stays on the
gcfs_rq. Instead, we can't estimate the new runnable_sum of the gcfs_rq with
the formula: gcfs_rq's runnable sum = gcfs_rq's load_sum / gcfs_rq's weight.
-ge->avg.runnable_sum can't increase when we detach a task.

---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec0..a063b048 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3489,33 +3489,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	long runnable_load_avg, delta_avg, load_avg;
+	s64 runnable_load_sum, delta_sum, load_sum = 0;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	/*
+	 * Get a rough estimate of gcfs_rq's runnable
+	 * This is a low guess as it assumes that tasks are equally
+	 * runnable which is not true but we can't really do better
+	 */
+	if (scale_load_down(gcfs_rq->load.weight))
+		load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+
+	/*
+	 * Propating a delta of runnable is not just adding it to ge's
+	 * runnable_sum:
+	 * - Adding a delta runnable can't make ge's runnable_sum higher than
+	 *   LOAD_AVG_MAX
+	 * - We can't directly remove a delta of runnable from
+	 *   ge's runnable_sum but we can only guest estimate what runnable
+	 *   will become thanks to few simple rules:
+	 *   - gcfs_rq's runnable is a good estimate
+	 *   - ge's runnable_sum can't increase when we remove runnable
+	 *   - runnable_sum can't be lower than running_sum
+	 */
+	if (runnable_sum >= 0) {
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
+	} else
+		runnable_sum = min(se->avg.load_sum, load_sum);
+
+	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-13 15:22           ` Vincent Guittot
@ 2017-10-13 20:41             ` Peter Zijlstra
  2017-10-15 12:01               ` Vincent Guittot
  2017-10-16 13:55               ` Vincent Guittot
  0 siblings, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-13 20:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On Fri, Oct 13, 2017 at 05:22:54PM +0200, Vincent Guittot wrote:
> 
> I have studied a bit more how to improve the propagation formula and the
> changes below is doing the job for the UCs that I have tested.
> 
> Unlike running, we can't directly propagate the runnable through hierarchy
> when we migrate a task. Instead we must ensure that we will not
> over/underestimate the impact of the migration thanks to several rules:
>  - ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
>  - ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once scaled to
>    the same range)
>  - we can't directly propagate a negative delta of runnable_sum because part of
>    this runnable time can be "shared" with others sched_entities and stays on the
>    gcfs_rq.

Right, that's about how far I got.

>  - ge->avg.runnable_sum can't increase when we detach a task.

Yeah, that would be fairly broken.

> Instead, we can't estimate the new runnable_sum of the gcfs_rq with

 s/can't/can/ ?

> the formula:
>
>   gcfs_rq's runnable sum = gcfs_rq's load_sum / gcfs_rq's weight.

That might be the best we can do.. it's wrong, but then it's less wrong
than what we have now. The comments can be much improved though. Not to
mention that the big comment on top needs a little help.
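
(A rough worked illustration of where that approximation sits, with purely
illustrative numbers: a gcfs_rq holding two always-runnable nice-0 tasks has
a scaled-down load.weight of 2048 and avg.load_sum of roughly
2048 * LOAD_AVG_MAX, so load_sum / weight gives about LOAD_AVG_MAX per task,
which happens to be exact because the tasks really are equally runnable.
With tasks of unequal runnability the same division hands every task the
same averaged value, over-estimating the mostly-blocked ones and
under-estimating the busy ones -- that is the equal-runnability assumption.)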

> ---
>  kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 45 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 350dbec0..a063b048 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3489,33 +3489,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
>  static inline void
>  update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
>  {
> +	long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
> +	long runnable_load_avg, delta_avg, load_avg;
> +	s64 runnable_load_sum, delta_sum, load_sum = 0;
>  
>  	if (!runnable_sum)
>  		return;
>  
>  	gcfs_rq->prop_runnable_sum = 0;
>  
> +	/*
> +	 * Get a rough estimate of gcfs_rq's runnable
> +	 * This is a low guess as it assumes that tasks are equally
> +	 * runnable which is not true but we can't really do better
> +	 */
> +	if (scale_load_down(gcfs_rq->load.weight)) {
> +		load_sum = div_s64(gcfs_rq->avg.load_sum,
> +				scale_load_down(gcfs_rq->load.weight));
	}
> +
> +	/*
> +	 * Propating a delta of runnable is not just adding it to ge's
> +	 * runnable_sum:
> +	 * - Adding a delta runnable can't make ge's runnable_sum higher than
> +	 *   LOAD_AVG_MAX
> +	 * - We can't directly remove a delta of runnable from
> +	 *   ge's runnable_sum but we can only guest estimate what runnable
> +	 *   will become thanks to few simple rules:
> +	 *   - gcfs_rq's runnable is a good estimate
> +	 *   - ge's runnable_sum can't increase when we remove runnable
> +	 *   - runnable_sum can't be lower than running_sum
> +	 */
> +	if (runnable_sum >= 0) {
> +		runnable_sum += se->avg.load_sum;
> +		runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
> +	} else {
> +		runnable_sum = min(se->avg.load_sum, load_sum);
	}
> +
> +	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
> +	runnable_sum = max(runnable_sum, running_sum);
> +
>  	load_sum = (s64)se_weight(se) * runnable_sum;
>  	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
>  
> +	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
> +	delta_avg = load_avg - se->avg.load_avg;
>  
> +	se->avg.load_sum = runnable_sum;
> +	se->avg.load_avg = load_avg;
> +	add_positive(&cfs_rq->avg.load_avg, delta_avg);
> +	add_positive(&cfs_rq->avg.load_sum, delta_sum);
>  
>  	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
>  	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
> +	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
> +	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
>  
> +	se->avg.runnable_load_sum = runnable_sum;
> +	se->avg.runnable_load_avg = runnable_load_avg;
>  
>  	if (se->on_rq) {
> +		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
> +		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
>  	}
>  }
>  
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-13 20:41             ` Peter Zijlstra
@ 2017-10-15 12:01               ` Vincent Guittot
  2017-10-16 13:55               ` Vincent Guittot
  1 sibling, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2017-10-15 12:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On 13 October 2017 at 22:41, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 13, 2017 at 05:22:54PM +0200, Vincent Guittot wrote:
>>
>> I have studied a bit more how to improve the propagation formula and the
>> changes below is doing the job for the UCs that I have tested.
>>
>> Unlike running, we can't directly propagate the runnable through hierarchy
>> when we migrate a task. Instead we must ensure that we will not
>> over/underestimate the impact of the migration thanks to several rules:
>>  - ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
>>  - ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once scaled to
>>    the same range)
>>  - we can't directly propagate a negative delta of runnable_sum because part of
>>    this runnable time can be "shared" with others sched_entities and stays on the
>>    gcfs_rq.
>
> Right, that's about how far I got.
>
>>  - ge->avg.runnable_sum can't increase when we detach a task.
>
> Yeah, that would be fairly broken.
>
>> Instead, we can't estimate the new runnable_sum of the gcfs_rq with
>
>  s/can't/can/ ?

yes it's can

>
>> the formula:
>>
>>   gcfs_rq's runnable sum = gcfs_rq's load_sum / gcfs_rq's weight.
>
> That might be the best we can do.. its wrong, but then its less wrong
> that what we have now. The comments can be much improved though. Not to
> mention that the big comment on top needs a little help.

I'm going to update the comment

>
>> ---
>>  kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----------
>>  1 file changed, 45 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 350dbec0..a063b048 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3489,33 +3489,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
>>  static inline void
>>  update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
>>  {
>> +     long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
>> +     long runnable_load_avg, delta_avg, load_avg;
>> +     s64 runnable_load_sum, delta_sum, load_sum = 0;
>>
>>       if (!runnable_sum)
>>               return;
>>
>>       gcfs_rq->prop_runnable_sum = 0;
>>
>> +     /*
>> +      * Get a rough estimate of gcfs_rq's runnable
>> +      * This is a low guess as it assumes that tasks are equally
>> +      * runnable which is not true but we can't really do better
>> +      */
>> +     if (scale_load_down(gcfs_rq->load.weight)) {
>> +             load_sum = div_s64(gcfs_rq->avg.load_sum,
>> +                             scale_load_down(gcfs_rq->load.weight));
>         }
>> +
>> +     /*
>> +      * Propating a delta of runnable is not just adding it to ge's
>> +      * runnable_sum:
>> +      * - Adding a delta runnable can't make ge's runnable_sum higher than
>> +      *   LOAD_AVG_MAX
>> +      * - We can't directly remove a delta of runnable from
>> +      *   ge's runnable_sum but we can only guest estimate what runnable
>> +      *   will become thanks to few simple rules:
>> +      *   - gcfs_rq's runnable is a good estimate
>> +      *   - ge's runnable_sum can't increase when we remove runnable
>> +      *   - runnable_sum can't be lower than running_sum
>> +      */
>> +     if (runnable_sum >= 0) {
>> +             runnable_sum += se->avg.load_sum;
>> +             runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
>> +     } else {
>> +             runnable_sum = min(se->avg.load_sum, load_sum);
>         }
>> +
>> +     running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
>> +     runnable_sum = max(runnable_sum, running_sum);
>> +
>>       load_sum = (s64)se_weight(se) * runnable_sum;
>>       load_avg = div_s64(load_sum, LOAD_AVG_MAX);
>>
>> +     delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
>> +     delta_avg = load_avg - se->avg.load_avg;
>>
>> +     se->avg.load_sum = runnable_sum;
>> +     se->avg.load_avg = load_avg;
>> +     add_positive(&cfs_rq->avg.load_avg, delta_avg);
>> +     add_positive(&cfs_rq->avg.load_sum, delta_sum);
>>
>>       runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
>>       runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
>> +     delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
>> +     delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
>>
>> +     se->avg.runnable_load_sum = runnable_sum;
>> +     se->avg.runnable_load_avg = runnable_load_avg;
>>
>>       if (se->on_rq) {
>> +             add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
>> +             add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
>>       }
>>  }
>>
>> --
>> 2.7.4
>>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-13 20:41             ` Peter Zijlstra
  2017-10-15 12:01               ` Vincent Guittot
@ 2017-10-16 13:55               ` Vincent Guittot
  2017-10-19 15:04                 ` Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-10-16 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

Hi Peter,

On Friday 13 Oct 2017 at 22:41:11 (+0200), Peter Zijlstra wrote:
> On Fri, Oct 13, 2017 at 05:22:54PM +0200, Vincent Guittot wrote:
> > 
> > I have studied a bit more how to improve the propagation formula and the
> > changes below is doing the job for the UCs that I have tested.
> > 
> > Unlike running, we can't directly propagate the runnable through hierarchy
> > when we migrate a task. Instead we must ensure that we will not
> > over/underestimate the impact of the migration thanks to several rules:
> >  - ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
> >  - ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once scaled to
> >    the same range)
> >  - we can't directly propagate a negative delta of runnable_sum because part of
> >    this runnable time can be "shared" with others sched_entities and stays on the
> >    gcfs_rq.
> 
> Right, that's about how far I got.
> 
> >  - ge->avg.runnable_sum can't increase when we detach a task.
> 
> Yeah, that would be fairly broken.
>
> 
> > Instead, we can't estimate the new runnable_sum of the gcfs_rq with
> 
>  s/can't/can/ ?
> 
> > the formula:
> >
> >   gcfs_rq's runnable sum = gcfs_rq's load_sum / gcfs_rq's weight.
> 
> That might be the best we can do.. its wrong, but then its less wrong
> that what we have now. The comments can be much improved though. Not to
> mention that the big comment on top needs a little help.

Subject: [PATCH] sched: Update runnable propagation rule

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and this
runnable time will also remain on prev cfs_rq and must not be removed.
Instead, we can estimate what should be the new runnable of the prev cfs_rq
and check that this estimation stays within a possible range.
The prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula below:
  
  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight (1)

(1) assumes that tasks are equally runnable which is not true but easy to
compute.

Besides these estimates, we have several simple rules that help us to filter
out wrong ones:
-ge->avg.runnable_sum <= LOAD_AVG_MAX
-ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
-ge->avg.runnable_sum can't increase when we detach a task

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 45 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec0..08d2a58 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3489,33 +3489,56 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	long runnable_load_avg, delta_avg, load_avg;
+	s64 runnable_load_sum, delta_sum, load_sum = 0;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	if (runnable_sum >= 0) {
+		/* Get a rough estimate of the new gcfs_rq's runnable */
+		runnable_sum += se->avg.load_sum;
+		/* ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX */
+		runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
+	} else {
+		/* Get a rough estimate of the new gcfs_rq's runnable */
+		if (scale_load_down(gcfs_rq->load.weight))
+			load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+
+		/* ge->avg.runnable_sum can't increase when removing runnable */
+		runnable_sum = min(se->avg.load_sum, load_sum);
+	}
+
+	/* runnable_sum can't be lower than running_sum */
+	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-09  9:45     ` Peter Zijlstra
@ 2017-10-18 12:45       ` Morten Rasmussen
  2017-10-30 13:35         ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Rasmussen @ 2017-10-18 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Mon, Oct 09, 2017 at 11:45:17AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 09, 2017 at 09:08:57AM +0100, Morten Rasmussen wrote:
> > > --- a/kernel/sched/debug.c
> > > +++ b/kernel/sched/debug.c
> > > @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
> > >  			cfs_rq->removed.load_avg);
> > >  	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
> > >  			cfs_rq->removed.util_avg);
> > > +	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> > > +			cfs_rq->removed.runnable_sum);
> > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > >  	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
> > >  			cfs_rq->tg_load_avg_contrib);
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
> > >  	se->avg.last_update_time = n_last_update_time;
> > >  }
> > >  
> > > -/* Take into account change of utilization of a child task group */
> > > +
> > > +/*
> > > + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> > > + * propagate its contribution. The key to this propagation is the invariant
> > > + * that for each group:
> > > + *
> > > + *   ge->avg == grq->avg						(1)
> > > + *
> > > + * _IFF_ we look at the pure running and runnable sums. Because they
> > > + * represent the very same entity, just at different points in the hierarchy.
> > > + *
> > > + *
> > > + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> > > + * simply copies the running sum over.
> > > + *
> > > + * However, update_tg_cfs_runnable() is more complex. So we have:
> > > + *
> > > + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
> > > + *
> > > + * And since, like util, the runnable part should be directly transferable,
> > > + * the following would _appear_ to be the straight forward approach:
> > > + *
> > > + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
> > 
> > Should it be grq->avg.runnable_avg instead of running_avg?
> 
> Yes very much so. Typing hard. Otherwise (3) would not follow from (2)
> either.
> 
> > cfs_rq->avg.load_avg has been defined previous (in patch 2 I think) to
> > be:
> > 
> > 	load_avg = \Sum se->avg.load_avg
> > 		 = \Sum se->load.weight * se->avg.runnable_avg
> > 
> > That sum will increase when ge is runnable regardless of whether it is
> > running or not. So, I think it has to be runnable_avg to make sense?
> 
> Ack.
> 
> > > + *
> > > + * And per (1) we have:
> > > + *
> > > + *   ge->avg.running_avg == grq->avg.running_avg
> > 
> > You just said further up that (1) only applies to running and runnable
> > sums? These are averages, so I think this is invalid use of (1). But
> > maybe that is part of your point about (4) being wrong?
> > 
> > I'm still trying to get my head around the remaining bits, but it sort
> > of depends if I understood the above bits correctly :)
> 
> So while true, the thing we're looking for is indeed runnable_avg.
> 
> > > + *
> > > + * Which gives:
> > > + *
> > > + *                      ge->load.weight * grq->avg.load_avg
> > > + *   ge->avg.load_avg = -----------------------------------		(4)
> > > + *                               grq->load.weight
> > > + *
> > > + * Except that is wrong!
> > > + *
> > > + * Because while for entities historical weight is not important and we
> > > + * really only care about our future and therefore can consider a pure
> > > + * runnable sum, runqueues can NOT do this.
> > > + *
> > > + * We specifically want runqueues to have a load_avg that includes
> > > + * historical weights. Those represent the blocked load, the load we expect
> > > + * to (shortly) return to us. This only works by keeping the weights as
> > > + * integral part of the sum. We therefore cannot decompose as per (3).
> > > + *
> > > + * OK, so what then?
> 
> And as the text above suggests, we cannot decompose because it contains
> the blocked weight, which is not included in grq->load.weight and thus
> things come apart.
> 
> > > + * Another way to look at things is:
> > > + *
> > > + *   grq->avg.load_avg = \Sum se->avg.load_avg
> > > + *
> > > + * Therefore, per (2):
> > > + *
> > > + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> > > + *
> > > + * And the very thing we're propagating is a change in that sum (someone
> > > + * joined/left). So we can easily know the runnable change, which would be, per
> > > + * (2) the already tracked se->load_avg divided by the corresponding
> > > + * se->weight.
> > > + *
> > > + * Basically (4) but in differential form:
> > > + *
> > > + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> > > + *								   (5)
> > > + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> 
> And this all has runnable again, and so should make sense.

I'm afraid I don't quite get why (5) is correct. It might be related to
the issues Vincent already pointed out.

d(runnable_avg) is the runnable_avg series for the joining/leaving se
which is contributing to grq->avg.load_avg, but I don't see how you can
use that to compute the impact on ge->avg.load_avg.

	ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)

In (5) you have just substituted ge->avg.runnable_avg with
d(runnable_avg) in (2). However, the relationship between
ge->avg.runnable_avg and se->avg.runnable_avg is complicated. ge is
runnable whenever se is, but the reverse isn't necessarily true. Let's
say you have two always-runnable tasks on your grq and one of them leaves
(migrates away). In that case, ge->avg.runnable_avg is equal to
se->avg.runnable_avg (both always-runnable) which is d(runnable_avg), so
in (5) we end up with:

	ge->avg.load_avg =	ge->load.weight * ge->avg.runnable_avg
			      - ge->load.weight * se->avg.runnable_avg
			 = 0

But you still have one always-running task on the grq so clearly it
shouldn't be zero.
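
(Putting numbers on that, just as a check: for an always-runnable se,
se->avg.load_avg / se->load.weight = 1, so (5) removes ge->load.weight * 1,
i.e. all of ge->avg.load_avg, even though the remaining task keeps the grq
fully runnable and ge->avg.load_avg should have stayed at ge->load.weight.)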

IOW, AFAICT, it is not possible to decompose ge->avg.runnable_avg into
contributions from each individual se on the grq. At least not without
some additional assumptions.

What am I missing?

Morten

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-16 13:55               ` Vincent Guittot
@ 2017-10-19 15:04                 ` Vincent Guittot
  2017-10-30 17:20                   ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-10-19 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

Subject: [PATCH v2] sched: Update runnable propagation rule

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and this
runnable time will also remain on prev cfs_rq and must not be removed.
Instead, we can estimate what should be the new runnable of the prev cfs_rq
and check that this estimation stays within a possible range.
The prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula below:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight (1)

(1) assumes that tasks are equally runnable which is not true but easy to
compute.

Besides these estimates, we have several simple rules that help us to filter
out wrong ones:
-ge->avg.runnable_sum <= LOAD_AVG_MAX
-ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
-ge->avg.runnable_sum can't increase when we detach a task

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

v2:
- Fixed some type mismatch warnings

 kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b..aea89df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3489,33 +3489,57 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	unsigned long runnable_load_avg, load_avg;
+	s64 delta_sum;
+	u64 runnable_load_sum, load_sum = 0;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	if (runnable_sum >= 0) {
+		/* Get a rough estimate of the new gcfs_rq's runnable */
+		runnable_sum += se->avg.load_sum;
+		/* ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX */
+		runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
+	} else {
+		/* Get a rough estimate of the new gcfs_rq's runnable */
+		if (scale_load_down(gcfs_rq->load.weight))
+			load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+
+		/* ge->avg.runnable_sum can't increase when removing runnable */
+		runnable_sum = min(se->avg.load_sum, load_sum);
+	}
+
+	/* runnable_sum can't be lower than running_sum */
+	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-18 12:45       ` Morten Rasmussen
@ 2017-10-30 13:35         ` Peter Zijlstra
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-30 13:35 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, linux-kernel, tj, josef, torvalds, vincent.guittot,
	efault, pjt, clm, dietmar.eggemann, bsegall, yuyang.du

On Wed, Oct 18, 2017 at 01:45:25PM +0100, Morten Rasmussen wrote:
> > > > + * Basically (4) but in differential form:
> > > > + *
> > > > + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> > > > + *								   (5)
> > > > + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> > 
> > And this all has runnable again, and so should make sense.
> 
> I'm afraid I don't quite get why (5) is correct. It might be related to
> the issues Vincent already pointed out.

Yeah; I think so... Let me stare at his latest -- which still doesn't
update this comment.. :/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-19 15:04                 ` Vincent Guittot
@ 2017-10-30 17:20                   ` Peter Zijlstra
  2017-10-31 11:14                     ` Vincent Guittot
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-30 17:20 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

So after a bit of poking I ended up with something like the below; I
think there are still a few open points, see XXX. But it's better than we
have now.

Josef, could you see if this completely wrecks your workloads?

---
Subject: sched: Update runnable propagation rule
From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu, 19 Oct 2017 17:04:42 +0200

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.

Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stays within a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable which is not true but
easy to compute.

Besides these estimates, we have several simple rules that help us to filter
out wrong ones:

 - ge->avg.runnable_sum <= LOAD_AVG_MAX
 - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
 - ge->avg.runnable_sum can't increase when we detach a task

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org
---

 kernel/sched/fair.c |   99 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 70 insertions(+), 29 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3412,9 +3412,9 @@ void set_task_rq_fair(struct sched_entit
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- *
- * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
- * simply copies the running sum over.
+ * Per the above update_tg_cfs_util() is trivial and simply copies the running
+ * sum over (but still wrong, because the group entity and group rq do not have
+ * their PELT windows aligned).
  *
  * However, update_tg_cfs_runnable() is more complex. So we have:
  *
@@ -3423,11 +3423,11 @@ void set_task_rq_fair(struct sched_entit
  * And since, like util, the runnable part should be directly transferable,
  * the following would _appear_ to be the straight forward approach:
  *
- *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
  *
  * And per (1) we have:
  *
- *   ge->avg.running_avg == grq->avg.running_avg
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
  *
  * Which gives:
  *
@@ -3446,27 +3446,28 @@ void set_task_rq_fair(struct sched_entit
  * to (shortly) return to us. This only works by keeping the weights as
  * integral part of the sum. We therefore cannot decompose as per (3).
  *
- * OK, so what then?
+ * Another reason this doesn't work is that runnable isn't a 0-sum entity.
+ * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
+ * rq itself is runnable anywhere between 2/3 and 1 depending on how the
+ * runnable section of these tasks overlap (or not). If they were to perfectly
+ * align the rq as a whole would be runnable 2/3 of the time. If however we
+ * always have at least 1 runnable task, the rq as a whole is always runnable.
  *
+ * So we'll have to approximate.. :/
  *
- * Another way to look at things is:
+ * Given the constraint:
  *
- *   grq->avg.load_avg = \Sum se->avg.load_avg
+ *   ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
  *
- * Therefore, per (2):
+ * We can construct a rule that adds runnable to a rq by assuming minimal
+ * overlap.
  *
- *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
+ * On removal, we'll assume each task is equally runnable; which yields:
  *
- * And the very thing we're propagating is a change in that sum (someone
- * joined/left). So we can easily know the runnable change, which would be, per
- * (2) the already tracked se->load_avg divided by the corresponding
- * se->weight.
+ *   grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
  *
- * Basically (4) but in differential form:
+ * XXX: only do this for the part of runnable > running ?
  *
- *   d(runnable_avg) += se->avg.load_avg / se->load.weight
- *								   (5)
- *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
  */
 
 static inline void
@@ -3478,6 +3479,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 	if (!delta)
 		return;
 
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
 	/* Set new sched_entity's utilization */
 	se->avg.util_avg = gcfs_rq->avg.util_avg;
 	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
@@ -3490,33 +3499,65 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	unsigned long runnable_load_avg, load_avg;
+	u64 runnable_load_sum, load_sum = 0;
+	s64 delta_sum;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	if (runnable_sum >= 0) {
+		/*
+		 * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
+		 * the CPU is saturated running == runnable.
+		 */
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
+	} else {
+		/*
+		 * Estimate the departing task's runnable by assuming all tasks
+		 * are equally runnable.
+		 *
+		 * XXX: doesn't deal with multiple departures?
+		 */
+		if (scale_load_down(gcfs_rq->load.weight)) {
+			load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+		}
+
+		/* But make sure to not inflate se's runnable */
+		runnable_sum = min(se->avg.load_sum, load_sum);
+	}
+
+	/* runnable_sum can't be lower than running_sum */
+	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; /* XXX ? */
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-30 17:20                   ` Peter Zijlstra
@ 2017-10-31 11:14                     ` Vincent Guittot
  2017-10-31 15:01                       ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-10-31 11:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On 30 October 2017 at 18:20, Peter Zijlstra <peterz@infradead.org> wrote:
> So after a bit of poking I ended up with something like the below; I
> think there's still a few open points, see XXX. But its better than we
> have now.
>
> Josef, could you see if this completely wrecks your workloads?
>
> ---
> Subject: sched: Update runnable propagation rule
> From: Vincent Guittot <vincent.guittot@linaro.org>
> Date: Thu, 19 Oct 2017 17:04:42 +0200
>
> Unlike running, the runnable part can't be directly propagated through
> the hierarchy when we migrate a task. The main reason is that runnable
> time can be shared with other sched_entities that stay on the rq and
> this runnable time will also remain on prev cfs_rq and must not be
> removed.
>
> Instead, we can estimate what should be the new runnable of the prev
> cfs_rq and check that this estimation stay in a possible range. The
> prop_runnable_sum is a good estimation when adding runnable_sum but
> fails most often when we remove it. Instead, we could use the formula
> below instead:
>
>   gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
>
> which assumes that tasks are equally runnable which is not true but
> easy to compute.
>
> Beside these estimates, we have several simple rules that help us to filter
> out wrong ones:
>
>  - ge->avg.runnable_sum <= than LOAD_AVG_MAX
>  - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
>  - ge->avg.runnable_sum can't increase when we detach a task
>
> Cc: Yuyang Du <yuyang.du@intel.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Chris Mason <clm@fb.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org
> ---
>
>  kernel/sched/fair.c |   99 ++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 70 insertions(+), 29 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3412,9 +3412,9 @@ void set_task_rq_fair(struct sched_entit
>   * _IFF_ we look at the pure running and runnable sums. Because they
>   * represent the very same entity, just at different points in the hierarchy.
>   *
> - *
> - * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> - * simply copies the running sum over.
> + * Per the above update_tg_cfs_util() is trivial and simply copies the running
> + * sum over (but still wrong, because the group entity and group rq do not have
> + * their PELT windows aligned).
>   *
>   * However, update_tg_cfs_runnable() is more complex. So we have:
>   *
> @@ -3423,11 +3423,11 @@ void set_task_rq_fair(struct sched_entit
>   * And since, like util, the runnable part should be directly transferable,
>   * the following would _appear_ to be the straight forward approach:
>   *
> - *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg       (3)
> + *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg      (3)
>   *
>   * And per (1) we have:
>   *
> - *   ge->avg.running_avg == grq->avg.running_avg
> + *   ge->avg.runnable_avg == grq->avg.runnable_avg
>   *
>   * Which gives:
>   *
> @@ -3446,27 +3446,28 @@ void set_task_rq_fair(struct sched_entit
>   * to (shortly) return to us. This only works by keeping the weights as
>   * integral part of the sum. We therefore cannot decompose as per (3).
>   *
> - * OK, so what then?
> + * Another reason this doesn't work is that runnable isn't a 0-sum entity.
> + * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
> + * rq itself is runnable anywhere between 2/3 and 1 depending on how the
> + * runnable section of these tasks overlap (or not). If they were to perfectly
> + * align the rq as a whole would be runnable 2/3 of the time. If however we
> + * always have at least 1 runnable task, the rq as a whole is always runnable.
>   *
> + * So we'll have to approximate.. :/
>   *
> - * Another way to look at things is:
> + * Given the constraint:
>   *
> - *   grq->avg.load_avg = \Sum se->avg.load_avg
> + *   ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
>   *
> - * Therefore, per (2):
> + * We can construct a rule that adds runnable to a rq by assuming minimal
> + * overlap.
>   *
> - *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + * On removal, we'll assume each task is equally runnable; which yields:
>   *
> - * And the very thing we're propagating is a change in that sum (someone
> - * joined/left). So we can easily know the runnable change, which would be, per
> - * (2) the already tracked se->load_avg divided by the corresponding
> - * se->weight.
> + *   grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
>   *
> - * Basically (4) but in differential form:
> + * XXX: only do this for the part of runnable > running ?

That can be a possible improvement in how we are estimating the runnable.
I'm going to run some trials to see the impact on the estimation.

>   *
> - *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> - *                                                                (5)
> - *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
>   */
>
>  static inline void
> @@ -3478,6 +3479,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>         if (!delta)
>                 return;
>
> +       /*
> +        * The relation between sum and avg is:
> +        *
> +        *   LOAD_AVG_MAX - 1024 + sa->period_contrib
> +        *
> +        * however, the PELT windows are not aligned between grq and gse.
> +        */
> +
>         /* Set new sched_entity's utilization */
>         se->avg.util_avg = gcfs_rq->avg.util_avg;
>         se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
> @@ -3490,33 +3499,65 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>  static inline void
>  update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
>  {
> -       long runnable_sum = gcfs_rq->prop_runnable_sum;
> -       long runnable_load_avg, load_avg;
> -       s64 runnable_load_sum, load_sum;
> +       long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
> +       unsigned long runnable_load_avg, load_avg;
> +       u64 runnable_load_sum, load_sum = 0;
> +       s64 delta_sum;
>
>         if (!runnable_sum)
>                 return;
>
>         gcfs_rq->prop_runnable_sum = 0;
>
> +       if (runnable_sum >= 0) {
> +               /*
> +                * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
> +                * the CPU is saturated running == runnable.
> +                */
> +               runnable_sum += se->avg.load_sum;
> +               runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
> +       } else {
> +               /*
> +                * Estimate the departing task's runnable by assuming all tasks
> +                * are equally runnable.
> +                *
> +                * XXX: doesn't deal with multiple departures?

Why would this not deal with multiple departures? We are using
gcfs_rq->avg.load_sum, which reflects the new state of the gcfs_rq, to
evaluate the runnable_sum.
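
To make that concrete, here is a toy user-space model (not kernel code; it
assumes grq->avg.load_sum behaves roughly like the sum of
weight_i * runnable_sum_i over the entities still queued, and the numbers
are made up):

  #include <stdio.h>

  int main(void)
  {
          /* two entities remain on the group rq after the departure(s) */
          long weight[]   = { 1024, 2048 };
          long runnable[] = { 30000, 40000 };  /* their PELT runnable sums */
          long load_sum = 0, load_weight = 0;
          int i;

          for (i = 0; i < 2; i++) {
                  load_sum    += weight[i] * runnable[i];
                  load_weight += weight[i];
          }

          /* gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight */
          printf("estimated runnable_sum = %ld\n", load_sum / load_weight);
          return 0;
  }

This prints 36666, the weight-averaged runnable of whatever is left; because
both load_sum and load.weight already reflect the post-removal state, the
estimate does not care how many entities departed.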

> +                */
> +               if (scale_load_down(gcfs_rq->load.weight)) {
> +                       load_sum = div_s64(gcfs_rq->avg.load_sum,
> +                               scale_load_down(gcfs_rq->load.weight));
> +               }
> +
> +               /* But make sure to not inflate se's runnable */
> +               runnable_sum = min(se->avg.load_sum, load_sum);
> +       }
> +
> +       /* runnable_sum can't be lower than running_sum */
> +       running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; /* XXX ? */

running_sum is scaled by the CPU's capacity, but load_sum is not.

I made the shortcut of using SCHED_CAPACITY_SHIFT for the capacity, but we
might be better off using arch_scale_cpu_capacity(NULL, cpu) instead.
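
To make the difference concrete, a small user-space illustration (not kernel
code; the capacity values are assumed): the shift and the division only agree
when the CPU runs at full capacity.

  #include <stdio.h>

  #define SCHED_CAPACITY_SHIFT 10

  int main(void)
  {
          long util_sum = 20480;  /* made-up capacity-scaled sum */
          long full_cap = 1024;   /* big CPU at full capacity    */
          long half_cap = 512;    /* e.g. a little CPU           */

          printf(">> SCHED_CAPACITY_SHIFT: %ld\n", util_sum >> SCHED_CAPACITY_SHIFT);
          printf("/ capacity 1024        : %ld\n", util_sum / full_cap);
          printf("/ capacity  512        : %ld\n", util_sum / half_cap);
          return 0;
  }

The first two lines both print 20 and the third prints 40, which is why the
shift is only a shortcut for the full-capacity case.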

> +       runnable_sum = max(runnable_sum, running_sum);
> +
>         load_sum = (s64)se_weight(se) * runnable_sum;
>         load_avg = div_s64(load_sum, LOAD_AVG_MAX);
>
> -       add_positive(&se->avg.load_sum, runnable_sum);
> -       add_positive(&se->avg.load_avg, load_avg);
> +       delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
> +       delta_avg = load_avg - se->avg.load_avg;
>
> -       add_positive(&cfs_rq->avg.load_avg, load_avg);
> -       add_positive(&cfs_rq->avg.load_sum, load_sum);
> +       se->avg.load_sum = runnable_sum;
> +       se->avg.load_avg = load_avg;
> +       add_positive(&cfs_rq->avg.load_avg, delta_avg);
> +       add_positive(&cfs_rq->avg.load_sum, delta_sum);
>
>         runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
>         runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
> +       delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
> +       delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
>
> -       add_positive(&se->avg.runnable_load_sum, runnable_sum);
> -       add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
> +       se->avg.runnable_load_sum = runnable_sum;
> +       se->avg.runnable_load_avg = runnable_load_avg;
>
>         if (se->on_rq) {
> -               add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
> -               add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
> +               add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
> +               add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
>         }
>  }

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-31 11:14                     ` Vincent Guittot
@ 2017-10-31 15:01                       ` Peter Zijlstra
  2017-10-31 16:38                         ` Vincent Guittot
  2017-11-16 14:09                         ` [PATCH v3] sched: Update runnable propagation rule Vincent Guittot
  0 siblings, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2017-10-31 15:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On Tue, Oct 31, 2017 at 12:14:11PM +0100, Vincent Guittot wrote:

> > +       if (runnable_sum >= 0) {
> > +               /*
> > +                * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
> > +                * the CPU is saturated running == runnable.
> > +                */
> > +               runnable_sum += se->avg.load_sum;
> > +               runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
> > +       } else {
> > +               /*
> > +                * Estimate the departing task's runnable by assuming all tasks
> > +                * are equally runnable.
> > +                *
> > +                * XXX: doesn't deal with multiple departures?
> 
> Why would this not deal with multiple departures? We are using
> gcfs_rq->avg.load_sum, which reflects the new state of the gcfs_rq, to
> evaluate the runnable_sum.

Ah, I figured the load_sum thing below reflected one average task worth
of runnable.


> > +       /* runnable_sum can't be lower than running_sum */
> > +       running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; /* XXX ? */
> 
> running_sum is scaled by the CPU's capacity, but load_sum is not.
> 
> I made the shortcut of using SCHED_CAPACITY_SHIFT for the capacity, but we
> might be better off using arch_scale_cpu_capacity(NULL, cpu) instead.

Ah, right. We should improve the comments thereabouts, I got totally
lost trying to track that yesterday.

Also; we should look at doing that invariant patch you're still sitting
on.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
  2017-10-31 15:01                       ` Peter Zijlstra
@ 2017-10-31 16:38                         ` Vincent Guittot
  2017-11-16 14:09                         ` [PATCH v3] sched: Update runnable propagation rule Vincent Guittot
  1 sibling, 0 replies; 57+ messages in thread
From: Vincent Guittot @ 2017-10-31 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Tejun Heo, Josef Bacik,
	Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason,
	Dietmar Eggemann, Morten Rasmussen, Ben Segall, Yuyang Du

On 31 October 2017 at 16:01, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 31, 2017 at 12:14:11PM +0100, Vincent Guittot wrote:
>
>> > +       if (runnable_sum >= 0) {
>> > +               /*
>> > +                * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
>> > +                * the CPU is saturated running == runnable.
>> > +                */
>> > +               runnable_sum += se->avg.load_sum;
>> > +               runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
>> > +       } else {
>> > +               /*
>> > +                * Estimate the departing task's runnable by assuming all tasks
>> > +                * are equally runnable.
>> > +                *
>> > +                * XXX: doesn't deal with multiple departures?
>>
>> Why would this not deal with multiple departures? We are using
>> gcfs_rq->avg.load_sum, which reflects the new state of the gcfs_rq, to
>> evaluate the runnable_sum.
>
> Ah, I figured the load_sum thing below reflected one average task worth
> of runnable.
>
>
>> > +       /* runnable_sum can't be lower than running_sum */
>> > +       running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; /* XXX ? */
>>
>> running_sum is scaled by the CPU's capacity, but load_sum is not.
>>
>> I made the shortcut of using SCHED_CAPACITY_SHIFT for the capacity, but we
>> might be better off using arch_scale_cpu_capacity(NULL, cpu) instead.
>
> Ah, right. We should improve the comments thereabouts, I got totally
> lost trying to track that yesterday.
>
> Also; we should look at doing that invariant patch you're still sitting
> on.

Yes. I have to rebase and test the latest changes I did.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v3] sched: Update runnable propagation rule
  2017-10-31 15:01                       ` Peter Zijlstra
  2017-10-31 16:38                         ` Vincent Guittot
@ 2017-11-16 14:09                         ` Vincent Guittot
  2017-11-16 14:21                           ` [PATCH v4] " Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Vincent Guittot @ 2017-11-16 14:09 UTC (permalink / raw)
  To: peterz, linux-kernel
  Cc: Vincent Guittot, Yuyang Du, Ingo Molnar, Mike Galbraith,
	Chris Mason, Linus Torvalds, Dietmar Eggemann, Josef Bacik,
	Ben Segall, Paul Turner, Tejun Heo, Morten Rasmussen

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.

Instead, we can estimate what the new runnable of the prev cfs_rq should
be and check that this estimate stays within a possible range. The
prop_runnable_sum is a good estimate when adding runnable_sum, but it
fails most often when we remove it. In that case we can use the formula
below instead:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable, which is not true but is
easy to compute.

Besides these estimates, we have several simple rules that help us to
filter out the wrong ones:

 - ge->avg.runnable_sum <= LOAD_AVG_MAX
 - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
 - ge->avg.runnable_sum can't increase when we detach a task

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org
---

Hi Peter,

I have rebased the patch, updated the two comments that were unclear, and
fixed the computation of running_sum by using arch_scale_cpu_capacity()
instead of ">> SCHED_CAPACITY_SHIFT".

 kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 72 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0989676..05eabb2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3413,9 +3413,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- *
- * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
- * simply copies the running sum over.
+ * Per the above update_tg_cfs_util() is trivial and simply copies the running
+ * sum over (but still wrong, because the group entity and group rq do not have
+ * their PELT windows aligned).
  *
  * However, update_tg_cfs_runnable() is more complex. So we have:
  *
@@ -3424,11 +3424,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * And since, like util, the runnable part should be directly transferable,
  * the following would _appear_ to be the straight forward approach:
  *
- *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
  *
  * And per (1) we have:
  *
- *   ge->avg.running_avg == grq->avg.running_avg
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
  *
  * Which gives:
  *
@@ -3447,27 +3447,28 @@ void set_task_rq_fair(struct sched_entity *se,
  * to (shortly) return to us. This only works by keeping the weights as
  * integral part of the sum. We therefore cannot decompose as per (3).
  *
- * OK, so what then?
+ * Another reason this doesn't work is that runnable isn't a 0-sum entity.
+ * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
+ * rq itself is runnable anywhere between 2/3 and 1 depending on how the
+ * runnable section of these tasks overlap (or not). If they were to perfectly
+ * align the rq as a whole would be runnable 2/3 of the time. If however we
+ * always have at least 1 runnable task, the rq as a whole is always runnable.
  *
+ * So we'll have to approximate.. :/
  *
- * Another way to look at things is:
+ * Given the constraint:
  *
- *   grq->avg.load_avg = \Sum se->avg.load_avg
+ *   ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
  *
- * Therefore, per (2):
+ * We can construct a rule that adds runnable to a rq by assuming minimal
+ * overlap.
  *
- *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
+ * On removal, we'll assume each task is equally runnable; which yields:
  *
- * And the very thing we're propagating is a change in that sum (someone
- * joined/left). So we can easily know the runnable change, which would be, per
- * (2) the already tracked se->load_avg divided by the corresponding
- * se->weight.
+ *   grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
  *
- * Basically (4) but in differential form:
+ * XXX: only do this for the part of runnable > running ?
  *
- *   d(runnable_avg) += se->avg.load_avg / se->load.weight
- *								   (5)
- *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
  */
 
 static inline void
@@ -3479,6 +3480,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 	if (!delta)
 		return;
 
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
 	/* Set new sched_entity's utilization */
 	se->avg.util_avg = gcfs_rq->avg.util_avg;
 	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
@@ -3491,33 +3500,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	unsigned long runnable_load_avg, load_avg;
+	u64 runnable_load_sum, load_sum = 0;
+	s64 delta_sum;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	if (runnable_sum >= 0) {
+		/*
+		 * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
+		 * the CPU is saturated running == runnable.
+		 */
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
+	} else {
+		/*
+		 * Estimate the new unweighted runnable_sum of the gcfs_rq by
+		 * assuming all tasks are equally runnable.
+		 */
+		if (scale_load_down(gcfs_rq->load.weight)) {
+			load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+		}
+
+		/* But make sure to not inflate se's runnable */
+		runnable_sum = min(se->avg.load_sum, load_sum);
+	}
+
+	/*
+	 * runnable_sum can't be lower than running_sum.
+	 * As running_sum is scaled with the CPU capacity whereas runnable_sum
+	 * is not, we rescale running_sum first.
+	 */
+	running_sum = se->avg.util_sum / arch_scale_cpu_capacity(NULL, cpu)
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4] sched: Update runnable propagation rule
  2017-11-16 14:09                         ` [PATCH v3] sched: Update runnable propagation rule Vincent Guittot
@ 2017-11-16 14:21                           ` Vincent Guittot
  2017-12-06 11:40                             ` Peter Zijlstra
  2017-12-06 20:29                             ` [tip:sched/core] sched/fair: Update and fix the " tip-bot for Vincent Guittot
  0 siblings, 2 replies; 57+ messages in thread
From: Vincent Guittot @ 2017-11-16 14:21 UTC (permalink / raw)
  To: peterz, linux-kernel
  Cc: Vincent Guittot, Yuyang Du, Ingo Molnar, Mike Galbraith,
	Chris Mason, Linus Torvalds, Dietmar Eggemann, Josef Bacik,
	Ben Segall, Paul Turner, Tejun Heo, Morten Rasmussen

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.

Instead, we can estimate what the new runnable of the prev cfs_rq should
be and check that this estimate stays within a possible range. The
prop_runnable_sum is a good estimate when adding runnable_sum, but it
fails most often when we remove it. In that case we can use the formula
below instead:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable, which is not true but is
easy to compute.

Besides these estimates, we have several simple rules that help us to
filter out the wrong ones:

 - ge->avg.runnable_sum <= LOAD_AVG_MAX
 - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
 - ge->avg.runnable_sum can't increase when we detach a task
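
For illustration only, the rules combine as in this minimal user-space sketch
(not part of the patch; the names and numbers are made up; the first two
checks mirror the removal path, while the LOAD_AVG_MAX clip is the add-path
rule):

  #include <stdio.h>

  #define LOAD_AVG_MAX 47742L

  static long filter_runnable(long estimate, long prev_runnable, long running_sum)
  {
          long runnable = estimate;

          if (runnable > prev_runnable)   /* can't increase when detaching */
                  runnable = prev_runnable;
          if (runnable < running_sum)     /* runnable >= running           */
                  runnable = running_sum;
          if (runnable > LOAD_AVG_MAX)    /* never above the PELT maximum  */
                  runnable = LOAD_AVG_MAX;
          return runnable;
  }

  int main(void)
  {
          /* estimate 36000, previous runnable 30000, running 12000 */
          printf("%ld\n", filter_runnable(36000, 30000, 12000));  /* 30000 */
          return 0;
  }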

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org
---

Hi Peter,

Please forget v3, which doesn't compile.

I have rebased the patch, updated the two comments that were unclear, and
fixed the computation of running_sum by using arch_scale_cpu_capacity()
instead of ">> SCHED_CAPACITY_SHIFT".

 kernel/sched/fair.c | 102 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0989676..7d4dd7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3413,9 +3413,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- *
- * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
- * simply copies the running sum over.
+ * Per the above update_tg_cfs_util() is trivial and simply copies the running
+ * sum over (but still wrong, because the group entity and group rq do not have
+ * their PELT windows aligned).
  *
  * However, update_tg_cfs_runnable() is more complex. So we have:
  *
@@ -3424,11 +3424,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * And since, like util, the runnable part should be directly transferable,
  * the following would _appear_ to be the straight forward approach:
  *
- *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
  *
  * And per (1) we have:
  *
- *   ge->avg.running_avg == grq->avg.running_avg
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
  *
  * Which gives:
  *
@@ -3447,27 +3447,28 @@ void set_task_rq_fair(struct sched_entity *se,
  * to (shortly) return to us. This only works by keeping the weights as
  * integral part of the sum. We therefore cannot decompose as per (3).
  *
- * OK, so what then?
+ * Another reason this doesn't work is that runnable isn't a 0-sum entity.
+ * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
+ * rq itself is runnable anywhere between 2/3 and 1 depending on how the
+ * runnable section of these tasks overlap (or not). If they were to perfectly
+ * align the rq as a whole would be runnable 2/3 of the time. If however we
+ * always have at least 1 runnable task, the rq as a whole is always runnable.
  *
+ * So we'll have to approximate.. :/
  *
- * Another way to look at things is:
+ * Given the constraint:
  *
- *   grq->avg.load_avg = \Sum se->avg.load_avg
+ *   ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
  *
- * Therefore, per (2):
+ * We can construct a rule that adds runnable to a rq by assuming minimal
+ * overlap.
  *
- *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
+ * On removal, we'll assume each task is equally runnable; which yields:
  *
- * And the very thing we're propagating is a change in that sum (someone
- * joined/left). So we can easily know the runnable change, which would be, per
- * (2) the already tracked se->load_avg divided by the corresponding
- * se->weight.
+ *   grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
  *
- * Basically (4) but in differential form:
+ * XXX: only do this for the part of runnable > running ?
  *
- *   d(runnable_avg) += se->avg.load_avg / se->load.weight
- *								   (5)
- *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
  */
 
 static inline void
@@ -3479,6 +3480,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 	if (!delta)
 		return;
 
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
 	/* Set new sched_entity's utilization */
 	se->avg.util_avg = gcfs_rq->avg.util_avg;
 	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
@@ -3491,33 +3500,68 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	unsigned long runnable_load_avg, load_avg;
+	u64 runnable_load_sum, load_sum = 0;
+	s64 delta_sum;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	if (runnable_sum >= 0) {
+		/*
+		 * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
+		 * the CPU is saturated running == runnable.
+		 */
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
+	} else {
+		/*
+		 * Estimate the new unweighted runnable_sum of the gcfs_rq by
+		 * assuming all tasks are equally runnable.
+		 */
+		if (scale_load_down(gcfs_rq->load.weight)) {
+			load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+		}
+
+		/* But make sure to not inflate se's runnable */
+		runnable_sum = min(se->avg.load_sum, load_sum);
+	}
+
+	/*
+	 * runnable_sum can't be lower than running_sum.
+	 * As running_sum is scaled with the CPU capacity whereas runnable_sum
+	 * is not, we rescale running_sum first.
+	 */
+	running_sum = se->avg.util_sum /
+		arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v4] sched: Update runnable propagation rule
  2017-11-16 14:21                           ` [PATCH v4] " Vincent Guittot
@ 2017-12-06 11:40                             ` Peter Zijlstra
  2017-12-06 17:10                               ` Ingo Molnar
  2017-12-06 20:29                             ` [tip:sched/core] sched/fair: Update and fix the " tip-bot for Vincent Guittot
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2017-12-06 11:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Yuyang Du, Ingo Molnar, Mike Galbraith,
	Chris Mason, Linus Torvalds, Dietmar Eggemann, Josef Bacik,
	Ben Segall, Paul Turner, Tejun Heo, Morten Rasmussen

On Thu, Nov 16, 2017 at 03:21:52PM +0100, Vincent Guittot wrote:
> Unlike running, the runnable part can't be directly propagated through
> the hierarchy when we migrate a task. The main reason is that runnable
> time can be shared with other sched_entities that stay on the rq and
> this runnable time will also remain on prev cfs_rq and must not be
> removed.
> 
> Instead, we can estimate what the new runnable of the prev cfs_rq should
> be and check that this estimate stays within a possible range. The
> prop_runnable_sum is a good estimate when adding runnable_sum, but it
> fails most often when we remove it. In that case we can use the formula
> below instead:
> 
>   gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
> 
> which assumes that tasks are equally runnable, which is not true but is
> easy to compute.
> 
> Besides these estimates, we have several simple rules that help us to
> filter out the wrong ones:
> 
>  - ge->avg.runnable_sum <= LOAD_AVG_MAX
>  - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
>  - ge->avg.runnable_sum can't increase when we detach a task
> 
> Cc: Yuyang Du <yuyang.du@intel.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Chris Mason <clm@fb.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Ingo, can you stuff this in sched/urgent ?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4] sched: Update runnable propagation rule
  2017-12-06 11:40                             ` Peter Zijlstra
@ 2017-12-06 17:10                               ` Ingo Molnar
  0 siblings, 0 replies; 57+ messages in thread
From: Ingo Molnar @ 2017-12-06 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, linux-kernel, Yuyang Du, Mike Galbraith,
	Chris Mason, Linus Torvalds, Dietmar Eggemann, Josef Bacik,
	Ben Segall, Paul Turner, Tejun Heo, Morten Rasmussen


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Nov 16, 2017 at 03:21:52PM +0100, Vincent Guittot wrote:
> > Unlike running, the runnable part can't be directly propagated through
> > the hierarchy when we migrate a task. The main reason is that runnable
> > time can be shared with other sched_entities that stay on the rq and
> > this runnable time will also remain on prev cfs_rq and must not be
> > removed.
> > 
> > Instead, we can estimate what the new runnable of the prev cfs_rq should
> > be and check that this estimate stays within a possible range. The
> > prop_runnable_sum is a good estimate when adding runnable_sum, but it
> > fails most often when we remove it. In that case we can use the formula
> > below instead:
> > 
> >   gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
> > 
> > which assumes that tasks are equally runnable, which is not true but is
> > easy to compute.
> > 
> > Besides these estimates, we have several simple rules that help us to
> > filter out the wrong ones:
> > 
> >  - ge->avg.runnable_sum <= LOAD_AVG_MAX
> >  - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
> >  - ge->avg.runnable_sum can't increase when we detach a task
> > 
> > Cc: Yuyang Du <yuyang.du@intel.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Mike Galbraith <efault@gmx.de>
> > Cc: Chris Mason <clm@fb.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Josef Bacik <josef@toxicpanda.com>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Paul Turner <pjt@google.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Ingo, can you stuff this in sched/urgent ?

Yeah, I've queued it up in tip:sched/urgent.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [tip:sched/core] sched/fair: Update and fix the runnable propagation rule
  2017-11-16 14:21                           ` [PATCH v4] " Vincent Guittot
  2017-12-06 11:40                             ` Peter Zijlstra
@ 2017-12-06 20:29                             ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 57+ messages in thread
From: tip-bot for Vincent Guittot @ 2017-12-06 20:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: yuyang.du, bsegall, tj, efault, josef, hpa, morten.rasmussen,
	tglx, vincent.guittot, dietmar.eggemann, clm, linux-kernel, pjt,
	peterz, torvalds, mingo

Commit-ID:  a4c3c04974d648ee6e1a09ef4131eb32a02ab494
Gitweb:     https://git.kernel.org/tip/a4c3c04974d648ee6e1a09ef4131eb32a02ab494
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 16 Nov 2017 15:21:52 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 6 Dec 2017 19:30:50 +0100

sched/fair: Update and fix the runnable propagation rule

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.

Instead, we can estimate what the new runnable of the prev cfs_rq should
be and check that this estimate stays within a possible range. The
prop_runnable_sum is a good estimate when adding runnable_sum, but it
fails most often when we remove it. In that case we can use the formula
below instead:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable, which is not true but is
easy to compute.

Besides these estimates, we have several simple rules that help us to
filter out the wrong ones:

 - ge->avg.runnable_sum <= LOAD_AVG_MAX
 - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
 - ge->avg.runnable_sum can't increase when we detach a task

The effect of these fixes is better cgroups balancing.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 102 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4037e19..2fe3aa8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3413,9 +3413,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- *
- * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
- * simply copies the running sum over.
+ * Per the above update_tg_cfs_util() is trivial and simply copies the running
+ * sum over (but still wrong, because the group entity and group rq do not have
+ * their PELT windows aligned).
  *
  * However, update_tg_cfs_runnable() is more complex. So we have:
  *
@@ -3424,11 +3424,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * And since, like util, the runnable part should be directly transferable,
  * the following would _appear_ to be the straight forward approach:
  *
- *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
  *
  * And per (1) we have:
  *
- *   ge->avg.running_avg == grq->avg.running_avg
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
  *
  * Which gives:
  *
@@ -3447,27 +3447,28 @@ void set_task_rq_fair(struct sched_entity *se,
  * to (shortly) return to us. This only works by keeping the weights as
  * integral part of the sum. We therefore cannot decompose as per (3).
  *
- * OK, so what then?
+ * Another reason this doesn't work is that runnable isn't a 0-sum entity.
+ * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
+ * rq itself is runnable anywhere between 2/3 and 1 depending on how the
+ * runnable section of these tasks overlap (or not). If they were to perfectly
+ * align the rq as a whole would be runnable 2/3 of the time. If however we
+ * always have at least 1 runnable task, the rq as a whole is always runnable.
  *
+ * So we'll have to approximate.. :/
  *
- * Another way to look at things is:
+ * Given the constraint:
  *
- *   grq->avg.load_avg = \Sum se->avg.load_avg
+ *   ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
  *
- * Therefore, per (2):
+ * We can construct a rule that adds runnable to a rq by assuming minimal
+ * overlap.
  *
- *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
+ * On removal, we'll assume each task is equally runnable; which yields:
  *
- * And the very thing we're propagating is a change in that sum (someone
- * joined/left). So we can easily know the runnable change, which would be, per
- * (2) the already tracked se->load_avg divided by the corresponding
- * se->weight.
+ *   grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
  *
- * Basically (4) but in differential form:
+ * XXX: only do this for the part of runnable > running ?
  *
- *   d(runnable_avg) += se->avg.load_avg / se->load.weight
- *								   (5)
- *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
  */
 
 static inline void
@@ -3479,6 +3480,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 	if (!delta)
 		return;
 
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
 	/* Set new sched_entity's utilization */
 	se->avg.util_avg = gcfs_rq->avg.util_avg;
 	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
@@ -3491,33 +3500,68 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	unsigned long runnable_load_avg, load_avg;
+	u64 runnable_load_sum, load_sum = 0;
+	s64 delta_sum;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	if (runnable_sum >= 0) {
+		/*
+		 * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
+		 * the CPU is saturated running == runnable.
+		 */
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
+	} else {
+		/*
+		 * Estimate the new unweighted runnable_sum of the gcfs_rq by
+		 * assuming all tasks are equally runnable.
+		 */
+		if (scale_load_down(gcfs_rq->load.weight)) {
+			load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+		}
+
+		/* But make sure to not inflate se's runnable */
+		runnable_sum = min(se->avg.load_sum, load_sum);
+	}
+
+	/*
+	 * runnable_sum can't be lower than running_sum.
+	 * As running_sum is scaled with the CPU capacity whereas runnable_sum
+	 * is not, we rescale running_sum first.
+	 */
+	running_sum = se->avg.util_sum /
+		arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2017-12-06 20:35 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-01 13:20 [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 01/18] sched/fair: Clean up calc_cfs_shares() Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares() Peter Zijlstra
2017-09-28 10:03   ` Morten Rasmussen
2017-09-29 11:35     ` Peter Zijlstra
2017-09-29 13:03       ` Morten Rasmussen
2017-09-01 13:21 ` [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight_entity() Peter Zijlstra
2017-09-29  9:04   ` Morten Rasmussen
2017-09-29 11:38     ` Peter Zijlstra
2017-09-29 13:00       ` Morten Rasmussen
2017-09-01 13:21 ` [PATCH -v2 04/18] sched/fair: Remove se->load.weight from se->avg.load_sum Peter Zijlstra
2017-09-29 15:26   ` Morten Rasmussen
2017-09-29 16:39     ` Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 05/18] sched/fair: Change update_load_avg() arguments Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 06/18] sched/fair: Move enqueue migrate handling Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 07/18] sched/fair: Rename {en,de}queue_entity_load_avg() Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 08/18] sched/fair: Introduce {en,de}queue_load_avg() Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 09/18] sched/fair: More accurate reweight_entity() Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 10/18] sched/fair: Use reweight_entity() for set_user_nice() Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 11/18] sched/fair: Rewrite cfs_rq->removed_*avg Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation Peter Zijlstra
2017-10-09  8:08   ` Morten Rasmussen
2017-10-09  9:45     ` Peter Zijlstra
2017-10-18 12:45       ` Morten Rasmussen
2017-10-30 13:35         ` Peter Zijlstra
2017-10-09 15:03   ` Vincent Guittot
2017-10-09 15:29     ` Vincent Guittot
2017-10-10  7:29       ` Peter Zijlstra
2017-10-10  7:44         ` Vincent Guittot
2017-10-13 15:22           ` Vincent Guittot
2017-10-13 20:41             ` Peter Zijlstra
2017-10-15 12:01               ` Vincent Guittot
2017-10-16 13:55               ` Vincent Guittot
2017-10-19 15:04                 ` Vincent Guittot
2017-10-30 17:20                   ` Peter Zijlstra
2017-10-31 11:14                     ` Vincent Guittot
2017-10-31 15:01                       ` Peter Zijlstra
2017-10-31 16:38                         ` Vincent Guittot
2017-11-16 14:09                         ` [PATCH v3] sched: Update runnable propagation rule Vincent Guittot
2017-11-16 14:21                           ` [PATCH v4] " Vincent Guittot
2017-12-06 11:40                             ` Peter Zijlstra
2017-12-06 17:10                               ` Ingo Molnar
2017-12-06 20:29                             ` [tip:sched/core] sched/fair: Update and fix the " tip-bot for Vincent Guittot
2017-09-01 13:21 ` [PATCH -v2 13/18] sched/fair: Propagate an effective runnable_load_avg Peter Zijlstra
2017-10-02 17:46   ` Dietmar Eggemann
2017-10-03  8:50     ` Peter Zijlstra
2017-10-03  9:29       ` Dietmar Eggemann
2017-10-03 12:26   ` Dietmar Eggemann
2017-09-01 13:21 ` [PATCH -v2 14/18] sched/fair: Synchonous PELT detach on load-balance migrate Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se Peter Zijlstra
2017-10-04 19:27   ` Dietmar Eggemann
2017-10-06 13:02     ` Peter Zijlstra
2017-10-09 12:15       ` Dietmar Eggemann
2017-10-09 12:19         ` Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 16/18] sched/fair: More accurate async detach Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 17/18] sched/fair: Calculate runnable_weight slightly differently Peter Zijlstra
2017-09-01 13:21 ` [PATCH -v2 18/18] sched/fair: Update calc_group_*() comments Peter Zijlstra
