From: Tejun Heo <tj@kernel.org>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
Date: Mon, 24 Apr 2017 14:33:24 -0700
Message-ID: <20170424213324.GA23619@wtj.duckdns.org>
In-Reply-To: <20170424201415.GB14169@wtj.duckdns.org>

09a43ace1f98 ("sched/fair: Propagate load during synchronous
attach/detach") added immediate load propagation from cfs_rq to its
sched_entity then to the parent cfs_rq; however, what gets propagated
doesn't seem to make sense.

It repeats the tg_weight calculation done in calc_cfs_shares() but
only uses it to compensate for shares being out of date.  After that,
it sets the sched_entity's load_avg to the load_avg of the
corresponding cfs_rq.
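
For reference, the removed logic boils down to the following sketch (a
simplified restatement of the code deleted in the hunk below, with the
kernel accessors and scale_load_down() folded into plain parameters;
this is illustrative, not a drop-in kernel function):

	/*
	 * tg_load approximates the tg-wide load with gcfs_rq's stale
	 * contribution replaced by its current load_avg; it is only
	 * used to scale @load down when the group consumes more than
	 * its shares.
	 */
	static long old_propagated_load(long load, long tg_load_avg,
					long tg_load_avg_contrib,
					long tg_shares)
	{
		long tg_load = tg_load_avg + 1;	/* ensure tg_load > 0 */

		tg_load -= tg_load_avg_contrib;	/* drop stale contribution */
		tg_load += load;		/* ... add the current one */

		if (tg_load > tg_shares) {	/* scale into tg's shares */
			load *= tg_shares;
			load /= tg_load;
		}

		return load;			/* becomes se->avg.load_avg */
	}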

This doesn't make sense as the cfs_rq's load_avg is some fraction of
its total weight, which the sched_entity's weight has nothing to do
with.  For example, if the cfs_rq has a single task with a constant
load of 1, the cfs_rq's load_avg would be around 1.  If that cfs_rq's
sched_entity is the only active one in the parent cfs_rq and has the
maximum weight, its load should be around the maximum weight, but
update_tg_cfs_load() ends up overriding it to 1.

At the parent's level, the absolute value of load_avg inside a child
cfs_rq doesn't mean anything.  Only the ratio against its weight is
meaningful.
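
To put hypothetical numbers on it (assuming tg->shares of 1024; the
exact scale factors don't matter for the argument):

	child cfs_rq's load_avg:       ~1     (single constant load 1 task)
	se's load should be:           ~1024  (only active se in the parent)
	se's load after propagation:    1     (load_avg copied verbatim)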

This patch changes update_tg_cfs_load() to normalize the
runnable_load_avg of the cfs_rq and then scale it to the matching
sched_entity's freshly calculated shares for propagation.  The use of
runnable_load_avg instead of load_avg is intentional and keeps the
parent's runnable_load_avg true to the sum of the scaled loads of all
tasks queued under it, which is critical for the correct operation of
the load balancer.  The next patch will depend on it.
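
In other words, the new propagation reduces to something like the
following sketch (simplified from the hunk below; again illustrative
rather than the actual kernel function):

	static long new_propagated_load(long runnable_load_avg,
					long grp_weight, long shares)
	{
		long load;

		/* nothing queued on the child: nothing to propagate */
		if (!grp_weight)
			return 0;

		/* normalize to the child's weight, rescale to fresh shares */
		load = runnable_load_avg * shares / grp_weight;

		/* a group se's load can never exceed its shares */
		return load < shares ? load : shares;	/* min_t(long, ...) */
	}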

v2: Use min_t() to squash a build warning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c |   47 ++++++++++++++++++++---------------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * A cfs_rq's load avg contribution to the parent should be scaled
+	 * to the sched_entity's weight.  Use freshly calculated shares
+	 * instead of @se->load.weight as the latter may not reflect
+	 * changes from the current scheduling operation.
+	 *
+	 * Note that the propagation source is runnable_load_avg instead of
+	 * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
+	 * the sum of the scaled loads of all tasks queued under it, which
+	 * is important for the correct operation of the load balancer.
+	 *
+	 * This can make the sched_entity's load_avg jumpier but that
+	 * correctly reflects what would happen without cgroups if each
+	 * task's load is scaled across nesting - the load is being
+	 * averaged at the task and each cfs_rq.
 	 */
-	if (load) {
-		long tg_load;
+	if (gcfs_rq->load.weight) {
+		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+		load = min_t(long, shares,
+			     gcfs_rq->runnable_load_avg *
+			     shares / gcfs_rq->load.weight);
 	}
 
 	delta = load - se->avg.load_avg;


Thread overview: 69+ messages
2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
2017-04-24 21:33   ` Tejun Heo [this message]
2017-05-03 18:00     ` [PATCH v2 " Peter Zijlstra
2017-05-03 21:45       ` Tejun Heo
2017-05-04  5:51         ` Peter Zijlstra
2017-05-04  6:21           ` Peter Zijlstra
2017-05-04  9:49             ` Dietmar Eggemann
2017-05-04 10:57               ` Peter Zijlstra
2017-05-04 17:39               ` Tejun Heo
2017-05-05 10:36                 ` Dietmar Eggemann
2017-05-04 10:26       ` Vincent Guittot
2017-04-25  8:35   ` [PATCH " Vincent Guittot
2017-04-25 18:12     ` Tejun Heo
2017-04-26 16:51       ` Vincent Guittot
2017-04-26 22:40         ` Tejun Heo
2017-04-27  7:00           ` Vincent Guittot
2017-05-01 14:17         ` Peter Zijlstra
2017-05-01 14:52           ` Peter Zijlstra
2017-05-01 21:56           ` Tejun Heo
2017-05-02  8:19             ` Peter Zijlstra
2017-05-02  8:30               ` Peter Zijlstra
2017-05-02 20:00                 ` Tejun Heo
2017-05-03  9:10                   ` Peter Zijlstra
2017-04-26 16:14   ` Vincent Guittot
2017-04-26 22:27     ` Tejun Heo
2017-04-27  8:59       ` Vincent Guittot
2017-04-28 17:46         ` Tejun Heo
2017-05-02  7:20           ` Vincent Guittot
2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
2017-04-25  8:46   ` Vincent Guittot
2017-04-25  9:05     ` Vincent Guittot
2017-04-25 12:59       ` Vincent Guittot
2017-04-25 18:49         ` Tejun Heo
2017-04-25 20:49           ` Tejun Heo
2017-04-25 21:15             ` Chris Mason
2017-04-25 21:08           ` Tejun Heo
2017-04-26 10:21             ` Vincent Guittot
2017-04-27  0:30               ` Tejun Heo
2017-04-27  8:28                 ` Vincent Guittot
2017-04-28 16:14                   ` Tejun Heo
2017-05-02  6:56                     ` Vincent Guittot
2017-05-02 20:56                       ` Tejun Heo
2017-05-03  7:25                         ` Vincent Guittot
2017-05-03  7:54                           ` Vincent Guittot
2017-04-26 18:12   ` Vincent Guittot
2017-04-26 22:52     ` Tejun Heo
2017-04-27  8:29       ` Vincent Guittot
2017-04-28 20:33         ` Tejun Heo
2017-04-28 20:38           ` Tejun Heo
2017-05-01 15:56           ` Peter Zijlstra
2017-05-02 22:01             ` Tejun Heo
2017-05-02  7:18           ` Vincent Guittot
2017-05-02 13:26             ` Vincent Guittot
2017-05-02 22:37               ` Tejun Heo
2017-05-02 21:50             ` Tejun Heo
2017-05-03  7:34               ` Vincent Guittot
2017-05-03  9:37                 ` Peter Zijlstra
2017-05-03 10:37                   ` Vincent Guittot
2017-05-03 13:09                     ` Peter Zijlstra
2017-05-03 21:49                       ` Tejun Heo
2017-05-04  8:19                         ` Vincent Guittot
2017-05-04 17:43                           ` Tejun Heo
2017-05-04 19:02                             ` Vincent Guittot
2017-05-04 19:04                               ` Tejun Heo
2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
2017-04-24 21:48   ` Peter Zijlstra
2017-04-24 22:54     ` Tejun Heo
2017-04-25 21:09   ` Tejun Heo
