[PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Tejun Heo <tj@kernel.org>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Mon, 24 Apr 2017 13:14:44 -0700	[thread overview]
Message-ID: <20170424201444.GC14169@wtj.duckdns.org> (raw)
In-Reply-To: <20170424201344.GA14169@wtj.duckdns.org>

We noticed that with cgroup CPU controller in use, the scheduling
latency gets wonky regardless of nesting level or weight
configuration.  This is easily reproducible with Chris Mason's
schbench[1].

All tests are run on a single socket, 16 cores, 32 threads machine.
While the machine is mostly idle, it isn't completey.  There's always
some variable management load going on.  The command used is

 schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

which measures scheduling latency with 32 threads each of which
repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
representative result when running from the root cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 26
	 75.0000th: 62
	 90.0000th: 74
	 95.0000th: 86
	 *99.0000th: 887
	 99.5000th: 3692
	 99.9000th: 10832
	 min=0, max=13374

The following is inside a first level CPU cgroup with the maximum
weight.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 31
	 75.0000th: 65
	 90.0000th: 71
	 95.0000th: 91
	 *99.0000th: 7288
	 99.5000th: 10352
	 99.9000th: 12496
	 min=0, max=13023

Note the drastic increase in p99 scheduling latency.  After
investigation, it turned out that the update_sd_lb_stats(), which is
used by load_balance() to pick the most loaded group, was often
picking the wrong group.  A CPU which has one schbench running and
another queued wouldn't report the correspondingly higher
weighted_cpuload() and get looked over as the target of load
balancing.

weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
sum of the load_avg of all queued sched_entities.  Without cgroups or
at the root cgroup, each task's load_avg contributes directly to the
sum.  When a task wakes up or goes to sleep, the change is immediately
reflected on runnable_load_avg which in turn affects load balancing.

However, when CPU cgroup is in use, a nesting cfs_rq blocks this
immediate reflection.  When a task wakes up inside a cgroup, the
nested cfs_rq's runnable_load_avg is updated but doesn't get
propagated to the parent.  It only affects the matching sched_entity's
load_avg over time which then gets propagated to the runnable_load_avg
at that level.  This makes weighted_cpuload() often temporarily out of
sync leading to suboptimal behavior of load_balance() and increase in
scheduling latencies as shown above.

This patch fixes the issue by updating propagate_entity_load_avg() to
always propagate to the parent's runnable_load_avg.  Combined with the
previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
of the scaled loads of all tasks queued below removing the artifacts
from nesting cfs_rqs.  The following is from inside three levels of
nesting with the patch applied.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 40
	 75.0000th: 71
	 90.0000th: 89
	 95.0000th: 108
	 *99.0000th: 679
	 99.5000th: 3500
	 99.9000th: 10960
	 min=0, max=13790

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
---
 kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 
 /* Take into account change of load of a child task group */
 static inline void
-update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
+		   bool propagate_avg)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long load = 0, delta;
@@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
 	se->avg.load_avg = load;
 	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
 
-	/* Update parent cfs_rq load */
-	add_positive(&cfs_rq->avg.load_avg, delta);
-	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	if (propagate_avg) {
+		/* Update parent cfs_rq load */
+		add_positive(&cfs_rq->avg.load_avg, delta);
+		cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	}
 
 	/*
 	 * If the sched_entity is already enqueued, we also have to update the
@@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
 /* Update task and its cfs_rq load average */
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	bool propagate_avg;
 
 	if (entity_is_task(se))
 		return 0;
 
-	if (!test_and_clear_tg_cfs_propagate(se))
-		return 0;
-
-	cfs_rq = cfs_rq_of(se);
+	propagate_avg = test_and_clear_tg_cfs_propagate(se);
 
-	set_tg_cfs_propagate(cfs_rq);
+	/*
+	 * We want to keep @cfs_rq->runnable_load_avg always in sync so
+	 * that the load balancer can accurately determine the busiest CPU
+	 * regardless of cfs_rq nesting.
+	 */
+	update_tg_cfs_load(cfs_rq, se, propagate_avg);
 
-	update_tg_cfs_util(cfs_rq, se);
-	update_tg_cfs_load(cfs_rq, se);
+	if (propagate_avg) {
+		set_tg_cfs_propagate(cfs_rq);
+		update_tg_cfs_util(cfs_rq, se);
+	}
 
-	return 1;
+	return propagate_avg;
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */