From: Vincent Guittot
Date: Tue, 25 Apr 2017 10:46:37 +0200
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
    Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
In-Reply-To: <20170424201444.GC14169@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org> <20170424201444.GC14169@wtj.duckdns.org>

On 24 April 2017 at 22:14, Tejun Heo wrote:
> We noticed that with the cgroup CPU controller in use, the scheduling
> latency gets wonky regardless of nesting level or weight
> configuration.  This is easily reproducible with Chris Mason's
> schbench[1].
>
> All tests are run on a single-socket, 16-core, 32-thread machine.
> While the machine is mostly idle, it isn't completely idle; there's
> always some variable management load going on.  The command used is
>
>   schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
> which measures scheduling latency with 32 threads, each of which
> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
> representative result when running from the root cgroup.
>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 26
>           75.0000th: 62
>           90.0000th: 74
>           95.0000th: 86
>           *99.0000th: 887
>           99.5000th: 3692
>           99.9000th: 10832
>           min=0, max=13374
>
> The following is from inside a first-level CPU cgroup with the maximum
> weight.
>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 31
>           75.0000th: 65
>           90.0000th: 71
>           95.0000th: 91
>           *99.0000th: 7288
>           99.5000th: 10352
>           99.9000th: 12496
>           min=0, max=13023
>
> Note the drastic increase in p99 scheduling latency.  After
> investigation, it turned out that update_sd_lb_stats(), which is used
> by load_balance() to pick the most loaded group, was often picking the
> wrong group.  A CPU which has one schbench thread running and another
> queued wouldn't report the correspondingly higher weighted_cpuload()
> and would get overlooked as the target of load balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is the
> sum of the load_avg of all queued sched_entities.  Without cgroups, or
> at the root cgroup, each task's load_avg contributes directly to that
> sum.  When a task wakes up or goes to sleep, the change is immediately
> reflected in runnable_load_avg, which in turn affects load balancing.
>
> However, when the CPU cgroup is in use, a nested cfs_rq blocks this
> immediate reflection.  When a task wakes up inside a cgroup, the
> nested cfs_rq's runnable_load_avg is updated but doesn't get
> propagated to the parent.  It only affects the matching sched_entity's
> load_avg over time, which then gets propagated to the runnable_load_avg
> at that level.  This often leaves weighted_cpuload() temporarily out of
> sync, leading to suboptimal behavior of load_balance() and the increase
> in scheduling latencies shown above.
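To make the lag described above concrete, here is a minimal stand-alone
model, not kernel code: the struct, its field names and the 1024 value are
purely illustrative assumptions. It only shows that a wakeup inside a
nested cfs_rq is visible to that cfs_rq at once, while the root still sees
the group entity's pre-wakeup load_avg.

/* Illustration only -- not the kernel's implementation. */
#include <stdio.h>

struct model_cfs_rq {
        long runnable_load_avg;   /* sum of queued entities' load_avg */
        long group_se_load_avg;   /* load_avg of this group's entity in its parent */
};

int main(void)
{
        struct model_cfs_rq root  = { 0, 0 };
        struct model_cfs_rq child = { 0, 0 };
        long task_load_avg = 1024;        /* assumed contribution of one woken task */

        /* Wakeup inside the child cgroup: reflected there immediately. */
        child.runnable_load_avg += task_load_avg;

        /* The root only tracks the group entity's load_avg, still at its old value. */
        root.runnable_load_avg += child.group_se_load_avg;

        printf("child runnable_load_avg = %ld\n", child.runnable_load_avg); /* 1024 */
        printf("root  runnable_load_avg = %ld\n", root.runnable_load_avg);  /* still 0 */
        return 0;
}

In the real code the group entity's load_avg does catch up over time via
PELT; the point is only that the change is not visible at the root at
wakeup time, which is what load_balance() samples.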
>
> This patch fixes the issue by updating propagate_entity_load_avg() to
> always propagate to the parent's runnable_load_avg.  Combined with the
> previous patch, this keeps a cfs_rq's runnable_load_avg always equal to
> the sum of the scaled loads of all tasks queued below it, removing the
> artifacts from nested cfs_rqs.  The following is from inside three
> levels of nesting with the patch applied.

So you are changing the purpose of propagate_entity_load_avg(), which
aims to propagate load_avg/util_avg changes only when a task migrates,
and you also want to propagate the enqueue/dequeue into the parent
cfs_rq->runnable_load_avg.

>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 40
>           75.0000th: 71
>           90.0000th: 89
>           95.0000th: 108
>           *99.0000th: 679
>           99.5000th: 3500
>           99.9000th: 10960
>           min=0, max=13790
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>
> Signed-off-by: Tejun Heo
> Cc: Vincent Guittot
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Mike Galbraith
> Cc: Paul Turner
> Cc: Chris Mason
> ---
>  kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>  1 file changed, 21 insertions(+), 13 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq
>
>  /* Take into account change of load of a child task group */
>  static inline void
> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +                   bool propagate_avg)
>  {
>          struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>          long load = 0, delta;
> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq
>          se->avg.load_avg = load;
>          se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>
> -        /* Update parent cfs_rq load */
> -        add_positive(&cfs_rq->avg.load_avg, delta);
> -        cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +        if (propagate_avg) {
> +                /* Update parent cfs_rq load */
> +                add_positive(&cfs_rq->avg.load_avg, delta);
> +                cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +        }
>
>          /*
>           * If the sched_entity is already enqueued, we also have to update the
> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>  /* Update task and its cfs_rq load average */
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>  {
> -        struct cfs_rq *cfs_rq;
> +        struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +        bool propagate_avg;
>
>          if (entity_is_task(se))
>                  return 0;
>
> -        if (!test_and_clear_tg_cfs_propagate(se))
> -                return 0;
> -
> -        cfs_rq = cfs_rq_of(se);
> +        propagate_avg = test_and_clear_tg_cfs_propagate(se);
>
> -        set_tg_cfs_propagate(cfs_rq);
> +        /*
> +         * We want to keep @cfs_rq->runnable_load_avg always in sync so
> +         * that the load balancer can accurately determine the busiest CPU
> +         * regardless of cfs_rq nesting.
> +         */
> +        update_tg_cfs_load(cfs_rq, se, propagate_avg);
>
> -        update_tg_cfs_util(cfs_rq, se);
> -        update_tg_cfs_load(cfs_rq, se);
> +        if (propagate_avg) {
> +                set_tg_cfs_propagate(cfs_rq);
> +                update_tg_cfs_util(cfs_rq, se);
> +        }
>
> -        return 1;
> +        return propagate_avg;
>  }
>
>  #else /* CONFIG_FAIR_GROUP_SCHED */
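For contrast with the sketch earlier in this mail, here is the behaviour
the patch description argues for, again with assumed names rather than
the functions in the diff above: the enqueue/dequeue delta is applied at
every level up to the root, so the root's runnable_load_avg stays the sum
of the task loads queued below it.

/* Illustration only -- assumed names, not the kernel's implementation. */
#include <stdio.h>

struct model_cfs_rq {
        struct model_cfs_rq *parent;
        long runnable_load_avg;
};

/* Apply the same runnable delta at every level up to the root. */
static void propagate_runnable_delta(struct model_cfs_rq *cfs_rq, long delta)
{
        for (; cfs_rq; cfs_rq = cfs_rq->parent)
                cfs_rq->runnable_load_avg += delta;
}

int main(void)
{
        struct model_cfs_rq root  = { NULL, 0 };
        struct model_cfs_rq child = { &root, 0 };

        propagate_runnable_delta(&child, 1024);   /* a task is enqueued in the child */
        printf("child = %ld, root = %ld\n",
               child.runnable_load_avg, root.runnable_load_avg);  /* 1024, 1024 */

        propagate_runnable_delta(&child, -1024);  /* the task is dequeued again */
        printf("child = %ld, root = %ld\n",
               child.runnable_load_avg, root.runnable_load_avg);  /* 0, 0 */
        return 0;
}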