From: Vincent Guittot
Date: Tue, 25 Apr 2017 10:35:53 +0200
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
In-Reply-To: <20170424201415.GB14169@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org> <20170424201415.GB14169@wtj.duckdns.org>

Hi Tejun,

On 24 April 2017 at 22:14, Tejun Heo wrote:
> 09a43ace1f98 ("sched/fair: Propagate load during synchronous
> attach/detach") added immediate load propagation from cfs_rq to its
> sched_entity then to the parent cfs_rq; however, what gets propagated
> doesn't seem to make sense.
>
> It repeats the tg_weight calculation done in calc_cfs_shares() but
> only uses it to compensate for shares being out of date.  After that,
> it sets the sched_entity's load_avg to the load_avg of the
> corresponding cfs_rq.
>
> This doesn't make sense as the cfs_rq's load_avg is some fraction of
> its total weight, which the sched_entity's weight has nothing to do
> with.  For example, if the cfs_rq has a single constant load 1 task
> the cfs_rq's load_avg would be around 1.  If that cfs_rq is the only
> active sched_entity in the parent cfs_rq which has the maximum weight,
> the sched_entity's load should be around the maximum weight but
> update_tg_cfs_load() ends up overriding it to 1.

I'm not sure I follow your example: a task TA with a load_avg = 1 is the
only task in a task group GB, so the cfs_rq's load_avg = 1 too, and the
group_entity of this cfs_rq has a weight of 1024 (I use the 10bits format
for readability), which is the total share of task group GB.

Are you saying that the group_entity's load_avg should be around 1024 and
not 1?

I would say it depends on TA's weight. I assume that TA's weight is the
default value (1024) as you don't specify any value in your example.

If TA directly ran at the parent level, its sched_entity would have a
load_avg of 1, so why should the group entity's load_avg be 1024? It
would just temporarily show the cfs_rq as more loaded than it really is,
and in the end the group entity's load_avg would go back to 1.

>
> At the parent's level, the absolute value of load_avg inside a child
> cfs_rq doesn't mean anything.  Only the ratio against its weight is
> meaningful.
>
> This patch changes update_tg_cfs_load() to normalize the
> runnable_load_avg of the cfs_rq and then scale it to the matching
> sched_entity's freshly calculated shares for propagation.  Use of
> runnable_load_avg instead of load_avg is intentional and keeps the
> parent's runnable_load_avg true to the sum of scaled loads of all
> tasks queued under it, which is critical for the correct operation
> of the load balancer.  The next patch will depend on it.
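
To put the proposed scaling in one place, here is a standalone sketch of
the computation described above. Illustration only: plain userspace C, a
hypothetical helper name propagated_load() and made-up parameter names
standing in for the cfs_rq/sched_entity fields; the actual kernel change
is in the hunk below.

/*
 * Illustration only -- not the kernel implementation.  Sketch of the
 * proposed propagation: normalize the child cfs_rq's runnable load by
 * its queue weight and rescale it to the group entity's shares.
 */
#include <stdio.h>

static long propagated_load(long runnable_load_avg, /* child cfs_rq's runnable_load_avg */
			    long queue_weight,      /* child cfs_rq->load.weight */
			    long shares)            /* result of calc_cfs_shares() */
{
	long load;

	if (!queue_weight)
		return 0;

	/* normalize against the queue's weight, then scale to the entity's shares */
	load = runnable_load_avg * shares / queue_weight;

	/* never propagate more than the freshly computed shares */
	return load < shares ? load : shares;
}

int main(void)
{
	/* hypothetical numbers in the 10bits format: two nice-0 tasks queued
	 * (weight 2048) with a combined runnable_load_avg of 512 */
	printf("%ld\n", propagated_load(512, 2048, 1024));	/* prints 256 */
	return 0;
}
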
>
> Signed-off-by: Tejun Heo
> Cc: Vincent Guittot
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Mike Galbraith
> Cc: Paul Turner
> ---
>  kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
>  1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,29 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -	long delta, load = gcfs_rq->avg.load_avg;
> +	long load = 0, delta;
>
>  	/*
> -	 * If the load of group cfs_rq is null, the load of the
> -	 * sched_entity will also be null so we can skip the formula
> +	 * A cfs_rq's load avg contribution to the parent should be scaled
> +	 * to the sched_entity's weight.  Use freshly calculated shares
> +	 * instead of @se->load.weight as the latter may not reflect
> +	 * changes from the current scheduling operation.
> +	 *
> +	 * Note that the propagation source is runnable_load_avg instead of
> +	 * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
> +	 * the sum of the scaled loads of all tasks queued under it, which
> +	 * is important for the correct operation of the load balancer.
> +	 *
> +	 * This can make the sched_entity's load_avg jumpier but that
> +	 * correctly reflects what would happen without cgroups if each
> +	 * task's load is scaled across nesting - the load is being
> +	 * averaged at the task and each cfs_rq.
>  	 */
> -	if (load) {
> -		long tg_load;
> +	if (gcfs_rq->load.weight) {
> +		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> -		/* Get tg's load and ensure tg_load > 0 */
> -		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -		/* Ensure tg_load >= load and updated with current load*/
> -		tg_load -= gcfs_rq->tg_load_avg_contrib;
> -		tg_load += load;
> -
> -		/*
> -		 * We need to compute a correction term in the case that the
> -		 * task group is consuming more CPU than a task of equal
> -		 * weight. A task with a weight equals to tg->shares will have
> -		 * a load less or equal to scale_load_down(tg->shares).
> -		 * Similarly, the sched_entities that represent the task group
> -		 * at parent level, can't have a load higher than
> -		 * scale_load_down(tg->shares). And the Sum of sched_entities'
> -		 * load must be <= scale_load_down(tg->shares).
> -		 */
> -		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -			/* scale gcfs_rq's load into tg's shares*/
> -			load *= scale_load_down(gcfs_rq->tg->shares);
> -			load /= tg_load;
> -		}
> +		load = min(gcfs_rq->runnable_load_avg *
> +			   shares / gcfs_rq->load.weight, shares);
>  	}
>
>  	delta = load - se->avg.load_avg;
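
For what it's worth, plugging the TA/GB example above into the new formula,
purely as an illustration and assuming calc_cfs_shares() returns the full
1024 for the only active group entity (10bits format again):
gcfs_rq->load.weight = 1024, gcfs_rq->runnable_load_avg ~= 1, so
load = min(1 * 1024 / 1024, 1024) = 1, and delta = load - se->avg.load_avg
stays around 0 once the group entity has already converged to ~1.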