From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1756540AbcIUSDD (ORCPT );
        Wed, 21 Sep 2016 14:03:03 -0400
Received: from mail-lf0-f44.google.com ([209.85.215.44]:33469 "EHLO
        mail-lf0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1755510AbcIUSDC (ORCPT );
        Wed, 21 Sep 2016 14:03:02 -0400
MIME-Version: 1.0
In-Reply-To:
References: <1473666472-13749-1-git-send-email-vincent.guittot@linaro.org>
        <1473666472-13749-3-git-send-email-vincent.guittot@linaro.org>
From: Vincent Guittot
Date: Wed, 21 Sep 2016 20:02:39 +0200
Message-ID:
Subject: Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
To: Dietmar Eggemann
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
        Linaro Kernel Mailman List, Paul Turner, Benjamin Segall
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 21 September 2016 at 19:25, Dietmar Eggemann wrote:
> On 21/09/16 13:34, Vincent Guittot wrote:
>> Hi Dietmar,
>>
>> On 21 September 2016 at 12:14, Dietmar Eggemann
>> wrote:
>>> Hi Vincent,
>>>
>>> On 12/09/16 08:47, Vincent Guittot wrote:
>
> [...]
>
>>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
>>> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
>>> owned by Cpu1 stay in the leaf_cfs_rq list.
>>>
>>> Shouldn't we reintegrate it? Following patch goes on top of this patch:
>>
>> I see one problem: once a cfs_rq has been removed, it will not be
>> periodically updated anymore until the next enqueue. A sleeping task
>> is only attached but not enqueued when it moves into a cfs_rq, so its
>> load is added to the cfs_rq's load, which has to be periodically
>> updated/decayed. So adding a cfs_rq to the list only during an enqueue
>> is not enough in this case.
>
> There was more in the original patch (commit 82958366cfea): the removal of the
> cfs_rq from the list was only done once se->avg.runnable_avg_sum had
> decayed to 0. Couldn't we use something similar with se->avg.load_avg instead,
> to make sure that these blocked contributions have been decayed before we do the
> removal?

Yes, we can use se->avg.load_avg, but that's not enough. Once
se->avg.load_avg becomes null and the group cfs_rq->nr_running == 0, we
remove the cfs_rq from the list and this cfs_rq will not be updated
again until a new sched_entity is enqueued. But when you move a
sleeping task between 2 task groups, the load of the task is attached
to the destination cfs_rq but not enqueued, so cfs_rq->avg.load_avg is
no longer null, yet it will not be updated by update_blocked_averages()
because the cfs_rq is no longer in the list. So waiting for the enqueue
of a sched_entity to add the cfs_rq back in the list is not enough: you
will have to do that during the attach too (a rough sketch of what I
mean is at the end of this mail).

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4ac1718560e2..3595b0623b37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
>          * list_add_leaf_cfs_rq() for details.
>          */
>         for_each_leaf_cfs_rq(rq, cfs_rq) {
> +               struct sched_entity *se = cfs_rq->tg->se[cpu];
> +
>                 /* throttled entities do not contribute to load */
>                 if (throttled_hierarchy(cfs_rq))
>                         continue;
> @@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
>                 update_tg_load_avg(cfs_rq, 0);
>
>                 /* Propagate pending load changes to the parent */
> -               if (cfs_rq->tg->se[cpu])
> -                       update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +               if (se) {
> +                       update_load_avg(se, 0, 0);
> +
> +                       if (!se->avg.load_avg && !cfs_rq->nr_running)
> +                               list_del_leaf_cfs_rq(cfs_rq);
> +               }
>         }
>         raw_spin_unlock_irqrestore(&rq->lock, flags);
> }
>
>> Then, only the highest child level of task group will be removed,
>> because cfs_rq->nr_running will be > 0 as soon as a child task group
>> is created and enqueued into a task group. Or you should use
>> cfs.h_nr_running instead; I don't know all the implications.
>
> In my tests all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) get removed? Do I miss something?

No, you don't miss anything. I replied a bit too quickly on that point:
the sched_entity of a task group is dequeued when its group cfs_rq
becomes idle, so the parent cfs_rq->nr_running can be null as well.

>
> migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 prio=120 orig_cpu=1 dest_cpu=2
> migration/1-16 [001] 67.235295: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097550d700 tg_id=6 on_list=0
> migration/1-16 [001] 67.235298: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800975903700 tg_id=4 on_list=0
> migration/1-16 [001] 67.235300: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800974e0fe00 tg_id=2 on_list=0
> migration/1-16 [001] 67.235302: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097fecb6e0 tg_id=1 on_list=1
> migration/1-16 [001] 67.235309: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
> -> migration/1-16 [001] 67.235312: bprint: update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
> migration/1-16 [001] 67.235314: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
> -> migration/1-16 [001] 67.235315: bprint: update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
> migration/1-16 [001] 67.235316: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
> -> migration/1-16 [001] 67.235318: bprint: update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
> migration/1-16 [001] 67.235319: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097feb56e0 tg_id=1 on_list=1
>
> If we don't remove these cfs_rq's, IMHO they stay in the list as long as the tg exists.

Yes, I agree. The main drawback is probably that update_blocked_averages()
keeps looping over these cfs_rqs during the periodic load balance. We
have to compare the cost of adding a branch of cfs_rqs to the list
during the enqueue/attach of a sched_entity with the cost of uselessly
looping over these load-free cfs_rqs during the periodic load balance.

>
> [...]
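
For the attach part, below is a rough, untested sketch of what I mean
(not against any particular tree; it assumes attach_task_cfs_rq() is
the place where the sleeping task's load gets attached, and it relies
on list_add_leaf_cfs_rq() bailing out when the cfs_rq is already on
the list):

static void attach_task_cfs_rq(struct task_struct *p)
{
        struct sched_entity *se = &p->se;

        /* ... existing code: attach p's load to its cfs_rq ... */

        /*
         * The load has been attached but nothing has been enqueued:
         * put the whole branch back in the leaf_cfs_rq list, child
         * before parent (same order as the enqueue path), so that
         * update_blocked_averages() sees and decays the contribution
         * we have just added.
         */
        for_each_sched_entity(se)
                list_add_leaf_cfs_rq(cfs_rq_of(se));
}

With something like that, removing a cfs_rq from the list in
update_blocked_averages() as in your patch should be safe for the move
of a sleeping task too: the attach puts the branch back, and it stays
there until the blocked load has decayed again.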