From: Vincent Guittot
Date: Wed, 21 Sep 2016 14:34:01 +0200
Subject: Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
To: Dietmar Eggemann
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen, Linaro Kernel Mailman List, Paul Turner, Benjamin Segall

Hi Dietmar,

On 21 September 2016 at 12:14, Dietmar Eggemann wrote:
> Hi Vincent,
>
> On 12/09/16 08:47, Vincent Guittot wrote:
>> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
>> a child will always be called before its parent.
>>
>> The hierarchical order in the shares update list was introduced by
>> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list").
>>
>> With the current implementation, a child can still be put after its parent.
>>
>> Let's take the example of
>>         root
>>           \
>>            b
>>           / \
>>          c   d*
>>              |
>>              e*
>>
>> with root -> b -> c already enqueued but not d -> e, so the leaf_cfs_rq_list
>> looks like: head -> c -> b -> root -> tail
>>
>> The branch d -> e will be added the first time it is enqueued,
>> starting with e, then d.
>>
>> When e is added, its parent is not yet on the list, so e is put at the
>> tail: head -> c -> b -> root -> e -> tail
>>
>> Then d is added at the head, because its parent is already on the list:
>> head -> d -> c -> b -> root -> e -> tail
>>
>> e is not at the right position and will be called last, whereas it
>> should be called first.
>>
>> Because enqueueing is bottom-up, we are sure to finish by adding either
>> a cfs_rq without a parent or a cfs_rq whose parent is already on the
>> list. We can use this event to detect when we have finished adding a
>> new branch. The others, whose parents are not yet on the list, must be
>> added after the children inserted in the previous steps, and after any
>> parents already on the list. The easiest way is to put each cfs_rq just
>> after the last inserted one and to keep track of it until the branch is
>> fully added.
>
> [...]
>
> I tested it and it does the right thing for this test case (task t runs
> alternately on Cpu0 and Cpu1 in the tg with tg_css_id=6) in a multi-level
> task group hierarchy (tg_css_id=2,4,6).

Thanks for testing

>
> I wonder if this patch is related to the comment "TODO: fix up out-of-order
> children on enqueue." in update_shares_cpu() of commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation")?
I didn't have that comment in mind when I wrote this patch; the goal was
only to ensure that the propagation of load and utilization in children is
done before their parents are tested. That said, I agree that the comment
probably points to the same ordering issue as the one this patch solves.

>
> I guess in the meantime we lost the functionality to remove a cfs_rq from
> the leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g.
> t migrates away from Cpu1, all the cfs_rq's of the task hierarchy (for
> tg_css_id=2,4,6) owned by Cpu1 stay in the leaf_cfs_rq list.
>
> Shouldn't we reintegrate it? The following patch goes on top of this patch:

I see one problem: once a cfs_rq has been removed from the list, it will
not be periodically updated anymore until the next enqueue. A sleeping task
is only attached, not enqueued, when it moves into a cfs_rq, so its load is
added to the cfs_rq's load, which has to be periodically updated/decayed.
So adding a cfs_rq to the list only during an enqueue is not enough in this
case.

Also, only the highest child level of the task group will be removed,
because cfs_rq->nr_running will be > 0 as soon as a child task group is
created and enqueued into a task group. Or you could use cfs.h_nr_running
instead; I don't know all the implications.

Regards,
Vincent

>
> -- >8 --
>
> From: Dietmar Eggemann
> Date: Tue, 20 Sep 2016 17:30:09 +0100
> Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
>  update_blocked_averages()
>
> There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
> list in case there are no se's enqueued on it.
>
> The functionality had been initially introduced by commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation") but was taken out by commit 9d89c257dfb9 ("sched/fair:
> Rewrite runnable load and utilization average tracking").
>
> Signed-off-by: Dietmar Eggemann
> ---
>  kernel/sched/fair.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd530359ef84..951c83337e2b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)
>  			update_tg_load_avg(cfs_rq, 0);
>
>  		/* Propagate pending load changes to the parent */
> -		if (cfs_rq->tg->se[cpu])
> +		if (cfs_rq->tg->se[cpu]) {
>  			update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +
> +			if (!cfs_rq->nr_running)
> +				list_del_leaf_cfs_rq(cfs_rq);
> +		}
>  	}
>  	raw_spin_unlock_irqrestore(&rq->lock, flags);
>  }
> --
> 1.9.1
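
The PATCH 2/7 v3 code itself is not quoted in this thread. As a rough
illustration of the insertion scheme the quoted commit message describes
(put a cfs_rq whose parent is missing right after the last inserted one,
and reset that cursor once the branch connects to the list), here is a
minimal user-space model; the singly linked list and all names are
illustrative simplifications, not the kernel implementation:

#include <stdio.h>
#include <stddef.h>

struct cfs_rq {
        const char *name;
        struct cfs_rq *parent;  /* NULL for the root cfs_rq */
        int on_list;
        struct cfs_rq *next;    /* stand-in for leaf_cfs_rq_list linkage */
};

static struct cfs_rq *head;             /* leaf_cfs_rq_list, children first */
static struct cfs_rq **cursor = &head;  /* insertion point for a cfs_rq
                                         * whose parent is not listed yet */

static void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
        struct cfs_rq **pos = &head;

        if (cfs_rq->on_list)
                return;
        cfs_rq->on_list = 1;

        if (cfs_rq->parent && cfs_rq->parent->on_list) {
                /* Parent already listed: insert just before it, so every
                 * child keeps preceding it; the branch is now connected. */
                while (*pos != cfs_rq->parent)
                        pos = &(*pos)->next;
                cursor = &head;
        } else if (!cfs_rq->parent) {
                /* Root cfs_rq: append at the tail; the branch is closed. */
                while (*pos)
                        pos = &(*pos)->next;
                cursor = &head;
        } else {
                /* Parent not listed yet: insert right after the last cfs_rq
                 * we added, and remember this position so the parent
                 * (enqueued next, bottom-up) lands after us. */
                pos = cursor;
                cursor = &cfs_rq->next;
        }
        cfs_rq->next = *pos;
        *pos = cfs_rq;
}

int main(void)
{
        struct cfs_rq root = { "root", NULL,  0, NULL };
        struct cfs_rq b    = { "b",    &root, 0, NULL };
        struct cfs_rq c    = { "c",    &b,    0, NULL };
        struct cfs_rq d    = { "d",    &b,    0, NULL };
        struct cfs_rq e    = { "e",    &d,    0, NULL };
        struct cfs_rq *it;

        /* root -> b -> c enqueued first: head -> c -> b -> root */
        list_add_leaf_cfs_rq(&c);
        list_add_leaf_cfs_rq(&b);
        list_add_leaf_cfs_rq(&root);

        /* then the d -> e branch, bottom-up: e first, then d */
        list_add_leaf_cfs_rq(&e);
        list_add_leaf_cfs_rq(&d);

        for (it = head; it; it = it->next)
                printf("%s ", it->name);  /* prints: e c d b root */
        printf("\n");
        return 0;
}

With this ordering guaranteed, a walk of the list from head to tail (as in
update_blocked_averages()) always visits children before their parents, so
load and utilization changes are propagated upwards in one pass.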