linux-kernel.vger.kernel.org archive mirror
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>,
	peterz@infradead.org, mingo@kernel.org,
	linux-kernel@vger.kernel.org
Cc: yuyang.du@intel.com, Morten.Rasmussen@arm.com, pjt@google.com,
	bsegall@google.com, kernellwp@gmail.com
Subject: Re: [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load
Date: Thu, 10 Nov 2016 17:04:11 +0000	[thread overview]
Message-ID: <85a264ef-3879-d0a9-ca7a-5f9e746cfd60@arm.com> (raw)
In-Reply-To: <1478598827-32372-1-git-send-email-vincent.guittot@linaro.org>

On 08/11/16 09:53, Vincent Guittot wrote:
> Ensure that the move of a sched_entity will be reflected in load and
> utilization of the task_group hierarchy.
> 
> When a sched_entity moves between groups or CPUs, the load and
> utilization of the cfs_rq don't reflect the changes immediately but
> converge to the new values. As a result, the metrics are no longer
> aligned with the new balance of load in the system, and subsequent
> decisions will have a biased view.
> 
> This patchset synchronizes load/utilization of a sched_entity with its
> child cfs_rq (se->my_q) only when tasks move to/from the child cfs_rq:
> - moves between task groups
> - migrations between CPUs
> Otherwise, PELT is updated as usual.
> 
> This version doesn't include any changes related to discussions that
> started during the review of the previous version about:
> - encapsulating the sequence for changing the property of a task
> - removing a cfs_rq from the list during update_blocked_averages
> These topics don't gain anything from being included in this patchset,
> as they are fairly independent and deserve separate patches.
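
The propagation described above can be illustrated with a small
standalone sketch (hypothetical user-space code, not kernel code; it
only borrows names like cfs_rq, load_avg and add_positive from the
series): when a sched_entity attaches to or detaches from a child
cfs_rq, the load delta is pushed up the hierarchy immediately instead
of letting each level converge on its own.

```c
/* Simplified model of propagating a load delta up a cfs_rq hierarchy.
 * Hypothetical illustration only; real PELT state and locking are far
 * more involved. */
struct cfs_rq {
	long load_avg;
	struct cfs_rq *parent;	/* NULL at the root */
};

/* Like the kernel's add_positive(): apply a signed delta but never let
 * the sum go negative (the PELT sums are unsigned in the kernel). */
static void add_positive(long *ptr, long delta)
{
	*ptr += delta;
	if (*ptr < 0)
		*ptr = 0;
}

/* Push a load delta from a child cfs_rq up to the root, so every level
 * of the task_group hierarchy reflects the move at once. */
static void propagate_load(struct cfs_rq *cfs_rq, long delta)
{
	for (; cfs_rq; cfs_rq = cfs_rq->parent)
		add_positive(&cfs_rq->load_avg, delta);
}
```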

Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

I tested your v7 against tip with a synthetic workload (one task,
run/period = 8ms/16ms) running in a third-level task group
(tg_root/tgX/tgXX/tgXXX) and switching every 160ms between CPUs (cpu1 <->
cpu2), sched classes (fair <-> rt), or task groups
(tg_root/tg1/tg11/tg111 <-> tg_root/tg2/tg21/tg211).

I shared the results under:

https://drive.google.com/drive/folders/0B2f-ZAwV_YnmYUM4X0NOOXZxdkk

The html files contain diagrams showing the value of the individual PELT
signals (util_avg, load_avg, runnable_load_avg, tg_load_avg_contrib) on
cfs_rq's and se's as well as tg's involved. The green signals are tip,
the blue signals are v7.

The diagrams show the intended behaviour of propagating PELT changes down
the tg hierarchy.
The only small issue concerns the load_avg, runnable_load_avg and
tg_load_avg_contrib signals in the 'switching between cpus' case
('20161110_reflect_v7_3rd_level_pelt_switch_between_cpus.html').

The cfs_rq:load_avg signal [cell [11]] is ~500 -> ~500 -> ~350 -> ~300
for tg111 (tg_css_id = 4) -> tg11 (tg_css_id = 3) -> tg1 (tg_css_id = 2)
-> tg_root (tg_css_id = 1), where I expected it to be ~500 on all tg
levels. Since this problem only occurs in the 'switching cpu' test case,
only for load (not utilization), and since the value for the initial run
on a cpu is ~500 all the way down (it only drops once we have switched at
least once), it probably has something to do with how we calculate shares
and/or with missing PELT updates on idle CPUs. But IMHO we can
investigate this later.

> Changes since v6:
> -fix warning and error raised by lkp
> 
> Changes since v5:
> - factorize detach entity like for attach
> - fix add_positive
> - Fixed a few coding style issues
> 
> Changes since v4:
> - minor typo and commit message changes
> - move call to cfs_rq_clock_task(cfs_rq) in post_init_entity_util_avg
> 
> Changes since v3:
> - Replaced the 2 arguments of update_load_avg by 1 flags argument
> - Propagated move in runnable_load_avg when sched_entity is already on_rq
> - Ensure that intermediate values will not reach memory when updating load
>   and utilization
> - Optimized the calculation of load_avg of the sched_entity
> - Fixed some typos
> 
> Changes since v2:
> - Propagate both utilization and load
> - Synced sched_entity and se->my_q instead of adding the delta
> 
> Changes since v1:
> - This patch needs the patch that fixes issue with rq->leaf_cfs_rq_list
>   "sched: fix hierarchical order in rq->leaf_cfs_rq_list" in order to work
>   correctly. I haven't sent them as a single patchset because the fix is
>   independent of this one
> - Merge some functions that are always used together
> - During update of blocked load, ensure that the sched_entity is synced
>   with the cfs_rq applying changes
> - Fix an issue when task changes its cpu affinity
> 
> Vincent Guittot (6):
>   sched: factorize attach/detach entity
>   sched: fix hierarchical order in rq->leaf_cfs_rq_list
>   sched: factorize PELT update
>   sched: propagate load during synchronous attach/detach
>   sched: propagate asynchronous detach
>   sched: fix task group initialization
> 
>  kernel/sched/core.c  |   1 +
>  kernel/sched/fair.c  | 395 ++++++++++++++++++++++++++++++++++++++++-----------
>  kernel/sched/sched.h |   2 +
>  3 files changed, 318 insertions(+), 80 deletions(-)
> 


Thread overview: 17+ messages
2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
2016-11-08  9:53 ` [PATCH 1/6 v7] sched: factorize attach/detach entity Vincent Guittot
2016-11-16 12:15   ` [tip:sched/core] sched/fair: Factorize " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 2/6 v7] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
2016-11-16 12:15   ` [tip:sched/core] sched/fair: Fix " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 3/6 v7] sched: factorize PELT update Vincent Guittot
2016-11-16 12:16   ` [tip:sched/core] sched/fair: Factorize " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach Vincent Guittot
2016-11-09 15:03   ` Peter Zijlstra
2016-11-09 15:23     ` Vincent Guittot
2016-11-16 12:16   ` [tip:sched/core] sched/fair: Propagate " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 5/6 v7] sched: propagate asynchronous detach Vincent Guittot
2016-11-16 12:17   ` [tip:sched/core] sched/fair: Propagate " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 6/6 v7] sched: fix task group initialization Vincent Guittot
2016-11-16 12:17   ` [tip:sched/core] sched/fair: Fix " tip-bot for Vincent Guittot
2016-11-10 17:04 ` Dietmar Eggemann [this message]
2016-11-12  9:27   ` [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
