Hi Peter, The attach/detach twice problem is worse than Vincent reported. The attach twice issue can happen not only as Vincent raised when task moves between groups, but also when switching to fair class. In addition, for newly forked task, the detach also has a problem. This patchset attempts to address all of those problems. Thanks a lot to Vincent and Peter. This new version addresses their comments to reword the changelog and comments. Also thanks to Matt for the suggestion to fix the newly forked task's detach problem easier. Thanks, Yuyang -- Yuyang Du (5): sched/fair: Clean up attach_entity_load_avg() sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group sched/fair: Move load and util avgs from wake_up_new_task() to sched_fork() sched/fair: Skip detach sched avgs for new task when changing task groups sched/fair: Add inline to detach_entity_load_evg() kernel/sched/core.c | 5 +-- kernel/sched/fair.c | 96 ++++++++++++++++++++++++++++++-------------------- kernel/sched/sched.h | 2 +- 3 files changed, 61 insertions(+), 42 deletions(-) -- 1.7.9.5
attach_entity_load_avg() is called (indirectly) from: - switched_to_fair(): switch between classes to fair - task_move_group_fair(): move between task groups - enqueue_entity_load_avg(): enqueue entity Only in switched_to_fair() is it possible that the task's last_update_time is not 0 and therefore the task needs sched avgs update, so move the task sched avgs update to switched_to_fair() only. In addition, the code is refactored and code comments are updated. No functionality change. Signed-off-by: Yuyang Du <yuyang.du@intel.com> --- kernel/sched/fair.c | 43 ++++++++++++++++++++----------------------- 1 file changed, 20 insertions(+), 23 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c6dd8ba..34e658b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2935,24 +2935,6 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg) static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { - if (!sched_feat(ATTACH_AGE_LOAD)) - goto skip_aging; - - /* - * If we got migrated (either between CPUs or between cgroups) we'll - * have aged the average right before clearing @last_update_time. - */ - if (se->avg.last_update_time) { - __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)), - &se->avg, 0, 0, NULL); - - /* - * XXX: we could have just aged the entire load away if we've been - * absent from the fair class for too long. - */ - } - -skip_aging: se->avg.last_update_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se->avg.load_sum; @@ -2962,6 +2944,19 @@ skip_aging: cfs_rq_util_change(cfs_rq); } +static inline void attach_age_load_task(struct rq *rq, struct task_struct *p) +{ + struct sched_entity *se = &p->se; + + if (!sched_feat(ATTACH_AGE_LOAD)) + return; + + if (se->avg.last_update_time) { + __update_load_avg(cfs_rq_of(se)->avg.last_update_time, cpu_of(rq), + &se->avg, 0, 0, NULL); + } +} + static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)), @@ -3091,6 +3086,7 @@ static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} static inline void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} +static inline void attach_age_load_task(struct rq *rq, struct task_struct *p) {} static inline int idle_balance(struct rq *rq) { @@ -8390,6 +8386,12 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p) static void switched_to_fair(struct rq *rq, struct task_struct *p) { + /* + * If we change between classes, age the averages before attaching them. + * XXX: we could have just aged the entire load away if we've been + * absent from the fair class for too long. + */ + attach_age_load_task(rq, p); attach_task_cfs_rq(p); if (task_on_rq_queued(p)) { @@ -8441,11 +8443,6 @@ static void task_move_group_fair(struct task_struct *p) { detach_task_cfs_rq(p); set_task_rq(p, task_cpu(p)); - -#ifdef CONFIG_SMP - /* Tell se's cfs_rq has been changed -- migrated */ - p->se.avg.last_update_time = 0; -#endif attach_task_cfs_rq(p); } -- 1.7.9.5
Vincent reported that the first task to a new task group's cfs_rq will be attached in attach_task_cfs_rq() and once more when it is enqueued (see https://lkml.org/lkml/2016/5/25/388). Actually, it is worse. The sched avgs can be sometimes attached twice not only when we change task groups but also when we switch to fair class. These two scenarios will be described in the following respectively. (1) Switch to fair class: The sched class change is done like this: if (queued) enqueue_task(); check_class_changed() switched_from() switched_to() If the task is on_rq, before switched_to(), it has been enqueued, which already attached sched avgs to the cfs_rq if the task's last_update_time is 0, which can happen if the task was never fair class, if so, we shouldn't attach it again in switched_to(), otherwise, we attach it twice. To address both the on_rq and !on_rq cases, as well as both the task was switched from fair and otherwise, the simplest solution is to reset the task's last_update_time to 0 when the task is switched from fair. Then let task enqueue do the sched avgs attachment uniformly only once. (2) Change between fair task groups: The task groups are changed like this: if (queued) dequeue_task() task_move_group() if (queued) enqueue_task() Unlike the switch to fair class case, if the task is on_rq, it will be enqueued right away after we move task groups, and if not, in the future when the task is runnable. The attach twice problem can happen if the cfs_rq and the task are both new as Vincent discovered. The simplest solution is to only reset the task's last_update_time in task_move_group(), and then let enqueue_task() do the sched avgs attachment. Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Yuyang Du <yuyang.du@intel.com> --- kernel/sched/fair.c | 71 +++++++++++++++++++++++++++++++++------------------ 1 file changed, 46 insertions(+), 25 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 34e658b..68fcd2e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2933,7 +2933,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg) update_tg_load_avg(cfs_rq, 0); } -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) +/* Synchronize task with its cfs_rq */ +static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { se->avg.last_update_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; @@ -2944,19 +2945,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s cfs_rq_util_change(cfs_rq); } -static inline void attach_age_load_task(struct rq *rq, struct task_struct *p) -{ - struct sched_entity *se = &p->se; - - if (!sched_feat(ATTACH_AGE_LOAD)) - return; - - if (se->avg.last_update_time) { - __update_load_avg(cfs_rq_of(se)->avg.last_update_time, cpu_of(rq), - &se->avg, 0, 0, NULL); - } -} - static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)), @@ -3031,6 +3019,11 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq) } #endif +static inline void reset_task_last_update_time(struct task_struct *p) +{ + p->se.avg.last_update_time = 0; +} + /* * Task first catches up with cfs_rq, and then subtract * itself from the cfs_rq (task must be off the queue now). @@ -3083,10 +3076,8 @@ dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} static inline void remove_entity_load_avg(struct sched_entity *se) {} static inline void -attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} -static inline void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} -static inline void attach_age_load_task(struct rq *rq, struct task_struct *p) {} +static inline void reset_task_last_update_time(struct task_struct *p) {} static inline int idle_balance(struct rq *rq) { @@ -8372,8 +8363,19 @@ static void attach_task_cfs_rq(struct task_struct *p) se->depth = se->parent ? se->parent->depth + 1 : 0; #endif - /* Synchronize task with its cfs_rq */ - attach_entity_load_avg(cfs_rq, se); + /* + * At this point, we don't do sched avgs attachment. Instead, + * we uniformly do the attachement at enqueue time in all + * scenarios: + * + * (1) Actually, we may have already done it (e.g., at enqueue_task() + * in __sched_setscheduler() when a task switches to fair class). + * + * (2) We will do it right away (e.g., in sched_move_task() when + * an on_rq task moves between groups) + * + * (3) In the future when this currently !on_rq task is enqueued. + */ if (!vruntime_normalized(p)) se->vruntime += cfs_rq->min_vruntime; @@ -8382,16 +8384,30 @@ static void attach_task_cfs_rq(struct task_struct *p) static void switched_from_fair(struct rq *rq, struct task_struct *p) { detach_task_cfs_rq(p); + /* + * If the task changes back to fair class, we will attach the + * task's sched avgs when it is enqueued, as the task's last_update_time + * is reset to 0. + * + * This simplifies sched avgs attachment when a task is switched + * to fair class, as we can unify the case where the task was + * never fair class and the case where the task was fair class. + * + * Having lost last_update_time, we can't age the task's sched avgs + * before attaching them, so the last updated sched avgs (in the + * above detach_task_cfs_rq()) will be used. + * + * Typically the time between switched_from_fair() and switched_to_fair() + * most likely could have just aged the entire load/util away. + * Despite that, there is no better way to account for the lost + * record of sched avgs during that time, therefore, we simply + * lean toward no aging at all. + */ + reset_task_last_update_time(p); } static void switched_to_fair(struct rq *rq, struct task_struct *p) { - /* - * If we change between classes, age the averages before attaching them. - * XXX: we could have just aged the entire load away if we've been - * absent from the fair class for too long. - */ - attach_age_load_task(rq, p); attach_task_cfs_rq(p); if (task_on_rq_queued(p)) { @@ -8444,6 +8460,11 @@ static void task_move_group_fair(struct task_struct *p) detach_task_cfs_rq(p); set_task_rq(p, task_cpu(p)); attach_task_cfs_rq(p); + /* + * This ensures we will attach the sched avgs when the task is + * enqueued as the task's last_update_time is reset to 0. + */ + reset_task_last_update_time(p); } void free_fair_sched_group(struct task_group *tg) -- 1.7.9.5
Move new task initialization to sched_fork(). For initial non-fair class task, the first switched_to_fair() will do the attach correctly. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Yuyang Du <yuyang.du@intel.com> --- kernel/sched/core.c | 5 +++-- kernel/sched/fair.c | 14 +++++--------- kernel/sched/sched.h | 2 +- 3 files changed, 9 insertions(+), 12 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7f2cae4..5d72567 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2370,6 +2370,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) if (p->sched_class->task_fork) p->sched_class->task_fork(p); + /* Initialize new task's sched averages */ + init_entity_sched_avg(&p->se); + /* * The child is not yet in the pid-hash so no cgroup attach races, * and the cgroup is pinned to this child due to cgroup_fork() @@ -2510,8 +2513,6 @@ void wake_up_new_task(struct task_struct *p) struct rq_flags rf; struct rq *rq; - /* Initialize new task's runnable average */ - init_entity_runnable_average(&p->se); raw_spin_lock_irqsave(&p->pi_lock, rf.flags); #ifdef CONFIG_SMP /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 68fcd2e..6e870c6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -668,8 +668,8 @@ static unsigned long task_h_load(struct task_struct *p); #define LOAD_AVG_MAX 47742 /* maximum possible load avg */ #define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */ -/* Give new sched_entity start runnable values to heavy its load in infant time */ -void init_entity_runnable_average(struct sched_entity *se) +/* Give new sched_entity start load values to heavy its load in infant time */ +void init_entity_sched_avg(struct sched_entity *se) { struct sched_avg *sa = &se->avg; @@ -738,12 +738,8 @@ void post_init_entity_util_avg(struct sched_entity *se) static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq); static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq); #else -void init_entity_runnable_average(struct sched_entity *se) -{ -} -void post_init_entity_util_avg(struct sched_entity *se) -{ -} +void init_entity_sched_avg(struct sched_entity *se) { } +void post_init_entity_util_avg(struct sched_entity *se) { } #endif /* @@ -8514,7 +8510,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_rq(cfs_rq); init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]); - init_entity_runnable_average(se); + init_entity_sched_avg(se); post_init_entity_util_avg(se); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 72f1f30..6087950 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1321,7 +1321,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se); unsigned long to_ratio(u64 period, u64 runtime); -extern void init_entity_runnable_average(struct sched_entity *se); +extern void init_entity_sched_avg(struct sched_entity *se); extern void post_init_entity_util_avg(struct sched_entity *se); #ifdef CONFIG_NO_HZ_FULL -- 1.7.9.5
Newly forked task has not been enqueued, so should not be removed from cfs_rq in task_move_group_fair(). To do so, we identify newly forked tasks by their sched_avg's last_update_time in detach_entity_load_avg(). Signed-off-by: Yuyang Du <yuyang.du@intel.com> --- kernel/sched/fair.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6e870c6..e0f219b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2943,6 +2943,10 @@ static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_en static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { + /* Newly forked tasks are not attached yet. */ + if (!se->avg.last_update_time) + return; + __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL); -- 1.7.9.5
detach_entity_load_evg() is only called by detach_task_cfs_rq(), so explicitly add inline attribute to it. Signed-off-by: Yuyang Du <yuyang.du@intel.com> --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e0f219b..2bbda0d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2941,7 +2941,8 @@ static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_en cfs_rq_util_change(cfs_rq); } -static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) +/* Catch up with the cfs_rq and remove our load when we leave */ +static inline void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { /* Newly forked tasks are not attached yet. */ if (!se->avg.last_update_time) @@ -8346,7 +8347,6 @@ static void detach_task_cfs_rq(struct task_struct *p) se->vruntime -= cfs_rq->min_vruntime; } - /* Catch up with the cfs_rq and remove our load when we leave */ detach_entity_load_avg(cfs_rq, se); } -- 1.7.9.5
On Thu, 09 Jun, at 07:15:50AM, Yuyang Du wrote:
> attach_entity_load_avg() is called (indirectly) from:
>
> - switched_to_fair(): switch between classes to fair
> - task_move_group_fair(): move between task groups
> - enqueue_entity_load_avg(): enqueue entity
>
> Only in switched_to_fair() is it possible that the task's last_update_time
> is not 0 and therefore the task needs sched avgs update, so move the task
> sched avgs update to switched_to_fair() only. In addition, the code is
> refactored and code comments are updated.
>
> No functionality change.
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
> kernel/sched/fair.c | 43 ++++++++++++++++++++-----------------------
> 1 file changed, 20 insertions(+), 23 deletions(-)
Looks OK to me and makes the code easier to understand. Chasing
->avg.last_update_time values is tricky at the best of times.
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
On Thu, 09 Jun, at 07:15:53AM, Yuyang Du wrote:
> Newly forked task has not been enqueued, so should not be removed from
> cfs_rq in task_move_group_fair(). To do so, we identify newly forked
> tasks by their sched_avg's last_update_time in detach_entity_load_avg().
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e870c6..e0f219b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2943,6 +2943,10 @@ static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_en
>
> static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> + /* Newly forked tasks are not attached yet. */
> + if (!se->avg.last_update_time)
> + return;
> +
Now that we no longer have the @fork parameter to provide context,
this comment would benefit from some expansion. What about something
like this,
/*
* Newly forked tasks that are being moved between groups
* haven't been enqueued anywhere yet, and don't need detaching.
*/
?
On Tue, Jun 14, 2016 at 12:11:52PM +0100, Matt Fleming wrote:
> On Thu, 09 Jun, at 07:15:50AM, Yuyang Du wrote:
> > attach_entity_load_avg() is called (indirectly) from:
> >
> > - switched_to_fair(): switch between classes to fair
> > - task_move_group_fair(): move between task groups
> > - enqueue_entity_load_avg(): enqueue entity
> >
> > Only in switched_to_fair() is it possible that the task's last_update_time
> > is not 0 and therefore the task needs sched avgs update, so move the task
> > sched avgs update to switched_to_fair() only. In addition, the code is
> > refactored and code comments are updated.
> >
> > No functionality change.
> >
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> > ---
> > kernel/sched/fair.c | 43 ++++++++++++++++++++-----------------------
> > 1 file changed, 20 insertions(+), 23 deletions(-)
>
> Looks OK to me and makes the code easier to understand. Chasing
> ->avg.last_update_time values is tricky at the best of times.
>
> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
So I still wonder at the point of this patch, since the next patch
deletes this code entirely. What's the point of cleaning it up if the
next patch removes it out-right.
On Thu, Jun 09, 2016 at 07:15:53AM +0800, Yuyang Du wrote: > Newly forked task has not been enqueued, so should not be removed from > cfs_rq in task_move_group_fair(). To do so, we identify newly forked > tasks by their sched_avg's last_update_time in detach_entity_load_avg(). > static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > + /* Newly forked tasks are not attached yet. */ > + if (!se->avg.last_update_time) > + return; Urgh, so this results in two different heuristics to detect 'new' tasks and gives two different meanings to !last_update_time. How about you use the existing heuristic as per vruntime_normalized() and do: if (!se->sum_exec_runtime) return;
On Tue, Jun 14, 2016 at 04:36:49PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 09, 2016 at 07:15:53AM +0800, Yuyang Du wrote:
> > Newly forked task has not been enqueued, so should not be removed from
> > cfs_rq in task_move_group_fair(). To do so, we identify newly forked
> > tasks by their sched_avg's last_update_time in detach_entity_load_avg().
>
> > static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > + /* Newly forked tasks are not attached yet. */
> > + if (!se->avg.last_update_time)
> > + return;
>
> Urgh, so this results in two different heuristics to detect 'new' tasks
> and gives two different meanings to !last_update_time.
>
> How about you use the existing heuristic as per vruntime_normalized()
> and do:
>
> if (!se->sum_exec_runtime)
> return;
Hurm,. I see we already have this confusion as per
remove_entity_load_avg(). Could we fix it there too?
On Tue, Jun 14, 2016 at 12:11:52PM +0100, Matt Fleming wrote:
> On Thu, 09 Jun, at 07:15:50AM, Yuyang Du wrote:
> > attach_entity_load_avg() is called (indirectly) from:
> >
> > - switched_to_fair(): switch between classes to fair
> > - task_move_group_fair(): move between task groups
> > - enqueue_entity_load_avg(): enqueue entity
> >
> > Only in switched_to_fair() is it possible that the task's last_update_time
> > is not 0 and therefore the task needs sched avgs update, so move the task
> > sched avgs update to switched_to_fair() only. In addition, the code is
> > refactored and code comments are updated.
> >
> > No functionality change.
> >
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> > ---
> > kernel/sched/fair.c | 43 ++++++++++++++++++++-----------------------
> > 1 file changed, 20 insertions(+), 23 deletions(-)
>
> Looks OK to me and makes the code easier to understand. Chasing
> ->avg.last_update_time values is tricky at the best of times.
>
> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Thanks, Matt.