* [PATCH v6 0/4] sched/fair: Fix attach and detach sched avgs for task group change and sched class change
@ 2016-06-14 22:21 Yuyang Du
  2016-06-14 22:21 ` [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group Yuyang Du
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Yuyang Du @ 2016-06-14 22:21 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

Hi Peter,

The attach/detach twice problem is worse than Vincent reported.

The attach-twice issue can happen not only when a task moves between groups,
as Vincent reported, but also when a task switches to the fair class. In
addition, the detach path has a problem for newly forked tasks. This patchset
attempts to address all of those problems.

Thanks a lot to Vincent and Peter. This new version addresses their comments
by rewording the changelog and comments. Also thanks to Matt for suggesting a
simpler way to fix the newly forked task's detach problem.

This version also incorporates Peter's suggestion to unify the 'new' task test
in vruntime_normalized(), detach_entity_load_avg(), and remove_entity_load_avg().

Thanks,
Yuyang

--

Yuyang Du (4):
  sched/fair: Fix attaching task sched avgs twice when switching to
    fair or changing task group
  sched/fair: Move load and util avgs from wake_up_new_task() to
    sched_fork()
  sched/fair: Skip detach sched avgs for new task when changing task
    groups
  sched/fair: Add inline to detach_entity_load_avg()

 kernel/sched/core.c  |    5 +--
 kernel/sched/fair.c  |   98 +++++++++++++++++++++++++++++---------------------
 kernel/sched/sched.h |    2 +-
 3 files changed, 62 insertions(+), 43 deletions(-)

-- 
1.7.9.5


* [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-14 22:21 [PATCH v6 0/4] sched/fair: Fix attach and detach sched avgs for task group change and sched class change Yuyang Du
@ 2016-06-14 22:21 ` Yuyang Du
  2016-06-15  7:46   ` Vincent Guittot
  2016-06-14 22:21 ` [PATCH v6 2/4] sched/fair: Move load and util avgs from wake_up_new_task() to sched_fork() Yuyang Du
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-06-14 22:21 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

Vincent reported that the first task moved to a new task group's cfs_rq will
be attached in attach_task_cfs_rq() and then once more when it is enqueued
(see https://lkml.org/lkml/2016/5/25/388).

Actually, it is worse. The sched avgs can sometimes be attached twice, not
only when we change task groups but also when we switch to the fair class.
These two scenarios are described below.

(1) Switch to fair class:

The sched class change is done like this:

	if (queued)
	  enqueue_task();
	check_class_changed()
	  switched_from()
	  switched_to()

If the task is on_rq, it has already been enqueued before switched_to().
That enqueue attaches the sched avgs to the cfs_rq if the task's
last_update_time is 0, which can happen if the task was never in the fair
class. In that case we should not attach again in switched_to(); otherwise
we attach twice.

To address both the on_rq and !on_rq cases, as well as both the case where
the task was switched from the fair class and the case where it was not, the
simplest solution is to reset the task's last_update_time to 0 when the task
is switched from the fair class. Then let task enqueue do the sched avgs
attachment uniformly, only once.

(2) Change between fair task groups:

The task groups are changed like this:

	if (queued)
          dequeue_task()
	task_move_group()
	if (queued)
          enqueue_task()

Unlike the switch-to-fair-class case, if the task is on_rq it will be
enqueued right away after we move task groups, and if not, it will be
enqueued in the future when it becomes runnable. The attach-twice problem
can happen if the cfs_rq and the task are both new, as Vincent discovered.
The simplest solution is to only reset the task's last_update_time in
task_move_group(), and then let enqueue_task() do the sched avgs attachment.
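
As a toy illustration of the double accounting (ordinary user-space C, not
kernel code; the structures and numbers below are made up for the example),
attaching the same entity twice inflates the cfs_rq's load average, while
resetting last_update_time and attaching only at enqueue counts it once:

#include <stdio.h>

/* Toy stand-ins for the kernel structures; illustrative only. */
struct toy_avg {
	unsigned long load_avg;
	unsigned long long last_update_time;
};
struct toy_cfs_rq { struct toy_avg avg; };
struct toy_se { struct toy_avg avg; };

/* Mimics the effect of attach_entity_load_avg(): add the entity's load. */
static void toy_attach(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	se->avg.last_update_time = cfs_rq->avg.last_update_time;
	cfs_rq->avg.load_avg += se->avg.load_avg;
}

/* Mimics enqueue: attach only if the entity looks unattached (time == 0). */
static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	if (!se->avg.last_update_time)
		toy_attach(cfs_rq, se);
}

int main(void)
{
	/* Both the group's cfs_rq and the task are new, as in Vincent's report. */
	struct toy_cfs_rq cfs_rq = { .avg = { .load_avg = 0, .last_update_time = 0 } };
	struct toy_se se = { .avg = { .load_avg = 512, .last_update_time = 0 } };

	/* Buggy sequence: attach in attach_task_cfs_rq(), then the enqueue
	 * attaches again because last_update_time is still 0. */
	toy_attach(&cfs_rq, &se);
	toy_enqueue(&cfs_rq, &se);
	printf("buggy cfs_rq load_avg: %lu\n", cfs_rq.avg.load_avg);	/* 1024 */

	/* Fixed sequence: only reset last_update_time and let the enqueue
	 * do the one and only attach. */
	cfs_rq.avg.load_avg = 0;
	se.avg.last_update_time = 0;
	toy_enqueue(&cfs_rq, &se);
	printf("fixed cfs_rq load_avg: %lu\n", cfs_rq.avg.load_avg);	/* 512 */

	return 0;
}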

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   74 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 46 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f75930b..fa2c1d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2933,26 +2933,9 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 		update_tg_load_avg(cfs_rq, 0);
 }
 
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/* Synchronize task with its cfs_rq */
+static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (!sched_feat(ATTACH_AGE_LOAD))
-		goto skip_aging;
-
-	/*
-	 * If we got migrated (either between CPUs or between cgroups) we'll
-	 * have aged the average right before clearing @last_update_time.
-	 */
-	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-				  &se->avg, 0, 0, NULL);
-
-		/*
-		 * XXX: we could have just aged the entire load away if we've been
-		 * absent from the fair class for too long.
-		 */
-	}
-
-skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -3036,6 +3019,11 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
 }
 #endif
 
+static inline void reset_task_last_update_time(struct task_struct *p)
+{
+	p->se.avg.last_update_time = 0;
+}
+
 /*
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
@@ -3088,9 +3076,8 @@ dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
 
 static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void
 detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void reset_task_last_update_time(struct task_struct *p) {}
 
 static inline int idle_balance(struct rq *rq)
 {
@@ -8376,8 +8363,19 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	se->depth = se->parent ? se->parent->depth + 1 : 0;
 #endif
 
-	/* Synchronize task with its cfs_rq */
-	attach_entity_load_avg(cfs_rq, se);
+	/*
+	 * At this point, we don't do sched avgs attachment. Instead,
+	 * we uniformly do the attachment at enqueue time in all
+	 * scenarios:
+	 *
+	 * (1) Actually, we may have already done it (e.g., at enqueue_task()
+	 * in __sched_setscheduler() when a task switches to fair class).
+	 *
+	 * (2) We will do it right away (e.g., in sched_move_task() when
+	 * an on_rq task moves between groups).
+	 *
+	 * (3) In the future when this currently !on_rq task is enqueued.
+	 */
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
@@ -8386,6 +8384,26 @@ static void attach_task_cfs_rq(struct task_struct *p)
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
+	/*
+	 * If the task changes back to fair class, we will attach the
+	 * task's sched avgs when it is enqueued, as the task's last_update_time
+	 * is reset to 0.
+	 *
+	 * This simplifies sched avgs attachment when a task is switched
+	 * to fair class, as we can unify the case where the task was
+	 * never fair class and the case where the task was fair class.
+	 *
+	 * Having lost last_update_time, we can't age the task's sched avgs
+	 * before attaching them, so the last updated sched avgs (in the
+	 * above detach_task_cfs_rq()) will be used.
+	 *
+	 * Typically, the time between switched_from_fair() and
+	 * switched_to_fair() could well have aged the entire load/util away.
+	 * Despite that, there is no better way to account for the lost
+	 * record of sched avgs during that time, therefore, we simply
+	 * lean toward no aging at all.
+	 */
+	reset_task_last_update_time(p);
 }
 
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
@@ -8441,12 +8459,12 @@ static void task_move_group_fair(struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
 	set_task_rq(p, task_cpu(p));
-
-#ifdef CONFIG_SMP
-	/* Tell se's cfs_rq has been changed -- migrated */
-	p->se.avg.last_update_time = 0;
-#endif
 	attach_task_cfs_rq(p);
+	/*
+	 * This ensures we will attach the sched avgs when the task is
+	 * enqueued as the task's last_update_time is reset to 0.
+	 */
+	reset_task_last_update_time(p);
 }
 
 void free_fair_sched_group(struct task_group *tg)
-- 
1.7.9.5


* [PATCH v6 2/4] sched/fair: Move load and util avgs from wake_up_new_task() to sched_fork()
  2016-06-14 22:21 [PATCH v6 0/4] sched/fair: Fix attach and detach sched avgs for task group change and sched class change Yuyang Du
  2016-06-14 22:21 ` [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group Yuyang Du
@ 2016-06-14 22:21 ` Yuyang Du
  2016-06-14 22:21 ` [PATCH v6 3/4] sched/fair: Skip detach sched avgs for new task when changing task groups Yuyang Du
  2016-06-14 22:21 ` [PATCH v6 4/4] sched/fair: Add inline to detach_entity_load_avg() Yuyang Du
  3 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-06-14 22:21 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

Move the new task's load and util avgs initialization to sched_fork(). For a
task that initially is not in the fair class, the first switched_to_fair()
will do the attach correctly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    5 +++--
 kernel/sched/fair.c  |   14 +++++---------
 kernel/sched/sched.h |    2 +-
 3 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31c30e5..6b7d0b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2384,6 +2384,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 
+	/* Initialize new task's sched averages */
+	init_entity_sched_avg(&p->se);
+
 	/*
 	 * The child is not yet in the pid-hash so no cgroup attach races,
 	 * and the cgroup is pinned to this child due to cgroup_fork()
@@ -2524,8 +2527,6 @@ void wake_up_new_task(struct task_struct *p)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	/* Initialize new task's runnable average */
-	init_entity_runnable_average(&p->se);
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 #ifdef CONFIG_SMP
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa2c1d1..3a28922 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -668,8 +668,8 @@ static unsigned long task_h_load(struct task_struct *p);
 #define LOAD_AVG_MAX 47742 /* maximum possible load avg */
 #define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
 
-/* Give new sched_entity start runnable values to heavy its load in infant time */
-void init_entity_runnable_average(struct sched_entity *se)
+/* Give new sched_entity start load values to heavy its load in infant time */
+void init_entity_sched_avg(struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
 
@@ -738,12 +738,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);
 #else
-void init_entity_runnable_average(struct sched_entity *se)
-{
-}
-void post_init_entity_util_avg(struct sched_entity *se)
-{
-}
+void init_entity_sched_avg(struct sched_entity *se) { }
+void post_init_entity_util_avg(struct sched_entity *se) { }
 #endif
 
 /*
@@ -8517,7 +8513,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 		init_cfs_rq(cfs_rq);
 		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
-		init_entity_runnable_average(se);
+		init_entity_sched_avg(se);
 
 		raw_spin_lock_irq(&rq->lock);
 		post_init_entity_util_avg(se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf6fea9..1b1d439 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1321,7 +1321,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 
-extern void init_entity_runnable_average(struct sched_entity *se);
+extern void init_entity_sched_avg(struct sched_entity *se);
 extern void post_init_entity_util_avg(struct sched_entity *se);
 
 #ifdef CONFIG_NO_HZ_FULL
-- 
1.7.9.5


* [PATCH v6 3/4] sched/fair: Skip detach sched avgs for new task when changing task groups
  2016-06-14 22:21 [PATCH v6 0/4] sched/fair: Fix attach and detach sched avgs for task group change and sched class change Yuyang Du
  2016-06-14 22:21 ` [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group Yuyang Du
  2016-06-14 22:21 ` [PATCH v6 2/4] sched/fair: Move load and util avgs from wake_up_new_task() to sched_fork() Yuyang Du
@ 2016-06-14 22:21 ` Yuyang Du
  2016-06-14 22:21 ` [PATCH v6 4/4] sched/fair: Add inline to detach_entity_load_avg() Yuyang Du
  3 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-06-14 22:21 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

A newly forked task has not been enqueued yet, so it should not be removed
from the cfs_rq in task_move_group_fair(). To achieve this, we identify newly
forked tasks by a zero sum_exec_runtime, an existing heuristic already used
in vruntime_normalized(). In addition, use this test uniformly in
remove_entity_load_avg().

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a28922..e9f2754 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2943,6 +2943,10 @@ static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_en
 
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	/* Newly forked tasks are not attached yet. */
+	if (!se->sum_exec_runtime)
+		return;
+
 	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
@@ -3033,7 +3037,7 @@ void remove_entity_load_avg(struct sched_entity *se)
 	 * Newly created task or never used group entity should not be removed
 	 * from its (source) cfs_rq
 	 */
-	if (se->avg.last_update_time == 0)
+	if (!se->sum_exec_runtime)
 		return;
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
-- 
1.7.9.5


* [PATCH v6 4/4] sched/fair: Add inline to detach_entity_load_avg()
  2016-06-14 22:21 [PATCH v6 0/4] sched/fair: Fix attach and detach sched avgs for task group change and sched class change Yuyang Du
                   ` (2 preceding siblings ...)
  2016-06-14 22:21 ` [PATCH v6 3/4] sched/fair: Skip detach sched avgs for new task when changing task groups Yuyang Du
@ 2016-06-14 22:21 ` Yuyang Du
  3 siblings, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-06-14 22:21 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: umgwanakikbuti, bsegall, pjt, morten.rasmussen, vincent.guittot,
	dietmar.eggemann, matt, Yuyang Du

detach_entity_load_avg() is only called by detach_task_cfs_rq(), so
explicitly add the inline attribute to it.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9f2754..b5fba70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2941,7 +2941,8 @@ static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_en
 	cfs_rq_util_change(cfs_rq);
 }
 
-static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/* Catch up with the cfs_rq and remove our load when we leave */
+static inline void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	/* Newly forked tasks are not attached yet. */
 	if (!se->sum_exec_runtime)
@@ -8346,7 +8347,6 @@ static void detach_task_cfs_rq(struct task_struct *p)
 		se->vruntime -= cfs_rq->min_vruntime;
 	}
 
-	/* Catch up with the cfs_rq and remove our load when we leave */
 	detach_entity_load_avg(cfs_rq, se);
 }
 
-- 
1.7.9.5


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-15  7:46   ` Vincent Guittot
@ 2016-06-15  0:18     ` Yuyang Du
  2016-06-15 14:15       ` Vincent Guittot
  2016-06-15 15:22     ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-06-15  0:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
> I still have concerned with this change of the behavior that  attaches
> the task only when it is enqueued. The load avg of the task will not
> be decayed between the time we move it into its new group until its
> enqueue. With this change, a task's load can stay high whereas it has
> slept for the last couple of seconds.

The task will be updated in the detach when it is moved, so if its load stays
high, it is highly likely to be enqueued soon. But if the task is never
enqueued, we don't bother the new group's cfs_rq.

So this is not perfect, but nothing is perfect, and this is a simple solution.


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-14 22:21 ` [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group Yuyang Du
@ 2016-06-15  7:46   ` Vincent Guittot
  2016-06-15  0:18     ` Yuyang Du
  2016-06-15 15:22     ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-06-15  7:46 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 15 June 2016 at 00:21, Yuyang Du <yuyang.du@intel.com> wrote:
> Vincent reported that the first task to a new task group's cfs_rq will
> be attached in attach_task_cfs_rq() and once more when it is enqueued
> (see https://lkml.org/lkml/2016/5/25/388).
>
> Actually, it is worse. The sched avgs can be sometimes attached twice
> not only when we change task groups but also when we switch to fair class.
> These two scenarios will be described in the following respectively.
>
> (1) Switch to fair class:
>
> The sched class change is done like this:
>
>         if (queued)
>           enqueue_task();
>         check_class_changed()
>           switched_from()
>           switched_to()
>
> If the task is on_rq, before switched_to(), it has been enqueued, which
> already attached sched avgs to the cfs_rq if the task's last_update_time
> is 0, which can happen if the task was never fair class, if so, we
> shouldn't attach it again in switched_to(), otherwise, we attach it twice.
>
> To address both the on_rq and !on_rq cases, as well as both the task
> was switched from fair and otherwise, the simplest solution is to reset
> the task's last_update_time to 0 when the task is switched from fair.
> Then let task enqueue do the sched avgs attachment uniformly only once.
>
> (2) Change between fair task groups:
>
> The task groups are changed like this:
>
>         if (queued)
>           dequeue_task()
>         task_move_group()
>         if (queued)
>           enqueue_task()
>
> Unlike the switch to fair class case, if the task is on_rq, it will be
> enqueued right away after we move task groups, and if not, in the future
> when the task is runnable. The attach twice problem can happen if the
> cfs_rq and the task are both new as Vincent discovered. The simplest
> solution is to only reset the task's last_update_time in task_move_group(),
> and then let enqueue_task() do the sched avgs attachment.

I still have concerns with this change of behavior that attaches
the task only when it is enqueued. The load avg of the task will not
be decayed between the time we move it into its new group and the time
it is enqueued. With this change, a task's load can stay high even though
it has slept for the last couple of seconds. Then its load and utilization
are no longer accounted anywhere in the meantime, just because we have
moved a task that will be enqueued on the same rq.
A task should always be attached to a cfs_rq, and its load/utilization
should always be accounted on a cfs_rq and decayed for its sleep
period.

Regards,
Vincent


>
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-15  0:18     ` Yuyang Du
@ 2016-06-15 14:15       ` Vincent Guittot
  0 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-06-15 14:15 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 15 June 2016 at 02:18, Yuyang Du <yuyang.du@intel.com> wrote:
> On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
>> I still have concerned with this change of the behavior that  attaches
>> the task only when it is enqueued. The load avg of the task will not
>> be decayed between the time we move it into its new group until its
>> enqueue. With this change, a task's load can stay high whereas it has
>> slept for the last couple of seconds.
>
> The task will be updated when it is moved in the detach, so if it stays
> high, it is highly expected to be enqueued soon. But if the task is never

You can't make such an assumption, because if it's not enqueued soon (keep
in mind that a 10ms sleep will decay the value by 20% from max), the
metrics just become wrong. Furthermore, the load metrics of the task
become different depending on whether the task is moved or stays
on the same cfs_rq.
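
(For reference, the 20% figure follows from PELT's 32ms half-life, assuming
that standard half-life; a minimal stand-alone check, compiled with -lm:)

#include <math.h>
#include <stdio.h>

/* PELT halves a contribution every 32ms, i.e. a per-ms factor of 0.5^(1/32). */
int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);
	double left = pow(y, 10);	/* fraction remaining after a 10ms sleep */

	printf("after 10ms: %.1f%% left, %.1f%% decayed\n",
	       100.0 * left, 100.0 * (1.0 - left));	/* ~80.5% / ~19.5% */
	return 0;
}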

> enqueued, we don't bother the new gruop cfs_rq.
>
> So this is not perfect, and nothing is perfect, but a simple solution.


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-15  7:46   ` Vincent Guittot
  2016-06-15  0:18     ` Yuyang Du
@ 2016-06-15 15:22     ` Peter Zijlstra
  2016-06-16  1:00       ` Yuyang Du
  2016-06-16 16:30       ` Vincent Guittot
  1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2016-06-15 15:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
> I still have concerned with this change of the behavior that  attaches
> the task only when it is enqueued. The load avg of the task will not
> be decayed between the time we move it into its new group until its
> enqueue. With this change, a task's load can stay high whereas it has
> slept for the last couple of seconds. Then, its load and utilization
> is no more accounted anywhere in the mean time just because we have
> moved the task which will be enqueued on the same rq.
> A task should always be attached to a cfs_rq and its load/utilization
> should always be accounted on a cfs_rq and decayed for its sleep
> period

OK; so I think I agree with that. Does the below (completely untested,
hasn't even been near a compiler) look reasonable?

The general idea is to always attach to a cfs_rq -- through
post_init_entity_util_avg(). This covers both the "new task isn't
attached yet" and the "was never in the fair class to begin with" issues.

That only leaves a tiny hole in fork(), where the task is hashed but
hasn't yet passed through wake_up_new_task(), in which someone can do a
cgroup move on it. That hole is closed with TASK_NEW and can_attach()
refusing those tasks.

I think this covers all scenarios, but my brain is completely fried;
I'll go over it (and my many callgraph notes) again tomorrow.


(There's an optimization of the fork path squashed in here; I'll split
that out into a separate patch. It avoids calling set_task_cpu() twice
and doing the IRQ disable/enable dance twice.)

---
 include/linux/sched.h |  5 +++--
 kernel/sched/core.c   | 25 +++++++++++++++++++------
 kernel/sched/fair.c   | 42 +++++++++++++++++-------------------------
 3 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e31758b962ec..648126572c34 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -220,9 +220,10 @@ extern void proc_sched_set_task(struct task_struct *p);
 #define TASK_WAKING		256
 #define TASK_PARKED		512
 #define TASK_NOLOAD		1024
-#define TASK_STATE_MAX		2048
+#define TASK_NEW		2048
+#define TASK_STATE_MAX		4096
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31c30e5bc338..de5522260dc6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2340,11 +2340,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	__sched_fork(clone_flags, p);
 	/*
-	 * We mark the process as running here. This guarantees that
+	 * We mark the process as NEW here. This guarantees that
 	 * nobody will actually run it, and a signal or other external
 	 * event cannot wake it up and insert it on the runqueue either.
 	 */
-	p->state = TASK_RUNNING;
+	p->state = TASK_NEW;
 
 	/*
 	 * Make sure we do not leak PI boosting priority to the child.
@@ -2381,8 +2381,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 		p->sched_class = &fair_sched_class;
 	}
 
-	if (p->sched_class->task_fork)
-		p->sched_class->task_fork(p);
+	init_entity_runnable_average(&p->se);
 
 	/*
 	 * The child is not yet in the pid-hash so no cgroup attach races,
@@ -2393,6 +2392,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	set_task_cpu(p, cpu);
+	if (p->sched_class->task_fork)
+		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 #ifdef CONFIG_SCHED_INFO
@@ -2524,9 +2525,8 @@ void wake_up_new_task(struct task_struct *p)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	/* Initialize new task's runnable average */
-	init_entity_runnable_average(&p->se);
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
+	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
 	/*
 	 * Fork balancing, do it here and not earlier because:
@@ -8220,6 +8220,19 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 		if (task->sched_class != &fair_sched_class)
 			return -EINVAL;
 #endif
+		/*
+		 * Serialize against wake_up_new_task() such
+		 * that if its running, we're sure to observe
+		 * that if it's running, we're sure to observe
+		 */
+		raw_spin_unlock_wait(&task->pi_lock);
+		/*
+		 * Avoid calling sched_move_task() before wake_up_new_task()
+		 * has happened. This would lead to problems with PELT. See
+		 * XXX.
+		 */
+		if (task->state == TASK_NEW)
+			return -EINVAL;
 	}
 	return 0;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f75930bdd326..8c5f59fc205a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -733,6 +733,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		}
 		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
+
+	attach_entity_load_avg(cfs_rq, se);
 }
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
@@ -2941,6 +2943,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	/*
 	 * If we got migrated (either between CPUs or between cgroups) we'll
 	 * have aged the average right before clearing @last_update_time.
+	 *
+	 * Or we're fresh through post_init_entity_util_avg().
 	 */
 	if (se->avg.last_update_time) {
 		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
@@ -3046,11 +3050,14 @@ void remove_entity_load_avg(struct sched_entity *se)
 	u64 last_update_time;
 
 	/*
-	 * Newly created task or never used group entity should not be removed
-	 * from its (source) cfs_rq
+	 * tasks cannot exit without having gone through wake_up_new_task() ->
+	 * post_init_entity_util_avg() which will have added things to the
+	 * cfs_rq, so we can remove unconditionally.
+	 *
+	 * Similarly for groups, they will have passed through
+	 * post_init_entity_util_avg() before unregister_sched_fair_group()
+	 * calls this.
 	 */
-	if (se->avg.last_update_time == 0)
-		return;
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
 
@@ -4418,7 +4425,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		 *
 		 * note: in the case of encountering a throttled cfs_rq we will
 		 * post the final h_nr_running increment below.
-		*/
+		 */
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 		cfs_rq->h_nr_running++;
@@ -8255,31 +8262,17 @@ static void task_fork_fair(struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se, *curr;
-	int this_cpu = smp_processor_id();
 	struct rq *rq = this_rq();
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
 
+	raw_spin_lock(&rq->lock);
 	update_rq_clock(rq);
 
 	cfs_rq = task_cfs_rq(current);
 	curr = cfs_rq->curr;
-
-	/*
-	 * Not only the cpu but also the task_group of the parent might have
-	 * been changed after parent->se.parent,cfs_rq were copied to
-	 * child->se.parent,cfs_rq. So call __set_task_cpu() to make those
-	 * of child point to valid ones.
-	 */
-	rcu_read_lock();
-	__set_task_cpu(p, this_cpu);
-	rcu_read_unlock();
-
-	update_curr(cfs_rq);
-
-	if (curr)
+	if (curr) {
+		update_curr(cfs_rq);
 		se->vruntime = curr->vruntime;
+	}
 	place_entity(cfs_rq, se, 1);
 
 	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
@@ -8292,8 +8285,7 @@ static void task_fork_fair(struct task_struct *p)
 	}
 
 	se->vruntime -= cfs_rq->min_vruntime;
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock(&rq->lock);
 }
 
 /*


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-15 15:22     ` Peter Zijlstra
@ 2016-06-16  1:00       ` Yuyang Du
  2016-06-16 16:30       ` Vincent Guittot
  1 sibling, 0 replies; 21+ messages in thread
From: Yuyang Du @ 2016-06-16  1:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Wed, Jun 15, 2016 at 05:22:17PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
> > I still have concerned with this change of the behavior that  attaches
> > the task only when it is enqueued. The load avg of the task will not
> > be decayed between the time we move it into its new group until its
> > enqueue. With this change, a task's load can stay high whereas it has
> > slept for the last couple of seconds. Then, its load and utilization
> > is no more accounted anywhere in the mean time just because we have
> > moved the task which will be enqueued on the same rq.
> > A task should always be attached to a cfs_rq and its load/utilization
> > should always be accounted on a cfs_rq and decayed for its sleep
> > period
> 
> OK; so I think I agree with that.

OK, I agree now. I think the following should fix (or sort out) the attach-twice
problem Vincent discovered (SMP is needed; maybe move the reset() into attach()).

static void task_move_group_fair(struct task_struct *p)
{
	detach_task_cfs_rq(p);
	set_task_rq(p, task_cpu(p));
	attach_task_cfs_rq(p);
	/*
	 * If the cfs_rq's last_update_time is 0, attaching the sched avgs
	 * won't be useful, as they will be decayed to 0 when any
	 * sched_entity is enqueued to that cfs_rq.
	 *
	 * In that case, we must instead reset the task's last_update_time
	 * to ensure that we will attach the sched avgs when the task is
	 * enqueued.
	 */
	if (!cfs_rq_of(&p->se)->avg.last_update_time)
		reset_task_last_update_time(p);
	else
		attach_entity_load_avg(cfs_rq_of(&p->se), &p->se);
}

Vincent, could you please verify?

> Does the below (completely untested,
> hasn't even been near a compiler) look reasonable?
 
Goodness. But mine is also fried now. :)


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-15 15:22     ` Peter Zijlstra
  2016-06-16  1:00       ` Yuyang Du
@ 2016-06-16 16:30       ` Vincent Guittot
  2016-06-16 17:17         ` Peter Zijlstra
  2016-06-16 18:51         ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-06-16 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Wednesday 15 Jun 2016 at 17:22:17 (+0200), Peter Zijlstra wrote:
> On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
> > I still have concerned with this change of the behavior that  attaches
> > the task only when it is enqueued. The load avg of the task will not
> > be decayed between the time we move it into its new group until its
> > enqueue. With this change, a task's load can stay high whereas it has
> > slept for the last couple of seconds. Then, its load and utilization
> > is no more accounted anywhere in the mean time just because we have
> > moved the task which will be enqueued on the same rq.
> > A task should always be attached to a cfs_rq and its load/utilization
> > should always be accounted on a cfs_rq and decayed for its sleep
> > period
> 
> OK; so I think I agree with that. Does the below (completely untested,
> hasn't even been near a compiler) look reasonable?
> 
> The general idea is to always attach to a cfs_rq -- through
> post_init_entity_util_avg(). This covers both the new task isn't
> attached yet and was never in the fair class to begin with issues.

Your patch ensures that a task will be attached to a cfs_rq, and it fixes the issue raised by Yuyang that is caused by se->avg.last_update_time = 0 at init. During the test, the following message was raised: "BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1", because cfs_rq_util_change() is called in attach_entity_load_avg().

With patch [1] for the init on the cfs_rq side, all use cases are covered regarding the issue linked to a last_update_time set to 0 at init.
[1] https://lkml.org/lkml/2016/5/30/508

> 
> That only leaves a tiny hole in fork() where the task is hashed but
> hasn't yet passed through wake_up_new_task() in which someone can do
> cgroup move on it. That is closed with TASK_NEW and can_attach()
> refusing those tasks.
> 

But a new fair task is still detached from and attached to a task_group with cgroup_post_fork()-->ss->fork(child)-->cpu_cgroup_fork()-->sched_move_task()-->task_move_group_fair().
cpu_cgroup_can_attach() is not used in this path, and sched_move_task() does the move unconditionally for a fair task.

With your patch, we still have the sequence

sched_fork()
    set_task_cpu()
cgroup_post_fork()--> ... --> task_move_group_fair()
    detach_task_cfs_rq()
    set_task_rq()
    attach_task_cfs_rq()
wake_up_new_task()
    select_task_rq() can select a new cpu
    set_task_cpu()
        migrate_task_rq_fair if the new_cpu != task_cpu
             remove_load()
        __set_task_cpu
    post_init_entity_util_avg
        attach_task_cfs_rq()
    activate_task
        enqueue_task

In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(), so we can just call that small part directly instead of sched_move_task(). And the task doesn't really migrate because it is not yet attached, so we need the sequence:
sched_fork()
    __set_task_cpu()
cgroup_post_fork()--> ... --> task_move_group_fair()
    set_task_rq() to set task group and runqueue
wake_up_new_task()
    select_task_rq() can select a new cpu
    __set_task_cpu
    post_init_entity_util_avg
        attach_task_cfs_rq()
    activate_task
        enqueue_task

The patch below, on top of your patch, ensures that we follow the right sequence:

---
 kernel/sched/core.c | 60 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7895689a..a21e3dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2373,7 +2373,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 * Silence PROVE_RCU.
 	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	set_task_cpu(p, cpu);
+	__set_task_cpu(p, cpu);
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -2515,7 +2515,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
-	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	/* Post initialize new task's util average when its cfs_rq is set */
 	post_init_entity_util_avg(&p->se);
@@ -7715,6 +7715,35 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+/* Set task's runqueue and group
+ *     In case of a move between group, we update src and dst group
+ *     thanks to sched_class->task_move_group. Otherwise, we just need to set
+ *     runqueue and group pointers. The task will be attached to the runqueue
+ *     during its wake up.
+ */
+static void sched_set_group(struct task_struct *tsk, bool move)
+{
+	struct task_group *tg;
+
+	/*
+	 * All callers are synchronized by task_rq_lock(); we do not use RCU
+	 * which is pointless here. Thus, we pass "true" to task_css_check()
+	 * to prevent lockdep warnings.
+	 */
+	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (move && tsk->sched_class->task_move_group)
+		tsk->sched_class->task_move_group(tsk);
+	else
+#endif
+		set_task_rq(tsk, task_cpu(tsk));
+
+}
+
 /* change task's runqueue when it moves between groups.
  *	The caller of this function should have put the task in its new group
  *	by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
@@ -7722,7 +7751,6 @@ void sched_offline_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
-	struct task_group *tg;
 	int queued, running;
 	struct rq_flags rf;
 	struct rq *rq;
@@ -7737,22 +7765,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		put_prev_task(rq, tsk);
 
-	/*
-	 * All callers are synchronized by task_rq_lock(); we do not use RCU
-	 * which is pointless here. Thus, we pass "true" to task_css_check()
-	 * to prevent lockdep warnings.
-	 */
-	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
-			  struct task_group, css);
-	tg = autogroup_task_group(tsk, tg);
-	tsk->sched_task_group = tg;
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (tsk->sched_class->task_move_group)
-		tsk->sched_class->task_move_group(tsk);
-	else
-#endif
-		set_task_rq(tsk, task_cpu(tsk));
+	sched_set_group(tsk, true);
 
 	if (unlikely(running))
 		tsk->sched_class->set_curr_task(rq);
@@ -8182,7 +8195,14 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 
 static void cpu_cgroup_fork(struct task_struct *task)
 {
-	sched_move_task(task);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(task, &rf);
+
+	sched_set_group(task, false);
+
+	task_rq_unlock(rq, task, &rf);
 }
 
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
-- 
1.9.1


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 16:30       ` Vincent Guittot
@ 2016-06-16 17:17         ` Peter Zijlstra
  2016-06-16 18:57           ` Vincent Guittot
  2016-06-16 18:51         ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-06-16 17:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
> On Wednesday 15 Jun 2016 at 17:22:17 (+0200), Peter Zijlstra wrote:
> > On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
> > > I still have concerned with this change of the behavior that  attaches
> > > the task only when it is enqueued. The load avg of the task will not
> > > be decayed between the time we move it into its new group until its
> > > enqueue. With this change, a task's load can stay high whereas it has
> > > slept for the last couple of seconds. Then, its load and utilization
> > > is no more accounted anywhere in the mean time just because we have
> > > moved the task which will be enqueued on the same rq.
> > > A task should always be attached to a cfs_rq and its load/utilization
> > > should always be accounted on a cfs_rq and decayed for its sleep
> > > period
> > 
> > OK; so I think I agree with that. Does the below (completely untested,
> > hasn't even been near a compiler) look reasonable?
> > 
> > The general idea is to always attach to a cfs_rq -- through
> > post_init_entity_util_avg(). This covers both the new task isn't
> > attached yet and was never in the fair class to begin with issues.
> 
> Your patch ensures that a task will be attached to a cfs_rq and fix
> the issue raised by Yuyang because of se->avg.last_update_time = 0 at
> init.

> During the test the following message has raised  "BUG: using
> smp_processor_id() in preemptible [00000000] code: systemd/1" because
> of cfs_rq_util_change that is called in attach_entity_load_avg

But per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

that should be with rq->lock held !?

> With patch [1] for the init of cfs_rq side, all use cases will be
> covered regarding the issue linked to a last_update_time set to 0 at
> init [1] https://lkml.org/lkml/2016/5/30/508

Yes, I saw that patch. However, it felt strange to me to set
last_update_time to 1; would that not result in massive aging, because
now - last_update_time ends up as a giant delta?
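
(To see the scale of that delta, assuming the 32ms PELT half-life and an rq
clock that is already, say, one second past a last_update_time of 1, a quick
stand-alone calculation of how much such aging leaves behind:)

#include <math.h>
#include <stdio.h>

int main(void)
{
	double delta_ms = 1000.0;			/* a "giant delta": ~1s of aging */
	double left = pow(0.5, delta_ms / 32.0);	/* 32ms half-life */

	printf("fraction left after %.0fms: %.2e\n", delta_ms, left);	/* ~3.9e-10 */
	return 0;
}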

By doing attach_entity_load_avg() we actually add the initial values to
the cfs_rq avg and set last_update_time to the current time. Ensuring
'regular' contribution and progress.

> > That only leaves a tiny hole in fork() where the task is hashed but
> > hasn't yet passed through wake_up_new_task() in which someone can do
> > cgroup move on it. That is closed with TASK_NEW and can_attach()
> > refusing those tasks.
> > 
> 
> But a new fair task is still detached and attached from/to task_group
> with

> cgroup_post_fork()-->
    ss->fork(child)-->
      cpu_cgroup_fork()-->
        sched_move_task()-->
	  task_move_group_fair().

Blergh, I knew I missed something in there..

> cpu_cgroup_can_attach is not used in this path and sched_move_task do
> the move unconditionally for fair task.
> 
> With your patch, we still have the sequence
> 
> sched_fork()
>     set_task_cpu()
> cgroup_post_fork()--> ... --> task_move_group_fair()
>     detach_task_cfs_rq()
>     set_task_rq()
>     attach_task_cfs_rq()
> wake_up_new_task()
>     select_task_rq() can select a new cpu
>     set_task_cpu()
>         migrate_task_rq_fair if the new_cpu != task_cpu
>              remove_load()
>         __set_task_cpu
>     post_init_entity_util_avg
>         attach_task_cfs_rq()
>     activate_task
>         enqueue_task
> 
> In fact, cpu_cgroup_fork needs a small part of sched_move_task so we
> can just call this small part directly instead sched_move_task. And
> the task doesn't really migrate because it is not yet attached so we
> need the sequence:

> sched_fork()
>     __set_task_cpu()
> cgroup_post_fork()--> ... --> task_move_group_fair()
>     set_task_rq() to set task group and runqueue
> wake_up_new_task()
>     select_task_rq() can select a new cpu
>     __set_task_cpu
>     post_init_entity_util_avg
>         attach_task_cfs_rq()
>     activate_task
>         enqueue_task
> 
> The patch below on top of your patch, ensures that we follow the right sequence :
> 
> ---
>  kernel/sched/core.c | 60 +++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 40 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7895689a..a21e3dc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2373,7 +2373,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	 * Silence PROVE_RCU.
>  	 */
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	set_task_cpu(p, cpu);
> +	__set_task_cpu(p, cpu);

Indeed, this should not be calling migrate; we're setting the CPU for
the first time.

>  	if (p->sched_class->task_fork)
>  		p->sched_class->task_fork(p);
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> @@ -2515,7 +2515,7 @@ void wake_up_new_task(struct task_struct *p)
>  	 *  - cpus_allowed can change in the fork path
>  	 *  - any previously selected cpu might disappear through hotplug
>  	 */
> -	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> +	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));

Similarly here I suppose, since we're not yet properly attached as such.

>  #endif
>  	/* Post initialize new task's util average when its cfs_rq is set */
>  	post_init_entity_util_avg(&p->se);

You were indeed running an 'old' kernel, as this
post_init_entity_util_avg() call should be _after_ the __task_rq_lock().

> @@ -7715,6 +7715,35 @@ void sched_offline_group(struct task_group *tg)
>  	spin_unlock_irqrestore(&task_group_lock, flags);
>  }
>  
> +/* Set task's runqueue and group
> + *     In case of a move between group, we update src and dst group
> + *     thanks to sched_class->task_move_group. Otherwise, we just need to set
> + *     runqueue and group pointers. The task will be attached to the runqueue
> + *     during its wake up.

Broken comment style.

> + */
> +static void sched_set_group(struct task_struct *tsk, bool move)
> +{
> +	struct task_group *tg;
> +
> +	/*
> +	 * All callers are synchronized by task_rq_lock(); we do not use RCU
> +	 * which is pointless here. Thus, we pass "true" to task_css_check()
> +	 * to prevent lockdep warnings.
> +	 */
> +	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
> +			  struct task_group, css);
> +	tg = autogroup_task_group(tsk, tg);
> +	tsk->sched_task_group = tg;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +	if (move && tsk->sched_class->task_move_group)
> +		tsk->sched_class->task_move_group(tsk);
> +	else
> +#endif
> +		set_task_rq(tsk, task_cpu(tsk));
> +
> +}
> +
>  /* change task's runqueue when it moves between groups.
>   *	The caller of this function should have put the task in its new group
>   *	by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
> @@ -7722,7 +7751,6 @@ void sched_offline_group(struct task_group *tg)
>   */
>  void sched_move_task(struct task_struct *tsk)
>  {
> -	struct task_group *tg;
>  	int queued, running;
>  	struct rq_flags rf;
>  	struct rq *rq;
> @@ -7737,22 +7765,7 @@ void sched_move_task(struct task_struct *tsk)
>  	if (unlikely(running))
>  		put_prev_task(rq, tsk);
>  
> +	sched_set_group(tsk, true);
>  
>  	if (unlikely(running))
>  		tsk->sched_class->set_curr_task(rq);
> @@ -8182,7 +8195,14 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>  
>  static void cpu_cgroup_fork(struct task_struct *task)
>  {
> -	sched_move_task(task);
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	rq = task_rq_lock(task, &rf);
> +
> +	sched_set_group(task, false);
> +
> +	task_rq_unlock(rq, task, &rf);
>  }

Hmm, yeah, I think you're right.

Let me fold that.


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 16:30       ` Vincent Guittot
  2016-06-16 17:17         ` Peter Zijlstra
@ 2016-06-16 18:51         ` Peter Zijlstra
  2016-06-16 19:00           ` Vincent Guittot
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-06-16 18:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
> With patch [1] for the init of cfs_rq side, all use cases will be
> covered regarding the issue linked to a last_update_time set to 0 at
> init
> [1] https://lkml.org/lkml/2016/5/30/508

Aah, wait, now I get it :-)

Still, we should put cfs_rq_clock_task(cfs_rq) in it, not 1. And since
we now acquire rq->lock on init this should well be possible. Lemme sort
that.


* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 17:17         ` Peter Zijlstra
@ 2016-06-16 18:57           ` Vincent Guittot
  0 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-06-16 18:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 16 June 2016 at 19:17, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
>> On Wednesday 15 Jun 2016 at 17:22:17 (+0200), Peter Zijlstra wrote:
>> > On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
>> > > I still have concerned with this change of the behavior that  attaches
>> > > the task only when it is enqueued. The load avg of the task will not
>> > > be decayed between the time we move it into its new group until its
>> > > enqueue. With this change, a task's load can stay high whereas it has
>> > > slept for the last couple of seconds. Then, its load and utilization
>> > > is no more accounted anywhere in the mean time just because we have
>> > > moved the task which will be enqueued on the same rq.
>> > > A task should always be attached to a cfs_rq and its load/utilization
>> > > should always be accounted on a cfs_rq and decayed for its sleep
>> > > period
>> >
>> > OK; so I think I agree with that. Does the below (completely untested,
>> > hasn't even been near a compiler) look reasonable?
>> >
>> > The general idea is to always attach to a cfs_rq -- through
>> > post_init_entity_util_avg(). This covers both the new task isn't
>> > attached yet and was never in the fair class to begin with issues.
>>
>> Your patch ensures that a task will be attached to a cfs_rq and fix
>> the issue raised by Yuyang because of se->avg.last_update_time = 0 at
>> init.
>
>> During the test the following message has raised  "BUG: using
>> smp_processor_id() in preemptible [00000000] code: systemd/1" because
>> of cfs_rq_util_change that is called in attach_entity_load_avg
>
> But per commit:
>
>   b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")
>
> that should be with rq->lock held !?

Yes, you're right, I forgot to fetch the latest commits.

>
>> With patch [1] for the init of cfs_rq side, all use cases will be
>> covered regarding the issue linked to a last_update_time set to 0 at
>> init [1] https://lkml.org/lkml/2016/5/30/508
>
> Yes, I saw that patch. However, it felt strange to me to set
> last_update_time to 1, would that not result in a massive aging because
> now - last_update_time ends up as a giant delta?

Yes, it will, but it's similar to what can also happen when the other
tasks are enqueued in the cfs_rq. The cfs_rq->avg.last_update_time
can be really far from now for a cfs_rq that is idle.

>
> By doing attach_entity_load_avg() we actually add the initial values to
> the cfs_rq avg and set last_update_time to the current time. Ensuring
> 'regular' contribution and progress.
>
>> > That only leaves a tiny hole in fork() where the task is hashed but
>> > hasn't yet passed through wake_up_new_task() in which someone can do
>> > cgroup move on it. That is closed with TASK_NEW and can_attach()
>> > refusing those tasks.
>> >
>>
>> But a new fair task is still detached and attached from/to task_group
>> with
>
>> cgroup_post_fork()-->
>     ss->fork(child)-->
>       cpu_cgroup_fork()-->
>         sched_move_task()-->
>           task_move_group_fair().
>
> Blergh, I knew I missed something in there..
>
>> cpu_cgroup_can_attach is not used in this path and sched_move_task do
>> the move unconditionally for fair task.
>>
>> With your patch, we still have the sequence
>>
>> sched_fork()
>>     set_task_cpu()
>> cgroup_post_fork()--> ... --> task_move_group_fair()
>>     detach_task_cfs_rq()
>>     set_task_rq()
>>     attach_task_cfs_rq()
>> wake_up_new_task()
>>     select_task_rq() can select a new cpu
>>     set_task_cpu()
>>         migrate_task_rq_fair if the new_cpu != task_cpu
>>              remove_load()
>>         __set_task_cpu
>>     post_init_entity_util_avg
>>         attach_task_cfs_rq()
>>     activate_task
>>         enqueue_task
>>
>> In fact, cpu_cgroup_fork needs a small part of sched_move_task so we
>> can just call this small part directly instead sched_move_task. And
>> the task doesn't really migrate because it is not yet attached so we
>> need the sequence:
>
>> sched_fork()
>>     __set_task_cpu()
>> cgroup_post_fork()--> ... --> task_move_group_fair()
>>     set_task_rq() to set task group and runqueue
>> wake_up_new_task()
>>     select_task_rq() can select a new cpu
>>     __set_task_cpu
>>     post_init_entity_util_avg
>>         attach_task_cfs_rq()
>>     activate_task
>>         enqueue_task
>>
>> The patch below on top of your patch, ensures that we follow the right sequence :
>>
>> ---
>>  kernel/sched/core.c | 60 +++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 40 insertions(+), 20 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 7895689a..a21e3dc 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2373,7 +2373,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>>        * Silence PROVE_RCU.
>>        */
>>       raw_spin_lock_irqsave(&p->pi_lock, flags);
>> -     set_task_cpu(p, cpu);
>> +     __set_task_cpu(p, cpu);
>
> Indeed, this should not be calling migrate, we're setting the CPU for
> the first time.
>
>>       if (p->sched_class->task_fork)
>>               p->sched_class->task_fork(p);
>>       raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>> @@ -2515,7 +2515,7 @@ void wake_up_new_task(struct task_struct *p)
>>        *  - cpus_allowed can change in the fork path
>>        *  - any previously selected cpu might disappear through hotplug
>>        */
>> -     set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> +     __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>
> Similarly here I suppose, since we're not yet properly attached as such.
>
>>  #endif
>>       /* Post initialize new task's util average when its cfs_rq is set */
>>       post_init_entity_util_avg(&p->se);
>
> You were indeed running an 'old' kernel, as this
> post_init_entity_util_avg() call should be _after_ the __task_rq_lock().
>
>> @@ -7715,6 +7715,35 @@ void sched_offline_group(struct task_group *tg)
>>       spin_unlock_irqrestore(&task_group_lock, flags);
>>  }
>>
>> +/* Set task's runqueue and group
>> + *     In case of a move between group, we update src and dst group
>> + *     thanks to sched_class->task_move_group. Otherwise, we just need to set
>> + *     runqueue and group pointers. The task will be attached to the runqueue
>> + *     during its wake up.
>
> Broken comment style.
>
>> + */
>> +static void sched_set_group(struct task_struct *tsk, bool move)
>> +{
>> +     struct task_group *tg;
>> +
>> +     /*
>> +      * All callers are synchronized by task_rq_lock(); we do not use RCU
>> +      * which is pointless here. Thus, we pass "true" to task_css_check()
>> +      * to prevent lockdep warnings.
>> +      */
>> +     tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
>> +                       struct task_group, css);
>> +     tg = autogroup_task_group(tsk, tg);
>> +     tsk->sched_task_group = tg;
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +     if (move && tsk->sched_class->task_move_group)
>> +             tsk->sched_class->task_move_group(tsk);
>> +     else
>> +#endif
>> +             set_task_rq(tsk, task_cpu(tsk));
>> +
>> +}
>> +
>>  /* change task's runqueue when it moves between groups.
>>   *   The caller of this function should have put the task in its new group
>>   *   by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
>> @@ -7722,7 +7751,6 @@ void sched_offline_group(struct task_group *tg)
>>   */
>>  void sched_move_task(struct task_struct *tsk)
>>  {
>> -     struct task_group *tg;
>>       int queued, running;
>>       struct rq_flags rf;
>>       struct rq *rq;
>> @@ -7737,22 +7765,7 @@ void sched_move_task(struct task_struct *tsk)
>>       if (unlikely(running))
>>               put_prev_task(rq, tsk);
>>
>> +     sched_set_group(tsk, true);
>>
>>       if (unlikely(running))
>>               tsk->sched_class->set_curr_task(rq);
>> @@ -8182,7 +8195,14 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>>
>>  static void cpu_cgroup_fork(struct task_struct *task)
>>  {
>> -     sched_move_task(task);
>> +     struct rq_flags rf;
>> +     struct rq *rq;
>> +
>> +     rq = task_rq_lock(task, &rf);
>> +
>> +     sched_set_group(task, false);
>> +
>> +     task_rq_unlock(rq, task, &rf);
>>  }
>
> Hmm, yeah, I think you're right.
>
> Let me fold that.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 18:51         ` Peter Zijlstra
@ 2016-06-16 19:00           ` Vincent Guittot
  2016-06-16 20:07             ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2016-06-16 19:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 16 June 2016 at 20:51, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
>> With patch [1] for the init of cfs_rq side, all use cases will be
>> covered regarding the issue linked to a last_update_time set to 0 at
>> init
>> [1] https://lkml.org/lkml/2016/5/30/508
>
> Aah, wait, now I get it :-)
>
> Still, we should put cfs_rq_clock_task(cfs_rq) in it, not 1. And since
> we now acquire rq->lock on init this should well be possible. Lemme sort
> that.

Yes, with the rq->lock held we can use cfs_rq_clock_task(), which makes
more sense than 1.
But the delta can still be significant between the creation of the
task group and the first task that is attached to the cfs_rq.
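
The suggested init-side change then amounts to roughly the sketch below
(illustrative only; the helper name is made up and this is not the
referenced patch):

/*
 * Illustrative sketch, not the actual patch: stamp a newly created group
 * cfs_rq with the task clock so a later attach never sees
 * avg.last_update_time == 0.  As noted above, the delta between group
 * creation and the first attached task can still be large.
 */
static void init_cfs_rq_avg_stamp(struct cfs_rq *cfs_rq)	/* hypothetical name */
{
	cfs_rq->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
}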

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 19:00           ` Vincent Guittot
@ 2016-06-16 20:07             ` Peter Zijlstra
  2016-06-16 21:21               ` Vincent Guittot
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2016-06-16 20:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Thu, Jun 16, 2016 at 09:00:57PM +0200, Vincent Guittot wrote:
> On 16 June 2016 at 20:51, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
> >> With patch [1] for the init of cfs_rq side, all use cases will be
> >> covered regarding the issue linked to a last_update_time set to 0 at
> >> init
> >> [1] https://lkml.org/lkml/2016/5/30/508
> >
> > Aah, wait, now I get it :-)
> >
> > Still, we should put cfs_rq_clock_task(cfs_rq) in it, not 1. And since
> > we now acquire rq->lock on init this should well be possible. Lemme sort
> > that.
> 
> yes with the rq->lock we can use cfs_rq_clock_task which is make more
> sense than 1.
> But the delta can be still significant between the creation of the
> task group and the 1st task that will be attach to the cfs_rq

Ah, I think I've spotted more fail.

And I think you're right, it doesn't matter, in fact, 0 should have been
fine too!

enqueue_entity()
  enqueue_entity_load_avg()
    update_cfs_rq_load_avg()
      now = clock()
      __update_load_avg(&cfs_rq->avg)
        cfs_rq->avg.last_load_update = now
        // ages 0 load/util for: now - 0
    if (migrated)
      attach_entity_load_avg()
        se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0

So I don't see how it can end up being attached again.
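
(For reference, the "migrated" test in the enqueue path boils down to checking
for a zero stamp; a condensed sketch, not the verbatim fair.c code:)

/*
 * Condensed sketch, not verbatim fair.c: a zero se->avg.last_update_time
 * is what makes the enqueue path treat the entity as newly arrived and
 * attach it to the cfs_rq.
 */
static void enqueue_entity_load_avg_sketch(struct cfs_rq *cfs_rq,
					   struct sched_entity *se)
{
	u64 now = cfs_rq_clock_task(cfs_rq);
	int migrated = !se->avg.last_update_time;	/* never attached here yet */

	update_cfs_rq_load_avg(now, cfs_rq, true);	/* ages cfs_rq->avg up to now */

	if (migrated)
		attach_entity_load_avg(cfs_rq, se);	/* stamps se with cfs_rq's time */
}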


Now I do see another problem, and that is that we're forgetting to call
update_cfs_rq_load_avg() in all detach_entity_load_avg() callers, and in
all but the enqueue caller of attach_entity_load_avg().

Something like the below.



---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f75930bdd326..5d8fa135bbc5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8349,6 +8349,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	if (!vruntime_normalized(p)) {
 		/*
@@ -8360,6 +8361,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
+	update_cfs_rq_load_avg(now, cfs_rq, false);
 	detach_entity_load_avg(cfs_rq, se);
 }
 
@@ -8367,6 +8369,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -8377,6 +8380,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 #endif
 
 	/* Synchronize task with its cfs_rq */
+	update_cfs_rq_load_avg(now, cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 
 	if (!vruntime_normalized(p))

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 20:07             ` Peter Zijlstra
@ 2016-06-16 21:21               ` Vincent Guittot
  2016-06-17  2:12                 ` Yuyang Du
                                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-06-16 21:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 16 June 2016 at 22:07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 16, 2016 at 09:00:57PM +0200, Vincent Guittot wrote:
>> On 16 June 2016 at 20:51, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
>> >> With patch [1] for the init of cfs_rq side, all use cases will be
>> >> covered regarding the issue linked to a last_update_time set to 0 at
>> >> init
>> >> [1] https://lkml.org/lkml/2016/5/30/508
>> >
>> > Aah, wait, now I get it :-)
>> >
>> > Still, we should put cfs_rq_clock_task(cfs_rq) in it, not 1. And since
>> > we now acquire rq->lock on init this should well be possible. Lemme sort
>> > that.
>>
>> yes with the rq->lock we can use cfs_rq_clock_task which is make more
>> sense than 1.
>> But the delta can be still significant between the creation of the
>> task group and the 1st task that will be attach to the cfs_rq
>
> Ah, I think I've spotted more fail.
>
> And I think you're right, it doesn't matter, in fact, 0 should have been
> fine too!
>
> enqueue_entity()
>   enqueue_entity_load_avg()
>     update_cfs_rq_load_avg()
>       now = clock()
>       __update_load_avg(&cfs_rq->avg)
>         cfs_rq->avg.last_load_update = now
>         // ages 0 load/util for: now - 0
>     if (migrated)
>       attach_entity_load_avg()
>         se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0
>
> So I don't see how it can end up being attached again.

In fact, it has already been attached during sched_move_task(). The
sequence for the first task that is attached to a cfs_rq is:

sched_move_task()
  task_move_group_fair()
    detach_task_cfs_rq()
    set_task_rq()
    attach_task_cfs_rq()
      attach_entity_load_avg()
        se->avg.last_load_update = cfs_rq->avg.last_load_update == 0

Then we enqueue the task, but se->avg.last_load_update is still 0, so
migrated is set and we attach the task one more time.

>
>
> Now I do see another problem, and that is that we're forgetting to
> update_cfs_rq_load_avg() in all detach_entity_load_avg() callers and all
> but the enqueue caller of attach_entity_load_avg().

Yes, calling update_cfs_rq_load_avg() before every attach_entity_load_avg()
will ensure that cfs_rq->avg.last_load_update is never 0 when we attach
a task.
And doing it before the detach will ensure that we move an up-to-date load.

Your proposal below looks good to me

>
> Something like the below.
>
>
>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f75930bdd326..5d8fa135bbc5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8349,6 +8349,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
>  {
>         struct sched_entity *se = &p->se;
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       u64 now = cfs_rq_clock_task(cfs_rq);
>
>         if (!vruntime_normalized(p)) {
>                 /*
> @@ -8360,6 +8361,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
>         }
>
>         /* Catch up with the cfs_rq and remove our load when we leave */
> +       update_cfs_rq_load_avg(now, cfs_rq, false);
>         detach_entity_load_avg(cfs_rq, se);
>  }
>
> @@ -8367,6 +8369,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
>  {
>         struct sched_entity *se = &p->se;
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       u64 now = cfs_rq_clock_task(cfs_rq);
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         /*
> @@ -8377,6 +8380,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
>  #endif
>
>         /* Synchronize task with its cfs_rq */
> +       update_cfs_rq_load_avg(now, cfs_rq, false);
>         attach_entity_load_avg(cfs_rq, se);
>
>         if (!vruntime_normalized(p))

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 21:21               ` Vincent Guittot
@ 2016-06-17  2:12                 ` Yuyang Du
  2016-06-17 12:00                   ` Vincent Guittot
  2016-06-17  9:48                 ` Peter Zijlstra
  2016-06-17 11:31                 ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Yuyang Du @ 2016-06-17  2:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Thu, Jun 16, 2016 at 11:21:55PM +0200, Vincent Guittot wrote:
> On 16 June 2016 at 22:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Jun 16, 2016 at 09:00:57PM +0200, Vincent Guittot wrote:
> >> On 16 June 2016 at 20:51, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
> >> >> With patch [1] for the init of cfs_rq side, all use cases will be
> >> >> covered regarding the issue linked to a last_update_time set to 0 at
> >> >> init
> >> >> [1] https://lkml.org/lkml/2016/5/30/508
> >> >
> >> > Aah, wait, now I get it :-)
> >> >
> >> > Still, we should put cfs_rq_clock_task(cfs_rq) in it, not 1. And since
> >> > we now acquire rq->lock on init this should well be possible. Lemme sort
> >> > that.
> >>
> >> yes with the rq->lock we can use cfs_rq_clock_task which is make more
> >> sense than 1.
> >> But the delta can be still significant between the creation of the
> >> task group and the 1st task that will be attach to the cfs_rq
> >
> > Ah, I think I've spotted more fail.
> >
> > And I think you're right, it doesn't matter, in fact, 0 should have been
> > fine too!
> >
> > enqueue_entity()
> >   enqueue_entity_load_avg()
> >     update_cfs_rq_load_avg()
> >       now = clock()
> >       __update_load_avg(&cfs_rq->avg)
> >         cfs_rq->avg.last_load_update = now
> >         // ages 0 load/util for: now - 0
> >     if (migrated)
> >       attach_entity_load_avg()
> >         se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0
> >
> > So I don't see how it can end up being attached again.
> 
> In fact it has already been attached during the sched_move_task. The
> sequence for the 1st task that is attached to a cfs_rq is :
> 
> sched_move_task()
>   task_move_group_fair()
>     detach_task_cfs_rq()
>     set_task_rq()
>     attach_task_cfs_rq()
>       attach_entity_load_avg()
>         se->avg.last_load_update = cfs_rq->avg.last_load_update == 0
> 

Then again, does this fix it?

static void task_move_group_fair(struct task_struct *p)
{
        detach_task_cfs_rq(p);
        set_task_rq(p, task_cpu(p));
        attach_task_cfs_rq(p);
        /*
         * If the cfs_rq's last_update_time is 0, attaching the sched avgs
         * won't do anything useful, as they will be decayed to 0 when any
         * sched_entity is enqueued to that cfs_rq.
         *
         * In that case we must instead reset the task's last_update_time,
         * to ensure the sched avgs will be attached when the task is
         * enqueued.
         */
        if (!cfs_rq_of(&p->se)->avg.last_update_time)
                reset_task_last_update_time(p);
        else
                attach_entity_load_avg(cfs_rq_of(&p->se), &p->se);
}
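
(reset_task_last_update_time() above refers to a helper from the earlier
series; presumably something along these lines:)

/*
 * Presumed shape of the helper referenced above (from the earlier series,
 * not shown in this thread): clear the stamp so the next enqueue treats
 * the task as not yet attached and performs the attach then.
 */
static void reset_task_last_update_time(struct task_struct *p)
{
	p->se.avg.last_update_time = 0;
}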

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 21:21               ` Vincent Guittot
  2016-06-17  2:12                 ` Yuyang Du
@ 2016-06-17  9:48                 ` Peter Zijlstra
  2016-06-17 11:31                 ` Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2016-06-17  9:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Thu, Jun 16, 2016 at 11:21:55PM +0200, Vincent Guittot wrote:
> On 16 June 2016 at 22:07, Peter Zijlstra <peterz@infradead.org> wrote:

> > So I don't see how it can end up being attached again.
> 
> In fact it has already been attached during the sched_move_task. The
> sequence for the 1st task that is attached to a cfs_rq is :
> 
> sched_move_task()
>   task_move_group_fair()
>     detach_task_cfs_rq()
>     set_task_rq()
>     attach_task_cfs_rq()
>       attach_entity_load_avg()
>         se->avg.last_load_update = cfs_rq->avg.last_load_update == 0
> 
> Then we enqueue the task

> > enqueue_entity()
> >   enqueue_entity_load_avg()
> >     update_cfs_rq_load_avg()
> >       now = clock()
> >       __update_load_avg(&cfs_rq->avg)
> >         cfs_rq->avg.last_load_update = now
> >         // ages 0 load/util for: now - 0
> >     if (migrated)
> >       attach_entity_load_avg()
> >         se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0

> but se->avg.last_load_update is still 0 so
> migrated is set and we attach the task one more time

Ok, we indeed attach twice here; however, I don't actually see it being
much of a problem, because the aging during the enqueue will age the
entire first attach right back down to 0 again; now-0 is a _huge_ delta
and will flatten everything right down to 0.

Of course, that also ages the new entity right back down to 0, making it
appear as if it has no load whatsoever.

But yes, /me puts in changelog.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-16 21:21               ` Vincent Guittot
  2016-06-17  2:12                 ` Yuyang Du
  2016-06-17  9:48                 ` Peter Zijlstra
@ 2016-06-17 11:31                 ` Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2016-06-17 11:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Thu, Jun 16, 2016 at 11:21:55PM +0200, Vincent Guittot wrote:
> Your proposal below looks good to me

> > ---
> >  kernel/sched/fair.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f75930bdd326..5d8fa135bbc5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8349,6 +8349,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
> >  {
> >         struct sched_entity *se = &p->se;
> >         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +       u64 now = cfs_rq_clock_task(cfs_rq);
> >
> >         if (!vruntime_normalized(p)) {
> >                 /*
> > @@ -8360,6 +8361,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
> >         }
> >
> >         /* Catch up with the cfs_rq and remove our load when we leave */
> > +       update_cfs_rq_load_avg(now, cfs_rq, false);
> >         detach_entity_load_avg(cfs_rq, se);
> >  }
> >
> > @@ -8367,6 +8369,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
> >  {
> >         struct sched_entity *se = &p->se;
> >         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +       u64 now = cfs_rq_clock_task(cfs_rq);
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >         /*
> > @@ -8377,6 +8380,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
> >  #endif
> >
> >         /* Synchronize task with its cfs_rq */
> > +       update_cfs_rq_load_avg(now, cfs_rq, false);
> >         attach_entity_load_avg(cfs_rq, se);
> >
> >         if (!vruntime_normalized(p))

Should we also call update_tg_load_avg() when update_cfs_rq_load_avg()
returns true? Most call sites seem to do this.
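
(The pattern at those call sites is roughly the following, condensed into an
illustrative wrapper:)

/*
 * Condensed sketch of the pattern most call sites follow: only poke the
 * task group averages when update_cfs_rq_load_avg() reports a change.
 */
static void sync_cfs_rq_and_tg(struct cfs_rq *cfs_rq)	/* illustrative name */
{
	u64 now = cfs_rq_clock_task(cfs_rq);

	if (update_cfs_rq_load_avg(now, cfs_rq, true))
		update_tg_load_avg(cfs_rq, 0);
}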

Someone should document these things somewhere....

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group
  2016-06-17  2:12                 ` Yuyang Du
@ 2016-06-17 12:00                   ` Vincent Guittot
  0 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2016-06-17 12:00 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 17 June 2016 at 04:12, Yuyang Du <yuyang.du@intel.com> wrote:
>
> On Thu, Jun 16, 2016 at 11:21:55PM +0200, Vincent Guittot wrote:
> > On 16 June 2016 at 22:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Thu, Jun 16, 2016 at 09:00:57PM +0200, Vincent Guittot wrote:
> > >> On 16 June 2016 at 20:51, Peter Zijlstra <peterz@infradead.org> wrote:
> > >> > On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
> > >> >> With patch [1] for the init of cfs_rq side, all use cases will be
> > >> >> covered regarding the issue linked to a last_update_time set to 0 at
> > >> >> init
> > >> >> [1] https://lkml.org/lkml/2016/5/30/508
> > >> >
> > >> > Aah, wait, now I get it :-)
> > >> >
> > >> > Still, we should put cfs_rq_clock_task(cfs_rq) in it, not 1. And since
> > >> > we now acquire rq->lock on init this should well be possible. Lemme sort
> > >> > that.
> > >>
> > >> yes with the rq->lock we can use cfs_rq_clock_task which is make more
> > >> sense than 1.
> > >> But the delta can be still significant between the creation of the
> > >> task group and the 1st task that will be attach to the cfs_rq
> > >
> > > Ah, I think I've spotted more fail.
> > >
> > > And I think you're right, it doesn't matter, in fact, 0 should have been
> > > fine too!
> > >
> > > enqueue_entity()
> > >   enqueue_entity_load_avg()
> > >     update_cfs_rq_load_avg()
> > >       now = clock()
> > >       __update_load_avg(&cfs_rq->avg)
> > >         cfs_rq->avg.last_load_update = now
> > >         // ages 0 load/util for: now - 0
> > >     if (migrated)
> > >       attach_entity_load_avg()
> > >         se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0
> > >
> > > So I don't see how it can end up being attached again.
> >
> > In fact it has already been attached during the sched_move_task. The
> > sequence for the 1st task that is attached to a cfs_rq is :
> >
> > sched_move_task()
> >   task_move_group_fair()
> >     detach_task_cfs_rq()
> >     set_task_rq()
> >     attach_task_cfs_rq()
> >       attach_entity_load_avg()
> >         se->avg.last_load_update = cfs_rq->avg.last_load_update == 0
> >
>
> Then again, does this fix it?

Peter's proposal to update the cfs_rq before attaching/detaching the
task fixes the problem.

>
>
> static void task_move_group_fair(struct task_struct *p)
> {
>         detach_task_cfs_rq(p);
>         set_task_rq(p, task_cpu(p));
>         attach_task_cfs_rq(p);
>         /*
>          * If the cfs_rq's last_update_time is 0, attach the sched avgs
>          * won't be anything useful, as it will be decayed to 0 when any
>          * sched_entity is enqueued to that cfs_rq.
>          *
>          * On the other hand, if the cfs_rq's last_update_time is 0, we
>          * must reset the task's last_update_time to ensure we will attach
>          * the sched avgs when the task is enqueued.
>          */
>         if (!cfs_rq_of(&p->se)->avg.last_update_time)
>                 reset_task_last_update_time(p);
>         else
>                 attach_entity_load_avg(cfs_rq_of(&p->se), &p->se);
> }

^ permalink raw reply	[flat|nested] 21+ messages in thread

Thread overview: 21+ messages
2016-06-14 22:21 [PATCH v6 0/4] sched/fair: Fix attach and detach sched avgs for task group change and sched class change Yuyang Du
2016-06-14 22:21 ` [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice when switching to fair or changing task group Yuyang Du
2016-06-15  7:46   ` Vincent Guittot
2016-06-15  0:18     ` Yuyang Du
2016-06-15 14:15       ` Vincent Guittot
2016-06-15 15:22     ` Peter Zijlstra
2016-06-16  1:00       ` Yuyang Du
2016-06-16 16:30       ` Vincent Guittot
2016-06-16 17:17         ` Peter Zijlstra
2016-06-16 18:57           ` Vincent Guittot
2016-06-16 18:51         ` Peter Zijlstra
2016-06-16 19:00           ` Vincent Guittot
2016-06-16 20:07             ` Peter Zijlstra
2016-06-16 21:21               ` Vincent Guittot
2016-06-17  2:12                 ` Yuyang Du
2016-06-17 12:00                   ` Vincent Guittot
2016-06-17  9:48                 ` Peter Zijlstra
2016-06-17 11:31                 ` Peter Zijlstra
2016-06-14 22:21 ` [PATCH v6 2/4] sched/fair: Move load and util avgs from wake_up_new_task() to sched_fork() Yuyang Du
2016-06-14 22:21 ` [PATCH v6 3/4] sched/fair: Skip detach sched avgs for new task when changing task groups Yuyang Du
2016-06-14 22:21 ` [PATCH v6 4/4] sched/fair: Add inline to detach_entity_load_evg() Yuyang Du
