* [PATCH 0/4] sched/fair: Fix PELT wobblies
@ 2016-06-17 12:01 Peter Zijlstra
  2016-06-17 12:01 ` [PATCH 1/4] sched: Optimize fork() paths Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 12:01 UTC (permalink / raw)
  To: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot
  Cc: Peter Zijlstra

These patches should hopefully fix the pesky new task/group double-attach
wobblies for which a number of patch sets have been floating about.

They have had minimal testing; they boot on x86_64 and I could create a cgroup
and stuff my bash in it.

Please have a close look.

Many thanks to Vincent and Yuyang.

* [PATCH 1/4] sched: Optimize fork() paths
  2016-06-17 12:01 [PATCH 0/4] sched/fair: Fix PELT wobblies Peter Zijlstra
@ 2016-06-17 12:01 ` Peter Zijlstra
  2016-06-17 12:01 ` [PATCH 2/4] sched/fair: Fix PELT integrity for new groups Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 12:01 UTC (permalink / raw)
  To: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot
  Cc: Peter Zijlstra

[-- Attachment #1: peterz-sched-fork-1.patch --]
[-- Type: text/plain, Size: 2739 bytes --]

The task_fork_fair() callback already calls __set_task_cpu() and takes
rq->lock.

If we move the sched_class::task_fork callback in sched_fork() under
the existing p->pi_lock, right after its set_task_cpu() call, we can
avoid the duplicate __set_task_cpu() call and omit the IRQ disabling on
the rq->lock.

Also change the set_task_cpu() call to __set_task_cpu() to skip the
migration bits; this is a new task, not a migration.
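
In effect, the fork path ends up doing the following (a simplified sketch of
the resulting sched_fork() ordering; see the diff below):

  raw_spin_lock_irqsave(&p->pi_lock, flags);
  __set_task_cpu(p, cpu);                 /* new task: no migration callbacks */
  if (p->sched_class->task_fork)
          p->sched_class->task_fork(p);   /* IRQs already off; rq->lock needs no irqsave */
  raw_spin_unlock_irqrestore(&p->pi_lock, flags);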

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    7 +++----
 kernel/sched/fair.c |   27 ++++++---------------------
 2 files changed, 9 insertions(+), 25 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2381,9 +2381,6 @@ int sched_fork(unsigned long clone_flags
 		p->sched_class = &fair_sched_class;
 	}
 
-	if (p->sched_class->task_fork)
-		p->sched_class->task_fork(p);
-
 	/*
 	 * The child is not yet in the pid-hash so no cgroup attach races,
 	 * and the cgroup is pinned to this child due to cgroup_fork()
@@ -2392,7 +2389,9 @@ int sched_fork(unsigned long clone_flags
 	 * Silence PROVE_RCU.
 	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	set_task_cpu(p, cpu);
+	__set_task_cpu(p, cpu);
+	if (p->sched_class->task_fork)
+		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 #ifdef CONFIG_SCHED_INFO
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4418,7 +4418,7 @@ enqueue_task_fair(struct rq *rq, struct
 		 *
 		 * note: in the case of encountering a throttled cfs_rq we will
 		 * post the final h_nr_running increment below.
-		*/
+		 */
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 		cfs_rq->h_nr_running++;
@@ -8255,31 +8255,17 @@ static void task_fork_fair(struct task_s
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se, *curr;
-	int this_cpu = smp_processor_id();
 	struct rq *rq = this_rq();
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
 
+	raw_spin_lock(&rq->lock);
 	update_rq_clock(rq);
 
 	cfs_rq = task_cfs_rq(current);
 	curr = cfs_rq->curr;
-
-	/*
-	 * Not only the cpu but also the task_group of the parent might have
-	 * been changed after parent->se.parent,cfs_rq were copied to
-	 * child->se.parent,cfs_rq. So call __set_task_cpu() to make those
-	 * of child point to valid ones.
-	 */
-	rcu_read_lock();
-	__set_task_cpu(p, this_cpu);
-	rcu_read_unlock();
-
-	update_curr(cfs_rq);
-
-	if (curr)
+	if (curr) {
+		update_curr(cfs_rq);
 		se->vruntime = curr->vruntime;
+	}
 	place_entity(cfs_rq, se, 1);
 
 	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
@@ -8292,8 +8278,7 @@ static void task_fork_fair(struct task_s
 	}
 
 	se->vruntime -= cfs_rq->min_vruntime;
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock(&rq->lock);
 }
 
 /*

* [PATCH 2/4] sched/fair: Fix PELT integrity for new groups
  2016-06-17 12:01 [PATCH 0/4] sched/fair: Fix PELT wobblies Peter Zijlstra
  2016-06-17 12:01 ` [PATCH 1/4] sched: Optimize fork() paths Peter Zijlstra
@ 2016-06-17 12:01 ` Peter Zijlstra
  2016-06-17 13:51   ` Vincent Guittot
  2016-06-17 12:01 ` [PATCH 3/4] sched,cgroup: Fix cpu_cgroup_fork() Peter Zijlstra
  2016-06-17 12:01 ` [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks Peter Zijlstra
  3 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 12:01 UTC (permalink / raw)
  To: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot
  Cc: Peter Zijlstra

[-- Attachment #1: peterz-sched-fork-4.patch --]
[-- Type: text/plain, Size: 2417 bytes --]

Vincent reported that when a new task is moved into a new cgroup it
gets attached twice to the load tracking.

  sched_move_task()
    task_move_group_fair()
      detach_task_cfs_rq()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
          se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0

  enqueue_entity()
    enqueue_entity_load_avg()
      update_cfs_rq_load_avg()
        now = clock()
        __update_load_avg(&cfs_rq->avg)
          cfs_rq->avg.last_load_update = now
          // ages load/util for: now - 0, load/util -> 0
      if (migrated)
        attach_entity_load_avg()
          se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0

The problem is that we don't update cfs_rq load_avg before all
entity attach/detach operations. Only enqueue and migrate_task do
this.

With this fixed, the above will not happen, because the
sched_move_task() attach will have updated the cfs_rq's last_load_update
time before attaching, and in turn the attach will set the entity's
last_load_update stamp.
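
Concretely, the pattern the patch applies in detach_task_cfs_rq() and
attach_task_cfs_rq() boils down to (minimal sketch; see the diff below):

  u64 now = cfs_rq_clock_task(cfs_rq);

  update_cfs_rq_load_avg(now, cfs_rq, false);   /* age cfs_rq->avg up to 'now' first */
  attach_entity_load_avg(cfs_rq, se);           /* entity copies a non-zero timestamp */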

Note that there is a further problem with sched_move_task() calling
detach on a task that hasn't yet been attached; this will be taken
care of in a subsequent patch.

Cc: Yuyang Du <yuyang.du@intel.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8366,6 +8366,7 @@ static void detach_task_cfs_rq(struct ta
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	if (!vruntime_normalized(p)) {
 		/*
@@ -8377,6 +8378,7 @@ static void detach_task_cfs_rq(struct ta
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
+	update_cfs_rq_load_avg(now, cfs_rq, false);
 	detach_entity_load_avg(cfs_rq, se);
 }
 
@@ -8384,6 +8386,7 @@ static void attach_task_cfs_rq(struct ta
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -8394,6 +8397,7 @@ static void attach_task_cfs_rq(struct ta
 #endif
 
 	/* Synchronize task with its cfs_rq */
+	update_cfs_rq_load_avg(now, cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 
 	if (!vruntime_normalized(p))

* [PATCH 3/4] sched,cgroup: Fix cpu_cgroup_fork()
  2016-06-17 12:01 [PATCH 0/4] sched/fair: Fix PELT wobblies Peter Zijlstra
  2016-06-17 12:01 ` [PATCH 1/4] sched: Optimize fork() paths Peter Zijlstra
  2016-06-17 12:01 ` [PATCH 2/4] sched/fair: Fix PELT integrity for new groups Peter Zijlstra
@ 2016-06-17 12:01 ` Peter Zijlstra
  2016-06-17 13:58   ` Vincent Guittot
  2016-06-17 12:01 ` [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks Peter Zijlstra
  3 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 12:01 UTC (permalink / raw)
  To: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot
  Cc: Peter Zijlstra

[-- Attachment #1: vincent-fork-1.patch --]
[-- Type: text/plain, Size: 3982 bytes --]

From: Vincent Guittot <vincent.guittot@linaro.org>

A new fair task is detached and attached from/to task_group with:

  cgroup_post_fork()
    ss->fork(child) := cpu_cgroup_fork()
      sched_move_task()
        task_move_group_fair()

This is wrong, because at this point in fork() the task isn't fully
initialized and it cannot 'move' to another group, since it is not yet
attached to any group.

In fact, cpu_cgroup_fork() only needs a small part of sched_move_task(),
so we can just call that part directly instead of sched_move_task(). And
the task doesn't really migrate because it is not yet attached, so we
need the sequence:

  do_fork()
    sched_fork()
      __set_task_cpu()

    cgroup_post_fork()
      set_task_rq() # set task group and runqueue

    wake_up_new_task()
      select_task_rq() can select a new cpu
      __set_task_cpu
      post_init_entity_util_avg
        attach_task_cfs_rq()
      activate_task
        enqueue_task

This patch makes that happen.
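
The resulting cpu_cgroup_fork() then only sets the group/runqueue pointers,
roughly (simplified from the diff below):

  static void cpu_cgroup_fork(struct task_struct *task)
  {
          struct rq_flags rf;
          struct rq *rq;

          rq = task_rq_lock(task, &rf);
          sched_set_group(task, false);   /* no task_move_group(): nothing is attached yet */
          task_rq_unlock(rq, task, &rf);
  }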

Maybe-Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   67 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 47 insertions(+), 20 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7743,27 +7743,17 @@ void sched_offline_group(struct task_gro
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-/* change task's runqueue when it moves between groups.
- *	The caller of this function should have put the task in its new group
- *	by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
- *	reflect its new group.
+/*
+ * Set task's runqueue and group.
+ *
+ * In case of a move between group, we update src and dst group thanks to
+ * sched_class->task_move_group. Otherwise, we just need to set runqueue and
+ * group pointers. The task will be attached to the runqueue during its wake
+ * up.
  */
-void sched_move_task(struct task_struct *tsk)
+static void sched_set_group(struct task_struct *tsk, bool move)
 {
 	struct task_group *tg;
-	int queued, running;
-	struct rq_flags rf;
-	struct rq *rq;
-
-	rq = task_rq_lock(tsk, &rf);
-
-	running = task_current(rq, tsk);
-	queued = task_on_rq_queued(tsk);
-
-	if (queued)
-		dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE);
-	if (unlikely(running))
-		put_prev_task(rq, tsk);
 
 	/*
 	 * All callers are synchronized by task_rq_lock(); we do not use RCU
@@ -7776,11 +7766,37 @@ void sched_move_task(struct task_struct
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (tsk->sched_class->task_move_group)
+	if (move && tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk);
 	else
 #endif
 		set_task_rq(tsk, task_cpu(tsk));
+}
+
+/*
+ * Change task's runqueue when it moves between groups.
+ *
+ * The caller of this function should have put the task in its new group by
+ * now. This function just updates tsk->se.cfs_rq and tsk->se.parent to reflect
+ * its new group.
+ */
+void sched_move_task(struct task_struct *tsk)
+{
+	int queued, running;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(tsk, &rf);
+
+	running = task_current(rq, tsk);
+	queued = task_on_rq_queued(tsk);
+
+	if (queued)
+		dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE);
+	if (unlikely(running))
+		put_prev_task(rq, tsk);
+
+	sched_set_group(tsk, true);
 
 	if (unlikely(running))
 		tsk->sched_class->set_curr_task(rq);
@@ -8208,9 +8224,20 @@ static void cpu_cgroup_css_free(struct c
 	sched_free_group(tg);
 }
 
+/*
+ * This is called before wake_up_new_task(), therefore we really only
+ * have to set its group bits, all the other stuff does not apply.
+ */
 static void cpu_cgroup_fork(struct task_struct *task)
 {
-	sched_move_task(task);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(task, &rf);
+
+	sched_set_group(task, false);
+
+	task_rq_unlock(rq, task, &rf);
 }
 
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)

* [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 12:01 [PATCH 0/4] sched/fair: Fix PELT wobblies Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-06-17 12:01 ` [PATCH 3/4] sched,cgroup: Fix cpu_cgroup_fork() Peter Zijlstra
@ 2016-06-17 12:01 ` Peter Zijlstra
  2016-06-17 14:09   ` Vincent Guittot
                     ` (2 more replies)
  3 siblings, 3 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 12:01 UTC (permalink / raw)
  To: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot
  Cc: Peter Zijlstra

[-- Attachment #1: peterz-sched-fork-2.patch --]
[-- Type: text/plain, Size: 8105 bytes --]

Vincent and Yuyang found another few scenarios in which entity
tracking goes wobbly.

The scenarios are basically due to the fact that new tasks are not
immediately attached and thereby differ from the normal situation -- a
task is always attached to a cfs_rq load average (such that it
includes its blocked contribution) and is explicitly
detached/attached on migration to another cfs_rq.

Scenario 1: switch to fair class

  p->sched_class = fair_class;
  if (queued)
    enqueue_task(p);
      ...
        enqueue_entity()
	  enqueue_entity_load_avg()
	    migrated = !sa->last_update_time (true)
	    if (migrated)
	      attach_entity_load_avg()
  check_class_changed()
    switched_from() (!fair)
    switched_to()   (fair)
      switched_to_fair()
        attach_entity_load_avg()

If @p is a new task that hasn't been fair before, it will have
!last_update_time and, per the above, end up in
attach_entity_load_avg() _twice_.

Scenario 2: change between cgroups

  sched_move_group(p)
    if (queued)
      dequeue_task()
    task_move_group_fair()
      detach_task_cfs_rq()
        detach_entity_load_avg()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
    if (queued)
      enqueue_task();
        ...
          enqueue_entity()
	    enqueue_entity_load_avg()
	      migrated = !sa->last_update_time (true)
	      if (migrated)
	        attach_entity_load_avg()

Similarly to scenario 1, if @p is a new task, it will have
!last_update_time and we'll end up in attach_entity_load_avg()
_twice_.

Furthermore, notice how we do a detach_entity_load_avg() on something
that wasn't attached to begin with.

As stated above, the problem is that the new task isn't yet attached
to the load tracking and thereby violates the invariant assumption.

This patch remedies this by ensuring a new task is indeed properly
attached to the load tracking on creation, through
post_init_entity_util_avg().

Of course, this isn't entirely as straightforward as one might think,
since the task is hashed before we call wake_up_new_task() and thus
can be poked at. We avoid this by adding TASK_NEW and teaching
cpu_cgroup_can_attach() to refuse such tasks.
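
Sketched from the diff below, the guard in cpu_cgroup_can_attach() amounts to:

  /* Serialize against wake_up_new_task() so we observe its full state. */
  raw_spin_unlock_wait(&task->pi_lock);
  if (task->state == TASK_NEW)
          return -EINVAL;         /* not attached to the load tracking yet */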

Cc: Yuyang Du <yuyang.du@intel.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    5 +-
 kernel/sched/core.c   |   98 +++++++++++++++++++++++++++++++++++++-------------
 kernel/sched/fair.c   |   26 +++++++++----
 3 files changed, 94 insertions(+), 35 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -220,9 +220,10 @@ extern void proc_sched_set_task(struct t
 #define TASK_WAKING		256
 #define TASK_PARKED		512
 #define TASK_NOLOAD		1024
-#define TASK_STATE_MAX		2048
+#define TASK_NEW		2048
+#define TASK_STATE_MAX		4096
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2340,11 +2340,11 @@ int sched_fork(unsigned long clone_flags
 
 	__sched_fork(clone_flags, p);
 	/*
-	 * We mark the process as running here. This guarantees that
+	 * We mark the process as NEW here. This guarantees that
 	 * nobody will actually run it, and a signal or other external
 	 * event cannot wake it up and insert it on the runqueue either.
 	 */
-	p->state = TASK_RUNNING;
+	p->state = TASK_NEW;
 
 	/*
 	 * Make sure we do not leak PI boosting priority to the child.
@@ -2381,6 +2381,8 @@ int sched_fork(unsigned long clone_flags
 		p->sched_class = &fair_sched_class;
 	}
 
+	init_entity_runnable_average(&p->se);
+
 	/*
 	 * The child is not yet in the pid-hash so no cgroup attach races,
 	 * and the cgroup is pinned to this child due to cgroup_fork()
@@ -2389,6 +2391,10 @@ int sched_fork(unsigned long clone_flags
 	 * Silence PROVE_RCU.
 	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	/*
+	 * We're setting the cpu for the first time, we don't migrate,
+	 * so use __set_task_cpu().
+	 */
 	__set_task_cpu(p, cpu);
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
@@ -2523,16 +2529,18 @@ void wake_up_new_task(struct task_struct
 	struct rq_flags rf;
 	struct rq *rq;
 
-	/* Initialize new task's runnable average */
-	init_entity_runnable_average(&p->se);
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
+	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
 	/*
 	 * Fork balancing, do it here and not earlier because:
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
+	 *
+	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
+	 * as we're not fully set-up yet.
 	 */
-	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
 	post_init_entity_util_avg(&p->se);
@@ -8219,6 +8254,19 @@ static int cpu_cgroup_can_attach(struct
 		if (task->sched_class != &fair_sched_class)
 			return -EINVAL;
 #endif
+		/*
+		 * Serialize against wake_up_new_task() such
+		 * that if its running, we're sure to observe
+		 * its full state.
+		 */
+		raw_spin_unlock_wait(&task->pi_lock);
+		/*
+		 * Avoid calling sched_move_task() before wake_up_new_task()
+		 * has happened. This would lead to problems with PELT. See
+		 * XXX.
+		 */
+		if (task->state == TASK_NEW)
+			return -EINVAL;
 	}
 	return 0;
 }
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,6 +690,10 @@ void init_entity_runnable_average(struct
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
+static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
+static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
+
 /*
  * With new tasks being created, their initial util_avgs are extrapolated
  * based on the cfs_rq's current util_avg:
@@ -733,18 +737,21 @@ void post_init_entity_util_avg(struct sc
 		}
 		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
+
+	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
+	attach_entity_load_avg(cfs_rq, se);
 }
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);
-#else
+#else /* !CONFIG_SMP */
 void init_entity_runnable_average(struct sched_entity *se)
 {
 }
 void post_init_entity_util_avg(struct sched_entity *se)
 {
 }
-#endif
+#endif /* CONFIG_SMP */
 
 /*
  * Update the current task's runtime statistics.
@@ -2847,8 +2854,6 @@ void set_task_rq_fair(struct sched_entit
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
-
 static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -2958,6 +2963,8 @@ static void attach_entity_load_avg(struc
 	/*
 	 * If we got migrated (either between CPUs or between cgroups) we'll
 	 * have aged the average right before clearing @last_update_time.
+	 *
+	 * Or we're fresh through post_init_entity_util_avg().
 	 */
 	if (se->avg.last_update_time) {
 		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
@@ -3063,11 +3070,14 @@ void remove_entity_load_avg(struct sched
 	u64 last_update_time;
 
 	/*
-	 * Newly created task or never used group entity should not be removed
-	 * from its (source) cfs_rq
+	 * tasks cannot exit without having gone through wake_up_new_task() ->
+	 * post_init_entity_util_avg() which will have added things to the
+	 * cfs_rq, so we can remove unconditionally.
+	 *
+	 * Similarly for groups, they will have passed through
+	 * post_init_entity_util_avg() before unregister_sched_fair_group()
+	 * calls this.
 	 */
-	if (se->avg.last_update_time == 0)
-		return;
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
 

* Re: [PATCH 2/4] sched/fair: Fix PELT integrity for new groups
  2016-06-17 12:01 ` [PATCH 2/4] sched/fair: Fix PELT integrity for new groups Peter Zijlstra
@ 2016-06-17 13:51   ` Vincent Guittot
  0 siblings, 0 replies; 39+ messages in thread
From: Vincent Guittot @ 2016-06-17 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 17 June 2016 at 14:01, Peter Zijlstra <peterz@infradead.org> wrote:
> Vincent reported that when a new task is moved into a new cgroup it

The task doesn't have to be new, only the cgroup.

> gets attached twice to the load tracking.
>
>   sched_move_task()
>     task_move_group_fair()
>       detach_task_cfs_rq()
>       set_task_rq()
>       attach_task_cfs_rq()
>         attach_entity_load_avg()
>           se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0
>
>   enqueue_entity()
>     enqueue_entity_load_avg()
>       update_cfs_rq_load_avg()
>         now = clock()
>         __update_load_avg(&cfs_rq->avg)
>           cfs_rq->avg.last_load_update = now
>           // ages load/util for: now - 0, load/util -> 0
>       if (migrated)
>         attach_entity_load_avg()
>           se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0
>
> The problem is that we don't update cfs_rq load_avg before all
> entity attach/detach operations. Only enqueue and migrate_task do
> this.
>
> By fixing this, the above will not happen, because the
> sched_move_task() attach will have updated cfs_rq's last_load_update
> time before attach, and in turn the attach will have set the entity's
> last_load_update stamp.
>
> Note that there is a further problem with sched_move_task() calling
> detach on a task that hasn't yet been attached; this will be taken
> care of in a subsequent patch.

This patch fixes the double attach.
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Cc: Yuyang Du <yuyang.du@intel.com>
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |    4 ++++
>  1 file changed, 4 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8366,6 +8366,7 @@ static void detach_task_cfs_rq(struct ta
>  {
>         struct sched_entity *se = &p->se;
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       u64 now = cfs_rq_clock_task(cfs_rq);
>
>         if (!vruntime_normalized(p)) {
>                 /*
> @@ -8377,6 +8378,7 @@ static void detach_task_cfs_rq(struct ta
>         }
>
>         /* Catch up with the cfs_rq and remove our load when we leave */
> +       update_cfs_rq_load_avg(now, cfs_rq, false);
>         detach_entity_load_avg(cfs_rq, se);
>  }
>
> @@ -8384,6 +8386,7 @@ static void attach_task_cfs_rq(struct ta
>  {
>         struct sched_entity *se = &p->se;
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       u64 now = cfs_rq_clock_task(cfs_rq);
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         /*
> @@ -8394,6 +8397,7 @@ static void attach_task_cfs_rq(struct ta
>  #endif
>
>         /* Synchronize task with its cfs_rq */
> +       update_cfs_rq_load_avg(now, cfs_rq, false);
>         attach_entity_load_avg(cfs_rq, se);
>
>         if (!vruntime_normalized(p))
>
>

* Re: [PATCH 3/4] sched,cgroup: Fix cpu_cgroup_fork()
  2016-06-17 12:01 ` [PATCH 3/4] sched,cgroup: Fix cpu_cgroup_fork() Peter Zijlstra
@ 2016-06-17 13:58   ` Vincent Guittot
  2016-06-17 14:06     ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Vincent Guittot @ 2016-06-17 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 17 June 2016 at 14:01, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Vincent Guittot <vincent.guittot@linaro.org>
>
> A new fair task is detached and attached from/to task_group with:
>
>   cgroup_post_fork()
>     ss->fork(child) := cpu_cgroup_fork()
>       sched_move_task()
>         task_move_group_fair()
>
> Which is wrong, because at this point in fork() the task isn't fully
> initialized and it cannot 'move' to another group, because its not
> attached to any group as yet.
>
> In fact, cpu_cgroup_fork needs a small part of sched_move_task so we
> can just call this small part directly instead sched_move_task. And
> the task doesn't really migrate because it is not yet attached so we
> need the sequence:
>
>   do_fork()
>     sched_fork()
>       __set_task_cpu()
>
>     cgroup_post_fork()
>       set_task_rq() # set task group and runqueue
>
>     wake_up_new_task()
>       select_task_rq() can select a new cpu
>       __set_task_cpu
>       post_init_entity_util_avg
>         attach_task_cfs_rq()
>       activate_task
>         enqueue_task
>
> This patch makes that happen.
>

With this patch and patch 1, the fork sequence looks correct in my test

> Maybe-Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

You can remove the Maybe if you want

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   67 ++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 47 insertions(+), 20 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7743,27 +7743,17 @@ void sched_offline_group(struct task_gro
>         spin_unlock_irqrestore(&task_group_lock, flags);
>  }
>
> -/* change task's runqueue when it moves between groups.
> - *     The caller of this function should have put the task in its new group
> - *     by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
> - *     reflect its new group.
> +/*
> + * Set task's runqueue and group.
> + *
> + * In case of a move between group, we update src and dst group thanks to
> + * sched_class->task_move_group. Otherwise, we just need to set runqueue and
> + * group pointers. The task will be attached to the runqueue during its wake
> + * up.
>   */
> -void sched_move_task(struct task_struct *tsk)
> +static void sched_set_group(struct task_struct *tsk, bool move)
>  {
>         struct task_group *tg;
> -       int queued, running;
> -       struct rq_flags rf;
> -       struct rq *rq;
> -
> -       rq = task_rq_lock(tsk, &rf);
> -
> -       running = task_current(rq, tsk);
> -       queued = task_on_rq_queued(tsk);
> -
> -       if (queued)
> -               dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE);
> -       if (unlikely(running))
> -               put_prev_task(rq, tsk);
>
>         /*
>          * All callers are synchronized by task_rq_lock(); we do not use RCU
> @@ -7776,11 +7766,37 @@ void sched_move_task(struct task_struct
>         tsk->sched_task_group = tg;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -       if (tsk->sched_class->task_move_group)
> +       if (move && tsk->sched_class->task_move_group)
>                 tsk->sched_class->task_move_group(tsk);
>         else
>  #endif
>                 set_task_rq(tsk, task_cpu(tsk));
> +}
> +
> +/*
> + * Change task's runqueue when it moves between groups.
> + *
> + * The caller of this function should have put the task in its new group by
> + * now. This function just updates tsk->se.cfs_rq and tsk->se.parent to reflect
> + * its new group.
> + */
> +void sched_move_task(struct task_struct *tsk)
> +{
> +       int queued, running;
> +       struct rq_flags rf;
> +       struct rq *rq;
> +
> +       rq = task_rq_lock(tsk, &rf);
> +
> +       running = task_current(rq, tsk);
> +       queued = task_on_rq_queued(tsk);
> +
> +       if (queued)
> +               dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE);
> +       if (unlikely(running))
> +               put_prev_task(rq, tsk);
> +
> +       sched_set_group(tsk, true);
>
>         if (unlikely(running))
>                 tsk->sched_class->set_curr_task(rq);
> @@ -8208,9 +8224,20 @@ static void cpu_cgroup_css_free(struct c
>         sched_free_group(tg);
>  }
>
> +/*
> + * This is called before wake_up_new_task(), therefore we really only
> + * have to set its group bits, all the other stuff does not apply.
> + */
>  static void cpu_cgroup_fork(struct task_struct *task)
>  {
> -       sched_move_task(task);
> +       struct rq_flags rf;
> +       struct rq *rq;
> +
> +       rq = task_rq_lock(task, &rf);
> +
> +       sched_set_group(task, false);
> +
> +       task_rq_unlock(rq, task, &rf);
>  }
>
>  static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
>
>

* Re: [PATCH 3/4] sched,cgroup: Fix cpu_cgroup_fork()
  2016-06-17 13:58   ` Vincent Guittot
@ 2016-06-17 14:06     ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 14:06 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Fri, Jun 17, 2016 at 03:58:37PM +0200, Vincent Guittot wrote:
> On 17 June 2016 at 14:01, Peter Zijlstra <peterz@infradead.org> wrote:
> > From: Vincent Guittot <vincent.guittot@linaro.org>
> >
> > A new fair task is detached and attached from/to task_group with:
> >
> >   cgroup_post_fork()
> >     ss->fork(child) := cpu_cgroup_fork()
> >       sched_move_task()
> >         task_move_group_fair()
> >
> > Which is wrong, because at this point in fork() the task isn't fully
> > initialized and it cannot 'move' to another group, because its not
> > attached to any group as yet.
> >
> > In fact, cpu_cgroup_fork needs a small part of sched_move_task so we
> > can just call this small part directly instead sched_move_task. And
> > the task doesn't really migrate because it is not yet attached so we
> > need the sequence:
> >
> >   do_fork()
> >     sched_fork()
> >       __set_task_cpu()
> >
> >     cgroup_post_fork()
> >       set_task_rq() # set task group and runqueue
> >
> >     wake_up_new_task()
> >       select_task_rq() can select a new cpu
> >       __set_task_cpu
> >       post_init_entity_util_avg
> >         attach_task_cfs_rq()
> >       activate_task
> >         enqueue_task
> >
> > This patch makes that happen.
> >
> 
> With this patch and patch 1, the fork sequence looks correct in my test
> 
> > Maybe-Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> You can remove the Maybe if you want

Thanks!

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 12:01 ` [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks Peter Zijlstra
@ 2016-06-17 14:09   ` Vincent Guittot
  2016-06-17 14:28     ` Peter Zijlstra
  2016-06-23 11:19   ` Peter Zijlstra
  2016-08-01  7:30   ` Wanpeng Li
  2 siblings, 1 reply; 39+ messages in thread
From: Vincent Guittot @ 2016-06-17 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 17 June 2016 at 14:01, Peter Zijlstra <peterz@infradead.org> wrote:
> Vincent and Yuyang found another few scenarios in which entity
> tracking goes wobbly.
>
> The scenarios are basically due to the fact that new tasks are not
> immediately attached and thereby differ from the normal situation -- a
> task is always attached to a cfs_rq load average (such that it
> includes its blocked contribution) and are explicitly
> detached/attached on migration to another cfs_rq.
>
> Scenario 1: switch to fair class
>
>   p->sched_class = fair_class;
>   if (queued)
>     enqueue_task(p);
>       ...
>         enqueue_entity()
>           enqueue_entity_load_avg()
>             migrated = !sa->last_update_time (true)
>             if (migrated)
>               attach_entity_load_avg()
>   check_class_changed()
>     switched_from() (!fair)
>     switched_to()   (fair)
>       switched_to_fair()
>         attach_entity_load_avg()
>
> If @p is a new task that hasn't been fair before, it will have
> !last_update_time and, per the above, end up in
> attach_entity_load_avg() _twice_.
>
> Scenario 2: change between cgroups
>
>   sched_move_group(p)
>     if (queued)
>       dequeue_task()
>     task_move_group_fair()
>       detach_task_cfs_rq()
>         detach_entity_load_avg()
>       set_task_rq()
>       attach_task_cfs_rq()
>         attach_entity_load_avg()
>     if (queued)
>       enqueue_task();
>         ...
>           enqueue_entity()
>             enqueue_entity_load_avg()
>               migrated = !sa->last_update_time (true)
>               if (migrated)
>                 attach_entity_load_avg()
>
> Similar as with scenario 1, if @p is a new task, it will have
> !load_update_time and we'll end up in attach_entity_load_avg()
> _twice_.
>
> Furthermore, notice how we do a detach_entity_load_avg() on something
> that wasn't attached to begin with.
>
> As stated above; the problem is that the new task isn't yet attached
> to the load tracking and thereby violates the invariant assumption.
>
> This patch remedies this by ensuring a new task is indeed properly
> attached to the load tracking on creation, through
> post_init_entity_util_avg().
>
> Of course, this isn't entirely as straight forward as one might think,
> since the task is hashed before we call wake_up_new_task() and thus
> can be poked at. We avoid this by adding TASK_NEW and teaching
> cpu_cgroup_can_attach() to refuse such tasks.
>
> Cc: Yuyang Du <yuyang.du@intel.com>
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
...
>
> +static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> +static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
> +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
> +
>  /*
>   * With new tasks being created, their initial util_avgs are extrapolated
>   * based on the cfs_rq's current util_avg:
> @@ -733,18 +737,21 @@ void post_init_entity_util_avg(struct sc
>                 }
>                 sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>         }
> +
> +       update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> +       attach_entity_load_avg(cfs_rq, se);

A new RT task will be attached and will contribute to the load until
it decays to 0
Should we detach it for non cfs task ? We just want to update
last_update_time of RT task to something different from 0

>  }
>
>  static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
>  static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);
> -#else
> +#else /* !CONFIG_SMP */
>  void init_entity_runnable_average(struct sched_entity *se)
>  {
>  }
>  void post_init_entity_util_avg(struct sched_entity *se)
>  {
>  }
> -#endif
> +#endif /* CONFIG_SMP */
>
>  /*
>   * Update the current task's runtime statistics.
> @@ -2847,8 +2854,6 @@ void set_task_rq_fair(struct sched_entit
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> -
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
>         struct rq *rq = rq_of(cfs_rq);
> @@ -2958,6 +2963,8 @@ static void attach_entity_load_avg(struc
>         /*
>          * If we got migrated (either between CPUs or between cgroups) we'll
>          * have aged the average right before clearing @last_update_time.
> +        *
> +        * Or we're fresh through post_init_entity_util_avg().
>          */
>         if (se->avg.last_update_time) {
>                 __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
> @@ -3063,11 +3070,14 @@ void remove_entity_load_avg(struct sched
>         u64 last_update_time;
>
>         /*
> -        * Newly created task or never used group entity should not be removed
> -        * from its (source) cfs_rq
> +        * tasks cannot exit without having gone through wake_up_new_task() ->
> +        * post_init_entity_util_avg() which will have added things to the
> +        * cfs_rq, so we can remove unconditionally.
> +        *
> +        * Similarly for groups, they will have passed through
> +        * post_init_entity_util_avg() before unregister_sched_fair_group()
> +        * calls this.
>          */
> -       if (se->avg.last_update_time == 0)
> -               return;
>
>         last_update_time = cfs_rq_last_update_time(cfs_rq);
>
>
>

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 14:09   ` Vincent Guittot
@ 2016-06-17 14:28     ` Peter Zijlstra
  2016-06-17 16:02       ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 14:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Fri, Jun 17, 2016 at 04:09:01PM +0200, Vincent Guittot wrote:
> >
> > Scenario 1: switch to fair class
> >
> >   p->sched_class = fair_class;
> >   if (queued)
> >     enqueue_task(p);
> >       ...
> >         enqueue_entity()
> >           enqueue_entity_load_avg()
> >             migrated = !sa->last_update_time (true)
> >             if (migrated)
> >               attach_entity_load_avg()
> >   check_class_changed()
> >     switched_from() (!fair)
> >     switched_to()   (fair)
> >       switched_to_fair()
> >         attach_entity_load_avg()

> > @@ -733,18 +737,21 @@ void post_init_entity_util_avg(struct sc
> >                 }
> >                 sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> >         }
> > +
> > +       update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> > +       attach_entity_load_avg(cfs_rq, se);
> 
> A new RT task will be attached and will contribute to the load until
> it decays to 0
> Should we detach it for non cfs task ? We just want to update
> last_update_time of RT task to something different from 0

Won't we open up that (scenario 1) hole again? Maybe we can initialize
!fair tasks to 0 load and have them ramp up once they get demoted back
to fair?

Blergh, I'll go ponder moar.

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 14:28     ` Peter Zijlstra
@ 2016-06-17 16:02       ` Peter Zijlstra
  2016-06-17 16:14         ` Vincent Guittot
  2016-06-17 16:18         ` Peter Zijlstra
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 16:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Fri, Jun 17, 2016 at 04:28:14PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 17, 2016 at 04:09:01PM +0200, Vincent Guittot wrote:
> > >
> > > Scenario 1: switch to fair class
> > >
> > >   p->sched_class = fair_class;
> > >   if (queued)
> > >     enqueue_task(p);
> > >       ...
> > >         enqueue_entity()
> > >           enqueue_entity_load_avg()
> > >             migrated = !sa->last_update_time (true)
> > >             if (migrated)
> > >               attach_entity_load_avg()
> > >   check_class_changed()
> > >     switched_from() (!fair)
> > >     switched_to()   (fair)
> > >       switched_to_fair()
> > >         attach_entity_load_avg()
> 
> > > @@ -733,18 +737,21 @@ void post_init_entity_util_avg(struct sc
> > >                 }
> > >                 sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> > >         }
> > > +
> > > +       update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> > > +       attach_entity_load_avg(cfs_rq, se);
> > 
> > A new RT task will be attached and will contribute to the load until
> > it decays to 0

> > Should we detach it for non cfs task ?

Right, an attach + detach, which basically ends up being:

> > We just want to update
> > last_update_time of RT task to something different from 0

this. That's the same as starting as fair and then doing
switched_from_fair().

So yes, ho-humm, how to go about doing that bestest. Lemme have a play.

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 16:02       ` Peter Zijlstra
@ 2016-06-17 16:14         ` Vincent Guittot
  2016-06-17 16:18         ` Peter Zijlstra
  1 sibling, 0 replies; 39+ messages in thread
From: Vincent Guittot @ 2016-06-17 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 17 June 2016 at 18:02, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jun 17, 2016 at 04:28:14PM +0200, Peter Zijlstra wrote:
>> On Fri, Jun 17, 2016 at 04:09:01PM +0200, Vincent Guittot wrote:
>> > >
>> > > Scenario 1: switch to fair class
>> > >
>> > >   p->sched_class = fair_class;
>> > >   if (queued)
>> > >     enqueue_task(p);
>> > >       ...
>> > >         enqueue_entity()
>> > >           enqueue_entity_load_avg()
>> > >             migrated = !sa->last_update_time (true)
>> > >             if (migrated)
>> > >               attach_entity_load_avg()
>> > >   check_class_changed()
>> > >     switched_from() (!fair)
>> > >     switched_to()   (fair)
>> > >       switched_to_fair()
>> > >         attach_entity_load_avg()
>>
>> > > @@ -733,18 +737,21 @@ void post_init_entity_util_avg(struct sc
>> > >                 }
>> > >                 sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>> > >         }
>> > > +
>> > > +       update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
>> > > +       attach_entity_load_avg(cfs_rq, se);
>> >
>> > A new RT task will be attached and will contribute to the load until
>> > it decays to 0
>
>> > Should we detach it for non cfs task ?
>
> Right, an attach + detach, which basically ends up being:
>
>> > We just want to update
>> > last_update_time of RT task to something different from 0
>
> this. That's the same as starting as fair and then doing
> switched_from_fair().

Yes, that's also a possibility: add a way to call switched_from_fair()
for non-fair tasks.

>
> So yes, ho-humm, how to go about doing that bestest. Lemme have a play.
>

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 16:02       ` Peter Zijlstra
  2016-06-17 16:14         ` Vincent Guittot
@ 2016-06-17 16:18         ` Peter Zijlstra
  2016-06-19 22:55           ` Yuyang Du
                             ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-17 16:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Fri, Jun 17, 2016 at 06:02:39PM +0200, Peter Zijlstra wrote:
> So yes, ho-humm, how to go about doing that bestest. Lemme have a play.

This is what I came up with, not entirely pretty, but I suppose it'll
have to do.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -724,6 +724,7 @@ void post_init_entity_util_avg(struct sc
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
 	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
+	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
 		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
 
-	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+		if (p->sched_class != &fair_sched_class) {
+			/*
+			 * For !fair tasks do attach_entity_load_avg()
+			 * followed by detach_entity_load_avg() as per
+			 * switched_from_fair().
+			 */
+			se->avg.last_update_time = now;
+			return;
+		}
+	}
+
+	update_cfs_rq_load_avg(now, cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 }
 

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 16:18         ` Peter Zijlstra
@ 2016-06-19 22:55           ` Yuyang Du
  2016-06-20  9:23           ` Vincent Guittot
  2016-06-20 11:35           ` Dietmar Eggemann
  2 siblings, 0 replies; 39+ messages in thread
From: Yuyang Du @ 2016-06-19 22:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Fri, Jun 17, 2016 at 06:18:31PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 17, 2016 at 06:02:39PM +0200, Peter Zijlstra wrote:
> > So yes, ho-humm, how to go about doing that bestest. Lemme have a play.
> 
> This is what I came up with, not entirely pretty, but I suppose it'll
> have to do.
 
FWIW, I don't think the entire fix is pretty, so I will still post my
amended fix later.

> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -724,6 +724,7 @@ void post_init_entity_util_avg(struct sc
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  	struct sched_avg *sa = &se->avg;
>  	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
> +	u64 now = cfs_rq_clock_task(cfs_rq);
>  
>  	if (cap > 0) {
>  		if (cfs_rq->avg.util_avg != 0) {
> @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
>  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>  	}
>  
> -	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> +	if (entity_is_task(se)) {
> +		struct task_struct *p = task_of(se);
> +		if (p->sched_class != &fair_sched_class) {
> +			/*
> +			 * For !fair tasks do attach_entity_load_avg()
> +			 * followed by detach_entity_load_avg() as per
> +			 * switched_from_fair().
> +			 */
> +			se->avg.last_update_time = now;
> +			return;
> +		}
> +	}
> +
> +	update_cfs_rq_load_avg(now, cfs_rq, false);
>  	attach_entity_load_avg(cfs_rq, se);
>  }
>  

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 16:18         ` Peter Zijlstra
  2016-06-19 22:55           ` Yuyang Du
@ 2016-06-20  9:23           ` Vincent Guittot
  2016-06-20  9:52             ` Peter Zijlstra
  2016-06-21 11:43             ` Peter Zijlstra
  2016-06-20 11:35           ` Dietmar Eggemann
  2 siblings, 2 replies; 39+ messages in thread
From: Vincent Guittot @ 2016-06-20  9:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Friday 17 Jun 2016 at 18:18:31 (+0200), Peter Zijlstra wrote:
> On Fri, Jun 17, 2016 at 06:02:39PM +0200, Peter Zijlstra wrote:
> > So yes, ho-humm, how to go about doing that bestest. Lemme have a play.
> 
> This is what I came up with, not entirely pretty, but I suppose it'll
> have to do.
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -724,6 +724,7 @@ void post_init_entity_util_avg(struct sc
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  	struct sched_avg *sa = &se->avg;
>  	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
> +	u64 now = cfs_rq_clock_task(cfs_rq);
>  
>  	if (cap > 0) {
>  		if (cfs_rq->avg.util_avg != 0) {
> @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
>  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>  	}
>  
> -	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> +	if (entity_is_task(se)) {

Why only task ?

> +		struct task_struct *p = task_of(se);
> +		if (p->sched_class != &fair_sched_class) {
> +			/*
> +			 * For !fair tasks do attach_entity_load_avg()
> +			 * followed by detach_entity_load_avg() as per
> +			 * switched_from_fair().
> +			 */
> +			se->avg.last_update_time = now;
> +			return;
> +		}
> +	}
> +
> +	update_cfs_rq_load_avg(now, cfs_rq, false);
>  	attach_entity_load_avg(cfs_rq, se);

Don't we have to do a complete attach with attach_task_cfs_rq instead of just the load_avg ? to set also depth ?

What about something like below ?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -723,6 +723,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
 	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
+	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -737,8 +738,18 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
 
-	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
-	attach_entity_load_avg(cfs_rq, se);
+	if (p->sched_class == &fair_sched_class) {
+		/* fair entity must be attached to cfs_rq */
+		attach_task_cfs_rq(se);
+	} else {
+		/*
+		 * For !fair tasks do attach_entity_load_avg()
+		 * followed by detach_entity_load_avg() as per
+		 * switched_from_fair().
+		 */
+		se->avg.last_update_time = now;
+	}
+
 }
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
-- 

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-20  9:23           ` Vincent Guittot
@ 2016-06-20  9:52             ` Peter Zijlstra
  2016-06-20 10:07               ` Vincent Guittot
  2016-06-21 11:43             ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-20  9:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Mon, Jun 20, 2016 at 11:23:39AM +0200, Vincent Guittot wrote:

> > @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
> >  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> >  	}
> >  
> > -	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> > +	if (entity_is_task(se)) {
> 
> Why only task ?

Only tasks can be of another class, the cgroup se's must be fair.

> > +		struct task_struct *p = task_of(se);
> > +		if (p->sched_class != &fair_sched_class) {
> > +			/*
> > +			 * For !fair tasks do attach_entity_load_avg()
> > +			 * followed by detach_entity_load_avg() as per
> > +			 * switched_from_fair().
> > +			 */
> > +			se->avg.last_update_time = now;
> > +			return;
> > +		}
> > +	}
> > +
> > +	update_cfs_rq_load_avg(now, cfs_rq, false);
> >  	attach_entity_load_avg(cfs_rq, se);
> 
> Don't we have to do a complete attach with attach_task_cfs_rq instead
> of just the load_avg ? to set also depth ?

init_tg_cfs_entity() sets depth for cgroup entities,
attach_task_cfs_rq() reset depth on switched_to().

So that should be fine.

> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -723,6 +723,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  	struct sched_avg *sa = &se->avg;
>  	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
> +	u64 now = cfs_rq_clock_task(cfs_rq);
>  
>  	if (cap > 0) {
>  		if (cfs_rq->avg.util_avg != 0) {
> @@ -737,8 +738,18 @@ void post_init_entity_util_avg(struct sched_entity *se)
>  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>  	}
>  
> -	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> -	attach_entity_load_avg(cfs_rq, se);

There is no p here...

> +	if (p->sched_class == &fair_sched_class) {
> +		/* fair entity must be attached to cfs_rq */
> +		attach_task_cfs_rq(se);
> +	} else {
> +		/*
> +		 * For !fair tasks do attach_entity_load_avg()
> +		 * followed by detach_entity_load_avg() as per
> +		 * switched_from_fair().
> +		 */
> +		se->avg.last_update_time = now;
> +	}
> +
>  }

And you're now not attaching new cgroup thingies.

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-20  9:52             ` Peter Zijlstra
@ 2016-06-20 10:07               ` Vincent Guittot
  0 siblings, 0 replies; 39+ messages in thread
From: Vincent Guittot @ 2016-06-20 10:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 20 June 2016 at 11:52, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 20, 2016 at 11:23:39AM +0200, Vincent Guittot wrote:
>
>> > @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
>> >             sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>> >     }
>> >
>> > -   update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
>> > +   if (entity_is_task(se)) {
>>
>> Why only task ?
>
> Only tasks can be of another class, the cgroup se's must be fair.

ok

>
>> > +           struct task_struct *p = task_of(se);
>> > +           if (p->sched_class != &fair_sched_class) {
>> > +                   /*
>> > +                    * For !fair tasks do attach_entity_load_avg()
>> > +                    * followed by detach_entity_load_avg() as per
>> > +                    * switched_from_fair().
>> > +                    */
>> > +                   se->avg.last_update_time = now;
>> > +                   return;
>> > +           }
>> > +   }
>> > +
>> > +   update_cfs_rq_load_avg(now, cfs_rq, false);
>> >     attach_entity_load_avg(cfs_rq, se);
>>
>> Don't we have to do a complete attach with attach_task_cfs_rq instead
>> of just the load_avg ? to set also depth ?
>
> init_tg_cfs_entity() sets depth for cgroup entities,
> attach_task_cfs_rq() reset depth on switched_to().
>
> So that should be fine.

But for a new fair task? It will not be set.

>
>> ---
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -723,6 +723,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>       struct sched_avg *sa = &se->avg;
>>       long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>> +     u64 now = cfs_rq_clock_task(cfs_rq);
>>
>>       if (cap > 0) {
>>               if (cfs_rq->avg.util_avg != 0) {
>> @@ -737,8 +738,18 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>               sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>>       }
>>
>> -     update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
>> -     attach_entity_load_avg(cfs_rq, se);
>
> There is no p here...

Yes, my condition is just wrong.

>
>> +     if (p->sched_class == &fair_sched_class) {
>> +             /* fair entity must be attached to cfs_rq */
>> +             attach_task_cfs_rq(se);
>> +     } else {
>> +             /*
>> +              * For !fair tasks do attach_entity_load_avg()
>> +              * followed by detach_entity_load_avg() as per
>> +              * switched_from_fair().
>> +              */
>> +             se->avg.last_update_time = now;
>> +     }
>> +
>>  }
>
> And you're now not attaching new cgroup thingies.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 16:18         ` Peter Zijlstra
  2016-06-19 22:55           ` Yuyang Du
  2016-06-20  9:23           ` Vincent Guittot
@ 2016-06-20 11:35           ` Dietmar Eggemann
  2016-06-20 12:35             ` Vincent Guittot
  2016-06-20 14:27             ` Peter Zijlstra
  2 siblings, 2 replies; 39+ messages in thread
From: Dietmar Eggemann @ 2016-06-20 11:35 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Matt Fleming



On 17/06/16 17:18, Peter Zijlstra wrote:
> On Fri, Jun 17, 2016 at 06:02:39PM +0200, Peter Zijlstra wrote:
>> So yes, ho-humm, how to go about doing that bestest. Lemme have a play.
> 
> This is what I came up with, not entirely pretty, but I suppose it'll
> have to do.
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -724,6 +724,7 @@ void post_init_entity_util_avg(struct sc
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  	struct sched_avg *sa = &se->avg;
>  	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
> +	u64 now = cfs_rq_clock_task(cfs_rq);
>  
>  	if (cap > 0) {
>  		if (cfs_rq->avg.util_avg != 0) {
> @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
>  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>  	}
>  
> -	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> +	if (entity_is_task(se)) {
> +		struct task_struct *p = task_of(se);
> +		if (p->sched_class != &fair_sched_class) {
> +			/*
> +			 * For !fair tasks do attach_entity_load_avg()
> +			 * followed by detach_entity_load_avg() as per
> +			 * switched_from_fair().
> +			 */
> +			se->avg.last_update_time = now;
> +			return;
> +		}
> +	}
> +
> +	update_cfs_rq_load_avg(now, cfs_rq, false);
>  	attach_entity_load_avg(cfs_rq, se);
>  }
>  
> 

Doesn't a sleeping !fair_sched_class task which switches to fair use
try_to_wake_up() rather than wake_up_new_task(), so it won't go through
post_init_entity_util_avg()?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-20 11:35           ` Dietmar Eggemann
@ 2016-06-20 12:35             ` Vincent Guittot
  2016-06-20 14:49               ` Dietmar Eggemann
  2016-06-20 14:27             ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Vincent Guittot @ 2016-06-20 12:35 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On 20 June 2016 at 13:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
>
> On 17/06/16 17:18, Peter Zijlstra wrote:
>> On Fri, Jun 17, 2016 at 06:02:39PM +0200, Peter Zijlstra wrote:
>>> So yes, ho-humm, how to go about doing that bestest. Lemme have a play.
>>
>> This is what I came up with, not entirely pretty, but I suppose it'll
>> have to do.
>>
>> ---
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -724,6 +724,7 @@ void post_init_entity_util_avg(struct sc
>>       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>       struct sched_avg *sa = &se->avg;
>>       long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>> +     u64 now = cfs_rq_clock_task(cfs_rq);
>>
>>       if (cap > 0) {
>>               if (cfs_rq->avg.util_avg != 0) {
>> @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
>>               sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>>       }
>>
>> -     update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
>> +     if (entity_is_task(se)) {
>> +             struct task_struct *p = task_of(se);
>> +             if (p->sched_class != &fair_sched_class) {
>> +                     /*
>> +                      * For !fair tasks do attach_entity_load_avg()
>> +                      * followed by detach_entity_load_avg() as per
>> +                      * switched_from_fair().
>> +                      */
>> +                     se->avg.last_update_time = now;
>> +                     return;
>> +             }
>> +     }
>> +
>> +     update_cfs_rq_load_avg(now, cfs_rq, false);
>>       attach_entity_load_avg(cfs_rq, se);
>>  }
>>
>>
>
> Doesn't a sleeping !fair_sched_class task which switches to fair uses
> try_to_wake_up() rather than wake_up_new_task() so it won't go through
> post_init_entity_util_avg()?

It will go through wake_up_new_task() and post_init_entity_util_avg()
during its fork, which is enough to set last_update_time. Then it will
go through switched_to_fair() if the task later becomes a fair one.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-20 11:35           ` Dietmar Eggemann
  2016-06-20 12:35             ` Vincent Guittot
@ 2016-06-20 14:27             ` Peter Zijlstra
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-20 14:27 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On Mon, Jun 20, 2016 at 12:35:59PM +0100, Dietmar Eggemann wrote:
> 
> 
> On 17/06/16 17:18, Peter Zijlstra wrote:
> > On Fri, Jun 17, 2016 at 06:02:39PM +0200, Peter Zijlstra wrote:
> >> So yes, ho-humm, how to go about doing that bestest. Lemme have a play.
> > 
> > This is what I came up with, not entirely pretty, but I suppose it'll
> > have to do.
> > 
> > ---
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -724,6 +724,7 @@ void post_init_entity_util_avg(struct sc
> >  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >  	struct sched_avg *sa = &se->avg;
> >  	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
> > +	u64 now = cfs_rq_clock_task(cfs_rq);
> >  
> >  	if (cap > 0) {
> >  		if (cfs_rq->avg.util_avg != 0) {
> > @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
> >  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> >  	}
> >  
> > -	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
> > +	if (entity_is_task(se)) {
> > +		struct task_struct *p = task_of(se);
> > +		if (p->sched_class != &fair_sched_class) {
> > +			/*
> > +			 * For !fair tasks do attach_entity_load_avg()
> > +			 * followed by detach_entity_load_avg() as per
> > +			 * switched_from_fair().
> > +			 */
> > +			se->avg.last_update_time = now;
> > +			return;
> > +		}
> > +	}
> > +
> > +	update_cfs_rq_load_avg(now, cfs_rq, false);
> >  	attach_entity_load_avg(cfs_rq, se);
> >  }
> >  
> > 
> 
> Doesn't a sleeping !fair_sched_class task which switches to fair uses
> try_to_wake_up() rather than wake_up_new_task() so it won't go through
> post_init_entity_util_avg()?

It's switched_to_fair() that'll fix that up.
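
For reference, the relevant part of switched_to_fair() at this point in the
tree reads roughly as below; the body is paraphrased from the fair.c of this
era rather than quoted, so treat the details as an approximation. The point
that matters for the question above is the unconditional attach_task_cfs_rq()
call:

static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
	/* attach the entity (and its load) to its cfs_rq */
	attach_task_cfs_rq(p);

	if (task_on_rq_queued(p)) {
		/*
		 * We were most likely switched from sched_rt, so
		 * kick off the schedule if running, otherwise just see
		 * if we can still preempt the current task.
		 */
		if (rq->curr == p)
			resched_curr(rq);
		else
			check_preempt_curr(rq, p, 0);
	}
}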

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-20 12:35             ` Vincent Guittot
@ 2016-06-20 14:49               ` Dietmar Eggemann
  2016-06-21  8:41                 ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Dietmar Eggemann @ 2016-06-20 14:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On 20/06/16 13:35, Vincent Guittot wrote:
> On 20 June 2016 at 13:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>>
>> On 17/06/16 17:18, Peter Zijlstra wrote:
>>> On Fri, Jun 17, 2016 at 06:02:39PM +0200, Peter Zijlstra wrote:
>>>> So yes, ho-humm, how to go about doing that bestest. Lemme have a play.
>>>
>>> This is what I came up with, not entirely pretty, but I suppose it'll
>>> have to do.
>>>
>>> ---
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -724,6 +724,7 @@ void post_init_entity_util_avg(struct sc
>>>       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>       struct sched_avg *sa = &se->avg;
>>>       long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>>> +     u64 now = cfs_rq_clock_task(cfs_rq);
>>>
>>>       if (cap > 0) {
>>>               if (cfs_rq->avg.util_avg != 0) {
>>> @@ -738,7 +739,20 @@ void post_init_entity_util_avg(struct sc
>>>               sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>>>       }
>>>
>>> -     update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, false);
>>> +     if (entity_is_task(se)) {
>>> +             struct task_struct *p = task_of(se);
>>> +             if (p->sched_class != &fair_sched_class) {
>>> +                     /*
>>> +                      * For !fair tasks do attach_entity_load_avg()
>>> +                      * followed by detach_entity_load_avg() as per
>>> +                      * switched_from_fair().
>>> +                      */
>>> +                     se->avg.last_update_time = now;
>>> +                     return;
>>> +             }
>>> +     }
>>> +
>>> +     update_cfs_rq_load_avg(now, cfs_rq, false);
>>>       attach_entity_load_avg(cfs_rq, se);
>>>  }
>>>
>>>
>>
>> Doesn't a sleeping !fair_sched_class task which switches to fair uses
>> try_to_wake_up() rather than wake_up_new_task() so it won't go through
>> post_init_entity_util_avg()?
> 
> It will go through wake_up_new_task and post_init_entity_util_avg
> during its fork which is enough to set last_update_time. Then, it will
> use the switched_to_fair if the task becomes a fair one

Oh I see. We want to make sure that every task (even when forked as
!fair) has a last_update_time value != 0, when becoming fair one day.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21  8:41                 ` Peter Zijlstra
@ 2016-06-21  4:51                   ` Yuyang Du
  2016-06-24 13:03                     ` Peter Zijlstra
  2016-06-21 13:17                   ` Peter Zijlstra
  2016-06-23 15:35                   ` Dietmar Eggemann
  2 siblings, 1 reply; 39+ messages in thread
From: Yuyang Du @ 2016-06-21  4:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Vincent Guittot, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On Tue, Jun 21, 2016 at 10:41:19AM +0200, Peter Zijlstra wrote:
> On Mon, Jun 20, 2016 at 03:49:34PM +0100, Dietmar Eggemann wrote:
> > On 20/06/16 13:35, Vincent Guittot wrote:
> 
> > > It will go through wake_up_new_task and post_init_entity_util_avg
> > > during its fork which is enough to set last_update_time. Then, it will
> > > use the switched_to_fair if the task becomes a fair one
> > 
> > Oh I see. We want to make sure that every task (even when forked as
> > !fair) has a last_update_time value != 0, when becoming fair one day.
> 
> Right, see 2 below. I need to write a bunch of comments explaining PELT
> proper, as well as document these things.
> 
> The things we ran into with these patches were that:
> 
>  1) You need to update the cfs_rq _before_ any entity attach/detach
>     (and might need to update_tg_load_avg when update_cfs_rq_load_avg()
>     returns true).

This is intrinsically an additional update, not a fix to anything. I
don't think it is a must, but I am fine with it.
 
>  2) (fair) entities are always attached, switched_from/to deal with !fair.
 
Yes, me too.

>  3) cpu migration is the only exception and uses the last_update_time=0
>     thing -- because refusal to take second rq->lock.

A task's last_update_time means this task is detached from the fair queue.
This (re)definition is by all means much better than migrating. No?

> Which is why I dislike Yuyang's patches, they create more exceptions
> instead of applying existing rules (albeit undocumented).
> 
> Esp. 1 is important, because while for mathematically consistency you
> don't actually need to do this, you only need the entities to be
> up-to-date with the cfs rq when you attach/detach, but that forgets the
> temporal aspect of _when_ you do this.
 
Yes, temporally at any instant the avgs are outdated. But, I can have it,
and what if I have it?

I am thinking about documenting this really well, like "An art of load tracking:
accuracy, overhead, and usefulness", seriously.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-20 14:49               ` Dietmar Eggemann
@ 2016-06-21  8:41                 ` Peter Zijlstra
  2016-06-21  4:51                   ` Yuyang Du
                                     ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-21  8:41 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On Mon, Jun 20, 2016 at 03:49:34PM +0100, Dietmar Eggemann wrote:
> On 20/06/16 13:35, Vincent Guittot wrote:

> > It will go through wake_up_new_task and post_init_entity_util_avg
> > during its fork which is enough to set last_update_time. Then, it will
> > use the switched_to_fair if the task becomes a fair one
> 
> Oh I see. We want to make sure that every task (even when forked as
> !fair) has a last_update_time value != 0, when becoming fair one day.

Right, see 2 below. I need to write a bunch of comments explaining PELT
proper, as well as document these things.

The things we ran into with these patches were that:

 1) You need to update the cfs_rq _before_ any entity attach/detach
    (and might need to update_tg_load_avg when update_cfs_rq_load_avg()
    returns true).

 2) (fair) entities are always attached, switched_from/to deal with !fair.

 3) cpu migration is the only exception and uses the last_update_time=0
    thing -- because refusal to take second rq->lock.

Which is why I dislike Yuyang's patches, they create more exceptions
instead of applying existing rules (albeit undocumented).

Esp. 1 is important, because while for mathematically consistency you
don't actually need to do this, you only need the entities to be
up-to-date with the cfs rq when you attach/detach, but that forgets the
temporal aspect of _when_ you do this.
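
As an illustration of 3), the PELT-relevant part of the cross-CPU migration
path looks roughly like the sketch below. It is trimmed and paraphrased from
the migrate_task_rq_fair() of this era, using only helpers already named in
this thread, so treat the exact shape as an approximation:

static void migrate_task_rq_fair(struct task_struct *p)
{
	/*
	 * We refuse to take the old rq->lock, so we cannot read its clock;
	 * age the entity up to the last cfs_rq time we do know and fold its
	 * contribution into the old cfs_rq's removed_* sums.
	 */
	remove_entity_load_avg(&p->se);

	/* Tell the destination cfs_rq the entity is already up to date. */
	p->se.avg.last_update_time = 0;
}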

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-20  9:23           ` Vincent Guittot
  2016-06-20  9:52             ` Peter Zijlstra
@ 2016-06-21 11:43             ` Peter Zijlstra
  2016-06-21 12:36               ` Vincent Guittot
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-21 11:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Mon, Jun 20, 2016 at 11:23:39AM +0200, Vincent Guittot wrote:

> Don't we have to do a complete attach with attach_task_cfs_rq instead
> of just the load_avg ? to set also depth ?

Hmm, yes, your sched_set_group() change seems to have munged this.

Previously we'd call task_move_group_fair() which would indeed setup
depth.

I've changed it thus (all smaller edits just didn't look right):


--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7743,7 +7743,7 @@ void sched_offline_group(struct task_gro
  * group pointers. The task will be attached to the runqueue during its wake
  * up.
  */
-static void sched_set_group(struct task_struct *tsk, bool move)
+static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
 
@@ -7758,8 +7758,8 @@ static void sched_set_group(struct task_
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (move && tsk->sched_class->task_move_group)
-		tsk->sched_class->task_move_group(tsk);
+	if (tsk->sched_class->task_change_group)
+		tsk->sched_class->task_change_group(tsk, type);
 	else
 #endif
 		set_task_rq(tsk, task_cpu(tsk));
@@ -7788,7 +7788,7 @@ void sched_move_task(struct task_struct
 	if (unlikely(running))
 		put_prev_task(rq, tsk);
 
-	sched_set_group(tsk, true);
+	sched_change_group(tsk, TASK_MOVE_GROUP);
 
 	if (unlikely(running))
 		tsk->sched_class->set_curr_task(rq);
@@ -8227,7 +8227,7 @@ static void cpu_cgroup_fork(struct task_
 
 	rq = task_rq_lock(task, &rf);
 
-	sched_set_group(task, false);
+	sched_change_group(task, TASK_SET_GROUP);
 
 	task_rq_unlock(rq, task, &rf);
 }
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8443,6 +8443,14 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+static void task_set_group_fair(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	set_task_rq(p, task_cpu(p));
+	se->depth = se->parent ? se->parent->depth + 1 : 0;
+}
+
 static void task_move_group_fair(struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
@@ -8455,6 +8463,19 @@ static void task_move_group_fair(struct
 	attach_task_cfs_rq(p);
 }
 
+static void task_change_group_fair(struct task_struct *p, int type)
+{
+	switch (type) {
+	case TASK_SET_GROUP:
+		task_set_group_fair(p);
+		break;
+
+	case TASK_MOVE_GROUP:
+		task_move_group_fair(p);
+		break;
+	}
+}
+
 void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
@@ -8683,7 +8704,7 @@ const struct sched_class fair_sched_clas
 	.update_curr		= update_curr_fair,
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	.task_move_group	= task_move_group_fair,
+	.task_change_group	= task_change_group_fair,
 #endif
 };
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1247,7 +1247,10 @@ struct sched_class {
 	void (*update_curr) (struct rq *rq);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	void (*task_move_group) (struct task_struct *p);
+#define TASK_SET_GROUP  0
+#define TASK_MOVE_GROUP	1
+
+	void (*task_change_group) (struct task_struct *p, int type);
 #endif
 };
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21 11:43             ` Peter Zijlstra
@ 2016-06-21 12:36               ` Vincent Guittot
  2016-06-21 12:47                 ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Vincent Guittot @ 2016-06-21 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 21 June 2016 at 13:43, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 20, 2016 at 11:23:39AM +0200, Vincent Guittot wrote:
>
>> Don't we have to do a complete attach with attach_task_cfs_rq instead
>> of just the load_avg ? to set also depth ?
>
> Hmm, yes, your sched_set_group() change seems to have munged this.
>

I think that it was done by attach_task_cfs_rq() during activate_task().
Now the attach is done in post_init_entity_util_avg(). Can't we just
set the depth in post_init_entity_util_avg()?

> Previously we'd call task_move_group_fair() which would indeed setup
> depth.
>
> I've changed it thus (all smaller edits just didn't look right):
>
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7743,7 +7743,7 @@ void sched_offline_group(struct task_gro
>   * group pointers. The task will be attached to the runqueue during its wake
>   * up.
>   */
> -static void sched_set_group(struct task_struct *tsk, bool move)
> +static void sched_change_group(struct task_struct *tsk, int type)
>  {
>         struct task_group *tg;
>
> @@ -7758,8 +7758,8 @@ static void sched_set_group(struct task_
>         tsk->sched_task_group = tg;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -       if (move && tsk->sched_class->task_move_group)
> -               tsk->sched_class->task_move_group(tsk);
> +       if (tsk->sched_class->task_change_group)
> +               tsk->sched_class->task_change_group(tsk, type);
>         else
>  #endif
>                 set_task_rq(tsk, task_cpu(tsk));
> @@ -7788,7 +7788,7 @@ void sched_move_task(struct task_struct
>         if (unlikely(running))
>                 put_prev_task(rq, tsk);
>
> -       sched_set_group(tsk, true);
> +       sched_change_group(tsk, TASK_MOVE_GROUP);
>
>         if (unlikely(running))
>                 tsk->sched_class->set_curr_task(rq);
> @@ -8227,7 +8227,7 @@ static void cpu_cgroup_fork(struct task_
>
>         rq = task_rq_lock(task, &rf);
>
> -       sched_set_group(task, false);
> +       sched_change_group(task, TASK_SET_GROUP);
>
>         task_rq_unlock(rq, task, &rf);
>  }
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8443,6 +8443,14 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +static void task_set_group_fair(struct task_struct *p)
> +{
> +       struct sched_entity *se = &p->se;
> +
> +       set_task_rq(p, task_cpu(p));
> +       se->depth = se->parent ? se->parent->depth + 1 : 0;
> +}
> +
>  static void task_move_group_fair(struct task_struct *p)
>  {
>         detach_task_cfs_rq(p);
> @@ -8455,6 +8463,19 @@ static void task_move_group_fair(struct
>         attach_task_cfs_rq(p);
>  }
>
> +static void task_change_group_fair(struct task_struct *p, int type)
> +{
> +       switch (type) {
> +       case TASK_SET_GROUP:
> +               task_set_group_fair(p);
> +               break;
> +
> +       case TASK_MOVE_GROUP:
> +               task_move_group_fair(p);
> +               break;
> +       }
> +}
> +
>  void free_fair_sched_group(struct task_group *tg)
>  {
>         int i;
> @@ -8683,7 +8704,7 @@ const struct sched_class fair_sched_clas
>         .update_curr            = update_curr_fair,
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -       .task_move_group        = task_move_group_fair,
> +       .task_change_group      = task_change_group_fair,
>  #endif
>  };
>
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1247,7 +1247,10 @@ struct sched_class {
>         void (*update_curr) (struct rq *rq);
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -       void (*task_move_group) (struct task_struct *p);
> +#define TASK_SET_GROUP  0
> +#define TASK_MOVE_GROUP        1
> +
> +       void (*task_change_group) (struct task_struct *p, int type);
>  #endif
>  };
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21 12:36               ` Vincent Guittot
@ 2016-06-21 12:47                 ` Peter Zijlstra
  2016-06-21 12:56                   ` Vincent Guittot
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-21 12:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On Tue, Jun 21, 2016 at 02:36:46PM +0200, Vincent Guittot wrote:
> On 21 June 2016 at 13:43, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Jun 20, 2016 at 11:23:39AM +0200, Vincent Guittot wrote:
> >
> >> Don't we have to do a complete attach with attach_task_cfs_rq instead
> >> of just the load_avg ? to set also depth ?
> >
> > Hmm, yes, your sched_set_group() change seems to have munged this.
> >
> 
> I think that it was done by the attach_task_cfs_rq during the activate_task.
> Now, the attach is done in post_init_entity_util_avg. Can't we just
> set the depth in post_init_entity_util_avg ?

Yes, I actually had that patch for a little while.

But since it's cgroup specific, I felt it should be in the cgroup code,
hence the current patch.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21 12:47                 ` Peter Zijlstra
@ 2016-06-21 12:56                   ` Vincent Guittot
  0 siblings, 0 replies; 39+ messages in thread
From: Vincent Guittot @ 2016-06-21 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming

On 21 June 2016 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 21, 2016 at 02:36:46PM +0200, Vincent Guittot wrote:
>> On 21 June 2016 at 13:43, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Jun 20, 2016 at 11:23:39AM +0200, Vincent Guittot wrote:
>> >
>> >> Don't we have to do a complete attach with attach_task_cfs_rq instead
>> >> of just the load_avg ? to set also depth ?
>> >
>> > Hmm, yes, your sched_set_group() change seems to have munged this.
>> >
>>
>> I think that it was done by the attach_task_cfs_rq during the activate_task.
>> Now, the attach is done in post_init_entity_util_avg. Can't we just
>> set the depth in post_init_entity_util_avg ?
>
> Yes, I actually had that patch for a little while.
>
> But since its cgroup specific, I felt it should be in the cgroup code,
> hence the current patch.

that's a fair point

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21  8:41                 ` Peter Zijlstra
  2016-06-21  4:51                   ` Yuyang Du
@ 2016-06-21 13:17                   ` Peter Zijlstra
  2016-06-21 13:29                     ` Vincent Guittot
  2016-06-23 15:35                   ` Dietmar Eggemann
  2 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-21 13:17 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On Tue, Jun 21, 2016 at 10:41:19AM +0200, Peter Zijlstra wrote:
> On Mon, Jun 20, 2016 at 03:49:34PM +0100, Dietmar Eggemann wrote:
> > On 20/06/16 13:35, Vincent Guittot wrote:
> 
> > > It will go through wake_up_new_task and post_init_entity_util_avg
> > > during its fork which is enough to set last_update_time. Then, it will
> > > use the switched_to_fair if the task becomes a fair one
> > 
> > Oh I see. We want to make sure that every task (even when forked as
> > !fair) has a last_update_time value != 0, when becoming fair one day.
> 
> Right, see 2 below. I need to write a bunch of comments explaining PELT
> proper, as well as document these things.
> 
> The things we ran into with these patches were that:
> 
>  1) You need to update the cfs_rq _before_ any entity attach/detach
>     (and might need to update_tg_load_avg when update_cfs_rq_load_avg()
>     returns true).
> 
>  2) (fair) entities are always attached, switched_from/to deal with !fair.
> 
>  3) cpu migration is the only exception and uses the last_update_time=0
>     thing -- because refusal to take second rq->lock.
> 
> Which is why I dislike Yuyang's patches, they create more exceptions
> instead of applying existing rules (albeit undocumented).
> 
> Esp. 1 is important, because while for mathematically consistency you
> don't actually need to do this, you only need the entities to be
> up-to-date with the cfs rq when you attach/detach, but that forgets the
> temporal aspect of _when_ you do this.

I have the below for now, I'll continue poking at this for a bit.


--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -692,6 +692,7 @@ void init_entity_runnable_average(struct
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
+static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 /*
@@ -757,7 +758,8 @@ void post_init_entity_util_avg(struct sc
 		}
 	}
 
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	if (update_cfs_rq_load_avg(now, cfs_rq, false))
+		update_tg_load_avg(cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 }
 
@@ -2919,7 +2921,21 @@ static inline void cfs_rq_util_change(st
 	WRITE_ONCE(*ptr, res);					\
 } while (0)
 
-/* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
+/**
+ * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
+ * @now: current time, as per cfs_rq_clock_task()
+ * @cfs_rq: cfs_rq to update
+ * @update_freq: should we call cfs_rq_util_change() or will the call do so
+ *
+ * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
+ * avg. The immediate corollary is that all (fair) tasks must be attached, see
+ * post_init_entity_util_avg().
+ *
+ * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
+ *
+ * Returns true if the load decayed or we removed utilization. It is expected
+ * that one calls update_tg_load_avg() on this condition.
+ */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
@@ -2974,6 +2990,14 @@ static inline void update_load_avg(struc
 		update_tg_load_avg(cfs_rq, 0);
 }
 
+/**
+ * attach_entity_load_avg - attach this entity to its cfs_rq load avg
+ * @cfs_rq: cfs_rq to attach to
+ * @se: sched_entity to attach
+ *
+ * Must call update_cfs_rq_load_avg() before this, since we rely on
+ * cfs_rq->avg.last_update_time being current.
+ */
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (!sched_feat(ATTACH_AGE_LOAD))
@@ -3005,6 +3029,14 @@ static void attach_entity_load_avg(struc
 	cfs_rq_util_change(cfs_rq);
 }
 
+/**
+ * detach_entity_load_avg - detach this entity from its cfs_rq load avg
+ * @cfs_rq: cfs_rq to detach from
+ * @se: sched_entity to detach
+ *
+ * Must call update_cfs_rq_load_avg() before this, since we rely on
+ * cfs_rq->avg.last_update_time being current.
+ */
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
@@ -8392,7 +8424,8 @@ static void detach_task_cfs_rq(struct ta
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	if (update_cfs_rq_load_avg(now, cfs_rq, false))
+		update_tg_load_avg(cfs_rq, false);
 	detach_entity_load_avg(cfs_rq, se);
 }
 
@@ -8411,7 +8444,8 @@ static void attach_task_cfs_rq(struct ta
 #endif
 
 	/* Synchronize task with its cfs_rq */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	if (update_cfs_rq_load_avg(now, cfs_rq, false))
+		update_tg_load_avg(cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 
 	if (!vruntime_normalized(p))

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21 13:17                   ` Peter Zijlstra
@ 2016-06-21 13:29                     ` Vincent Guittot
  2016-06-22 11:46                       ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Vincent Guittot @ 2016-06-21 13:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On 21 June 2016 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 21, 2016 at 10:41:19AM +0200, Peter Zijlstra wrote:
>> On Mon, Jun 20, 2016 at 03:49:34PM +0100, Dietmar Eggemann wrote:
>> > On 20/06/16 13:35, Vincent Guittot wrote:
>>
>> > > It will go through wake_up_new_task and post_init_entity_util_avg
>> > > during its fork which is enough to set last_update_time. Then, it will
>> > > use the switched_to_fair if the task becomes a fair one
>> >
>> > Oh I see. We want to make sure that every task (even when forked as
>> > !fair) has a last_update_time value != 0, when becoming fair one day.
>>
>> Right, see 2 below. I need to write a bunch of comments explaining PELT
>> proper, as well as document these things.
>>
>> The things we ran into with these patches were that:
>>
>>  1) You need to update the cfs_rq _before_ any entity attach/detach
>>     (and might need to update_tg_load_avg when update_cfs_rq_load_avg()
>>     returns true).
>>
>>  2) (fair) entities are always attached, switched_from/to deal with !fair.
>>
>>  3) cpu migration is the only exception and uses the last_update_time=0
>>     thing -- because refusal to take second rq->lock.
>>
>> Which is why I dislike Yuyang's patches, they create more exceptions
>> instead of applying existing rules (albeit undocumented).
>>
>> Esp. 1 is important, because while for mathematically consistency you
>> don't actually need to do this, you only need the entities to be
>> up-to-date with the cfs rq when you attach/detach, but that forgets the
>> temporal aspect of _when_ you do this.
>
> I have the below for now, I'll continue poking at this for a bit.
>
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct
>
>  static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>  static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
> +static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
>  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
>
>  /*
> @@ -757,7 +758,8 @@ void post_init_entity_util_avg(struct sc
>                 }
>         }
>
> -       update_cfs_rq_load_avg(now, cfs_rq, false);
> +       if (update_cfs_rq_load_avg(now, cfs_rq, false))
> +               update_tg_load_avg(cfs_rq, false);

You should move update_tg_load_avg after attach_entity_load_avg to
take into account the newly attached task

>         attach_entity_load_avg(cfs_rq, se);
>  }
>
> @@ -2919,7 +2921,21 @@ static inline void cfs_rq_util_change(st
>         WRITE_ONCE(*ptr, res);                                  \
>  } while (0)
>
> -/* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
> +/**
> + * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
> + * @now: current time, as per cfs_rq_clock_task()
> + * @cfs_rq: cfs_rq to update
> + * @update_freq: should we call cfs_rq_util_change() or will the call do so
> + *
> + * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
> + * avg. The immediate corollary is that all (fair) tasks must be attached, see
> + * post_init_entity_util_avg().
> + *
> + * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
> + *
> + * Returns true if the load decayed or we removed utilization. It is expected
> + * that one calls update_tg_load_avg() on this condition.
> + */
>  static inline int
>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
>  {
> @@ -2974,6 +2990,14 @@ static inline void update_load_avg(struc
>                 update_tg_load_avg(cfs_rq, 0);
>  }
>
> +/**
> + * attach_entity_load_avg - attach this entity to its cfs_rq load avg
> + * @cfs_rq: cfs_rq to attach to
> + * @se: sched_entity to attach
> + *
> + * Must call update_cfs_rq_load_avg() before this, since we rely on
> + * cfs_rq->avg.last_update_time being current.
> + */
>  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         if (!sched_feat(ATTACH_AGE_LOAD))
> @@ -3005,6 +3029,14 @@ static void attach_entity_load_avg(struc
>         cfs_rq_util_change(cfs_rq);
>  }
>
> +/**
> + * detach_entity_load_avg - detach this entity from its cfs_rq load avg
> + * @cfs_rq: cfs_rq to detach from
> + * @se: sched_entity to detach
> + *
> + * Must call update_cfs_rq_load_avg() before this, since we rely on
> + * cfs_rq->avg.last_update_time being current.
> + */
>  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
> @@ -8392,7 +8424,8 @@ static void detach_task_cfs_rq(struct ta
>         }
>
>         /* Catch up with the cfs_rq and remove our load when we leave */
> -       update_cfs_rq_load_avg(now, cfs_rq, false);
> +       if (update_cfs_rq_load_avg(now, cfs_rq, false))
> +               update_tg_load_avg(cfs_rq, false);

same as post_init_entity_util_avg. You should put it after the
detach_entity_load_avg

>         detach_entity_load_avg(cfs_rq, se);
>  }
>
> @@ -8411,7 +8444,8 @@ static void attach_task_cfs_rq(struct ta
>  #endif
>
>         /* Synchronize task with its cfs_rq */
> -       update_cfs_rq_load_avg(now, cfs_rq, false);
> +       if (update_cfs_rq_load_avg(now, cfs_rq, false))
> +               update_tg_load_avg(cfs_rq, false);

same as post_init_entity_util_avg

>         attach_entity_load_avg(cfs_rq, se);
>
>         if (!vruntime_normalized(p))

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21 13:29                     ` Vincent Guittot
@ 2016-06-22 11:46                       ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-22 11:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On Tue, Jun 21, 2016 at 03:29:49PM +0200, Vincent Guittot wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct
> >
> >  static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> >  static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
> > +static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
> >  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
> >
> >  /*
> > @@ -757,7 +758,8 @@ void post_init_entity_util_avg(struct sc
> >                 }
> >         }
> >
> > -       update_cfs_rq_load_avg(now, cfs_rq, false);
> > +       if (update_cfs_rq_load_avg(now, cfs_rq, false))
> > +               update_tg_load_avg(cfs_rq, false);
> 
> You should move update_tg_load_avg after attach_entity_load_avg to
> take into account the newly attached task

Right you are, I've also updated the comment to reflect this.
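
With that correction folded in, the ordering at an attach site comes out
roughly as in the sketch below. attach_and_propagate() is a made-up wrapper
name used purely for illustration; the three callees are the ones from the
patch above, and whether the final code keeps the tg update conditional is a
detail not settled in this exchange:

static void attach_and_propagate(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 now = cfs_rq_clock_task(cfs_rq);

	/* bring the cfs_rq itself up to date first ... */
	update_cfs_rq_load_avg(now, cfs_rq, false);

	/* ... then attach the entity to the now-current cfs_rq ... */
	attach_entity_load_avg(cfs_rq, se);

	/* ... and only then update the tg, so the new entity is included */
	update_tg_load_avg(cfs_rq, false);
}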

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 12:01 ` [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks Peter Zijlstra
  2016-06-17 14:09   ` Vincent Guittot
@ 2016-06-23 11:19   ` Peter Zijlstra
  2016-08-01  7:30   ` Wanpeng Li
  2 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-23 11:19 UTC (permalink / raw)
  To: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot

On Fri, Jun 17, 2016 at 02:01:40PM +0200, Peter Zijlstra wrote:
> @@ -8219,6 +8254,19 @@ static int cpu_cgroup_can_attach(struct
>  		if (task->sched_class != &fair_sched_class)
>  			return -EINVAL;
>  #endif
> +		/*
> +		 * Serialize against wake_up_new_task() such
> +		 * that if its running, we're sure to observe
> +		 * its full state.
> +		 */
> +		raw_spin_unlock_wait(&task->pi_lock);
> +		/*
> +		 * Avoid calling sched_move_task() before wake_up_new_task()
> +		 * has happened. This would lead to problems with PELT. See
> +		 * XXX.
> +		 */
> +		if (task->state == TASK_NEW)
> +			return -EINVAL;
>  	}
>  	return 0;
>  }

So I think that's backwards; we want:

		if (task->state == TASK_NEW)
			return -EINVAL;
		raw_spin_unlock_wait(&task->pi_lock);

Because failing the attach is 'safe', but if we do not observe NEW we
must be absolutely sure to observe the full wakeup_new.

But since it's not critical I'll change it to more obvious code:

		raw_spin_lock_irq(&task->pi_lock);
		if (task->state == TASK_NEW)
			ret = -EINVAL;
		raw_spin_unlock_irq(&task->pi_lock);
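
For context, the check might sit inside cpu_cgroup_can_attach() roughly as
sketched below once 'ret' is threaded through the per-task loop. The loop
skeleton and the RT branch are reconstructed for illustration and are an
assumption about the surrounding code; only the inner TASK_NEW check is taken
from the hunk quoted above:

static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
{
	struct cgroup_subsys_state *css;
	struct task_struct *task;
	int ret = 0;

	cgroup_taskset_for_each(task, css, tset) {
#ifdef CONFIG_RT_GROUP_SCHED
		if (!sched_rt_can_attach(css_tg(css), task))
			return -EINVAL;
#else
		/* We don't support RT-tasks being in separate groups */
		if (task->sched_class != &fair_sched_class)
			return -EINVAL;
#endif
		/*
		 * Serialize against wake_up_new_task() such that if the task
		 * is running, we are sure to observe its full state.
		 */
		raw_spin_lock_irq(&task->pi_lock);
		/*
		 * Refuse the move until wake_up_new_task() has attached the
		 * task's load; failing the attach is the safe direction.
		 */
		if (task->state == TASK_NEW)
			ret = -EINVAL;
		raw_spin_unlock_irq(&task->pi_lock);

		if (ret)
			break;
	}

	return ret;
}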

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21  8:41                 ` Peter Zijlstra
  2016-06-21  4:51                   ` Yuyang Du
  2016-06-21 13:17                   ` Peter Zijlstra
@ 2016-06-23 15:35                   ` Dietmar Eggemann
  2016-06-23 17:15                     ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Dietmar Eggemann @ 2016-06-23 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On 21/06/16 09:41, Peter Zijlstra wrote:
> On Mon, Jun 20, 2016 at 03:49:34PM +0100, Dietmar Eggemann wrote:
>> On 20/06/16 13:35, Vincent Guittot wrote:
> 
>>> It will go through wake_up_new_task and post_init_entity_util_avg
>>> during its fork which is enough to set last_update_time. Then, it will
>>> use the switched_to_fair if the task becomes a fair one
>>
>> Oh I see. We want to make sure that every task (even when forked as
>> !fair) has a last_update_time value != 0, when becoming fair one day.
> 
> Right, see 2 below. I need to write a bunch of comments explaining PELT
> proper, as well as document these things.
> 
> The things we ran into with these patches were that:
> 
>  1) You need to update the cfs_rq _before_ any entity attach/detach
>     (and might need to update_tg_load_avg when update_cfs_rq_load_avg()
>     returns true).
> 
>  2) (fair) entities are always attached, switched_from/to deal with !fair.
> 
>  3) cpu migration is the only exception and uses the last_update_time=0
>     thing -- because refusal to take second rq->lock.

2) is about changing sched classes, 3) is about changing cpus, but what
about 4) changing task groups?

There is still this last_update_time = 0 between
detach_task_cfs_rq()/set_task_rq() and attach_task_cfs_rq() in
task_move_group_fair(), preventing the __update_load_avg(...,
&p->se.avg, ...) call in attach_task_cfs_rq() -> attach_entity_load_avg().

Shouldn't be necessary any more since cfs_rq 'next' is up-to-date now.

Assuming here that the exception in 3) relates to the fact that the
rq->lock is not taken.

Or is 4) a second exception in the sense that the se has been aged in
remove_entity_load_avg() (3)) resp. detach_entity_load_avg() (4))?

> Which is why I dislike Yuyang's patches, they create more exceptions
> instead of applying existing rules (albeit undocumented).
> 
> Esp. 1 is important, because while for mathematically consistency you
> don't actually need to do this, you only need the entities to be
> up-to-date with the cfs rq when you attach/detach, but that forgets the
> temporal aspect of _when_ you do this.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-23 15:35                   ` Dietmar Eggemann
@ 2016-06-23 17:15                     ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-23 17:15 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Yuyang Du, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On Thu, Jun 23, 2016 at 04:35:54PM +0100, Dietmar Eggemann wrote:
> > The things we ran into with these patches were that:
> > 
> >  1) You need to update the cfs_rq _before_ any entity attach/detach
> >     (and might need to update_tg_load_avg when update_cfs_rq_load_avg()
> >     returns true).
> > 
> >  2) (fair) entities are always attached, switched_from/to deal with !fair.
> > 
> >  3) cpu migration is the only exception and uses the last_update_time=0
> >     thing -- because refusal to take second rq->lock.
> 
> 2) is about changing sched classes, 3) is about changing cpus but what
> about 4) changing task groups?
> 
> There is still this last_update_time = 0 between
> detach_task_cfs_rq()/set_task_rq() and attach_task_cfs_rq() in
> task_move_group_fair() preventing the call __update_load_avg(...
> p->se->avg, ...) in attach_task_cfs_rq() -> attach_entity_load_avg().
> 
> Shouldn't be necessary any more since cfs_rq 'next' is up-to-date now.
> 
> Assuming here that the exception in 3) relates to the fact that the
> rq->lock is not taken.
> 
> Or is 4) a second exception in the sense that the se has been aged in
> remove_entity_load_avg() (3)) resp. detach_entity_load_avg() (4))?

Ah, good point, so a move between groups is also special in that the time
between cgroups doesn't need to match.

And since we've aged the se to 'now' on detach, we can assume its
up-to-date (and hence last_update_time=0) for attach, which then only
updates its cfs_rq to 'now' before attaching the se.
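
Putting those pieces together, the group-move path in this series reads
roughly as follows. The earlier diff hunks elide the middle of the function,
so the body here is paraphrased from the fair.c of this era and should be
read as a sketch rather than the literal patch:

static void task_move_group_fair(struct task_struct *p)
{
	/* age the entity to 'now' and take it out of the old cfs_rq */
	detach_task_cfs_rq(p);

	/* switch se->cfs_rq (and se->parent) over to the new group */
	set_task_rq(p, task_cpu(p));

#ifdef CONFIG_SMP
	/*
	 * The entity is already up to date; the new group's timeline is
	 * discontinuous with the old one, so don't age across the gap.
	 */
	p->se.avg.last_update_time = 0;
#endif
	/* update the new cfs_rq to 'now', then attach the entity to it */
	attach_task_cfs_rq(p);
}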

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-21  4:51                   ` Yuyang Du
@ 2016-06-24 13:03                     ` Peter Zijlstra
  2016-08-09 23:19                       ` Yuyang Du
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-06-24 13:03 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Dietmar Eggemann, Vincent Guittot, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

Sorry, I only spotted your reply yesterday.

On Tue, Jun 21, 2016 at 12:51:39PM +0800, Yuyang Du wrote:
> On Tue, Jun 21, 2016 at 10:41:19AM +0200, Peter Zijlstra wrote:

> > The things we ran into with these patches were that:
> > 
> >  1) You need to update the cfs_rq _before_ any entity attach/detach
> >     (and might need to update_tg_load_avg when update_cfs_rq_load_avg()
> >     returns true).
> 
> This is intrinsically an additional update, not a fix to anything. I
> don't think it is a must, but I am fine with it.

> > Esp. 1 is important, because while for mathematically consistency you
> > don't actually need to do this, you only need the entities to be
> > up-to-date with the cfs rq when you attach/detach, but that forgets the
> > temporal aspect of _when_ you do this.
>  
> Yes, temporally at any instant the avgs are outdated. But, I can have it,
> and what if I have it?

So I see your point; but there's a big difference between 'instant' and
10ms (HZ=100). So by aging the cfs_rq to the instant we fix two issues:

 - that it can be up to 10ms stale
 - that it can be 'uninitialized' at all

It also makes the code consistent: all other sites do this too.

> >  3) cpu migration is the only exception and uses the last_update_time=0
> >     thing -- because refusal to take second rq->lock.
> 
> Task's last_update_time means this task is detached from fair queue. This
> (re)definition is by all means much better than migrating. No?

I would maybe redefine it as an up-to-date marker for a migration across
a clock discontinuity. Both CPU migration and group movement suffer from
this, albeit for different reasons.

In the CPU migration case we simply cannot tell time by our refusal to
acquire the old rq lock. So we age to the last time we 'know' and then
mark it up-to-date.

For the cgroup move the timelines simply _are_ discontinuous. So we have
to mark it up-to-date after we update it to the instant of detach, such
that when we attach it to the new group we don't try to age it across
the time difference.

> > Which is why I dislike Yuyang's patches, they create more exceptions
> > instead of applying existing rules (albeit undocumented).
> > 
> 
> I am thinking about document this really well, like "An art of load tracking:
> accuracy, overhead, and usefulness", seriously.

Any attempt to document all this would be greatly appreciated, although
I would like it to be in comments in the fair.c file itself if possible.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-17 12:01 ` [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks Peter Zijlstra
  2016-06-17 14:09   ` Vincent Guittot
  2016-06-23 11:19   ` Peter Zijlstra
@ 2016-08-01  7:30   ` Wanpeng Li
  2016-08-01  9:31     ` Mike Galbraith
  2 siblings, 1 reply; 39+ messages in thread
From: Wanpeng Li @ 2016-08-01  7:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot

2016-06-17 20:01 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> Vincent and Yuyang found another few scenarios in which entity
> tracking goes wobbly.
>
> The scenarios are basically due to the fact that new tasks are not
> immediately attached and thereby differ from the normal situation -- a
> task is always attached to a cfs_rq load average (such that it
> includes its blocked contribution) and are explicitly
> detached/attached on migration to another cfs_rq.
>
> Scenario 1: switch to fair class
>
>   p->sched_class = fair_class;
>   if (queued)
>     enqueue_task(p);
>       ...
>         enqueue_entity()
>           enqueue_entity_load_avg()
>             migrated = !sa->last_update_time (true)
>             if (migrated)
>               attach_entity_load_avg()
>   check_class_changed()
>     switched_from() (!fair)
>     switched_to()   (fair)
>       switched_to_fair()
>         attach_entity_load_avg()
>
> If @p is a new task that hasn't been fair before, it will have
> !last_update_time and, per the above, end up in
> attach_entity_load_avg() _twice_.
>
> Scenario 2: change between cgroups
>
>   sched_move_group(p)
>     if (queued)
>       dequeue_task()
>     task_move_group_fair()
>       detach_task_cfs_rq()
>         detach_entity_load_avg()
>       set_task_rq()
>       attach_task_cfs_rq()
>         attach_entity_load_avg()
>     if (queued)
>       enqueue_task();
>         ...
>           enqueue_entity()
>             enqueue_entity_load_avg()
>               migrated = !sa->last_update_time (true)
>               if (migrated)
>                 attach_entity_load_avg()
>
> Similar as with scenario 1, if @p is a new task, it will have
> !load_update_time and we'll end up in attach_entity_load_avg()
> _twice_.
>
> Furthermore, notice how we do a detach_entity_load_avg() on something
> that wasn't attached to begin with.
>
> As stated above; the problem is that the new task isn't yet attached
> to the load tracking and thereby violates the invariant assumption.
>
> This patch remedies this by ensuring a new task is indeed properly
> attached to the load tracking on creation, through
> post_init_entity_util_avg().
>
> Of course, this isn't entirely as straight forward as one might think,
> since the task is hashed before we call wake_up_new_task() and thus

What's the meaning of "the task is hashed before we call wake_up_new_task()"?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-08-01  7:30   ` Wanpeng Li
@ 2016-08-01  9:31     ` Mike Galbraith
  2016-08-01  9:56       ` Wanpeng Li
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2016-08-01  9:31 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra
  Cc: Yuyang Du, Ingo Molnar, linux-kernel, Benjamin Segall,
	Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming,
	Vincent Guittot

On Mon, 2016-08-01 at 15:30 +0800, Wanpeng Li wrote:

> What's the meaning of "the task is hashed before we call
> wake_up_new_task()"?

See fork.c::copy_process()

        /*
         * Make it visible to the rest of the system, but dont wake it up yet.
         * Need tasklist lock for parent etc handling!
         */
        write_lock_irq(&tasklist_lock);
        ....

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-08-01  9:31     ` Mike Galbraith
@ 2016-08-01  9:56       ` Wanpeng Li
  2016-08-01 11:52         ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Wanpeng Li @ 2016-08-01  9:56 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Yuyang Du, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot

2016-08-01 17:31 GMT+08:00 Mike Galbraith <umgwanakikbuti@gmail.com>:
> On Mon, 2016-08-01 at 15:30 +0800, Wanpeng Li wrote:
>
>> What's the meaning of "the task is hashed before we call
>> wake_up_new_task()"?
>
> See fork.c::copy_process()
>
>         /*
>          * Make it visible to the rest of the system, but dont wake it up yet.
>          * Need tasklist lock for parent etc handling!
>          */
>         write_lock_irq(&tasklist_lock);
>         ....

Thanks Mike, so here "is hashed" means being added to the list rather than real hashing.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-08-01  9:56       ` Wanpeng Li
@ 2016-08-01 11:52         ` Mike Galbraith
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2016-08-01 11:52 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Yuyang Du, Ingo Molnar, linux-kernel,
	Benjamin Segall, Paul Turner, Morten Rasmussen, Dietmar Eggemann,
	Matt Fleming, Vincent Guittot

On Mon, 2016-08-01 at 17:56 +0800, Wanpeng Li wrote:
> 2016-08-01 17:31 GMT+08:00 Mike Galbraith <umgwanakikbuti@gmail.com>:
> > On Mon, 2016-08-01 at 15:30 +0800, Wanpeng Li wrote:
> > 
> > > What's the meaning of "the task is hashed before we call
> > > wake_up_new_task()"?
> > 
> > See fork.c::copy_process()
> > 
> >         /*
> >          * Make it visible to the rest of the system, but dont wake
> > it up yet.
> >          * Need tasklist lock for parent etc handling!
> >          */
> >         write_lock_irq(&tasklist_lock);
> >         ....
> 
> Thanks Mike, so here "is hashed" means that add to list instead of 
> real hashing.

Yeah.

	-Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] sched,fair: Fix PELT integrity for new tasks
  2016-06-24 13:03                     ` Peter Zijlstra
@ 2016-08-09 23:19                       ` Yuyang Du
  0 siblings, 0 replies; 39+ messages in thread
From: Yuyang Du @ 2016-08-09 23:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Vincent Guittot, Ingo Molnar, linux-kernel,
	Mike Galbraith, Benjamin Segall, Paul Turner, Morten Rasmussen,
	Matt Fleming

On Fri, Jun 24, 2016 at 03:03:24PM +0200, Peter Zijlstra wrote:
> > Task's last_update_time means this task is detached from fair queue. This
> > (re)definition is by all means much better than migrating. No?
> 
> I would maybe redefine it as an up-to-date marker for a migration across
> a clock discontinuity. Both CPU migration and group movement suffer from
> this, albeit for different reasons.
> 
> In the CPU migration case we simply cannot tell time by our refusal to
> acquire the old rq lock. So we age to the last time we 'know' and then
> mark it up-to-date.
> 
> For the cgroup move the timelines simply _are_ discontinuous. So we have
> to mark it up-to-date after we update it to the instant of detach, such
> that when we attach it to the new group we don't try to age it across
> the time difference.

Got it. I'm glad to see you worked all of these out.

It's like I am replying to an email from the last century, sorry, but I was
distracted by many other things.

^ permalink raw reply	[flat|nested] 39+ messages in thread
