* [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load
@ 2016-09-12  7:47 Vincent Guittot
  2016-09-12  7:47 ` [PATCH 1/7 v3] sched: factorize attach entity Vincent Guittot
                   ` (6 more replies)
  0 siblings, 7 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

Ensure that the move of a sched_entity will be reflected in load and
utilization of the task_group hierarchy.

When a sched_entity moves between groups or CPUs, the load and utilization
of the cfs_rq don't reflect the change immediately but converge to the new
values over time. As a result, the metrics are no longer aligned with the new
balance of load in the system and subsequent decisions are based on a biased view.

This patchset synchronizes the load/utilization of a sched_entity with its
child cfs_rq (se->my_q) only when tasks move to/from the child cfs_rq:
-moves between task groups
-migrations between CPUs
Otherwise, PELT is updated as usual.
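
Schematically (illustration only), a change at task level is now propagated
up the hierarchy:

  task <-> child cfs_rq (se->my_q) -> group se -> parent cfs_rq -> ... -> root cfs_rq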

Changes since v2:
- Propagate both utilization and load
- Synced sched_entity and se->my_q instead of adding the delta  

Changes since v1:
- This patchset needs the patch that fixes an issue with rq->leaf_cfs_rq_list,
  "sched: fix hierarchical order in rq->leaf_cfs_rq_list", in order to work
  correctly. I haven't sent them as a single patchset because the fix is
  independent of this one
- Merge some functions that are always used together
- During update of blocked load, ensure that the sched_entity is synced
  with the cfs_rq applying changes
- Fix an issue when a task changes its CPU affinity

Vincent Guittot (7):
  sched: factorize attach entity
  sched: fix hierarchical order in rq->leaf_cfs_rq_list
  sched: factorize PELT update
  sched: propagate load during synchronous attach/detach
  sched: propagate asynchronous detach
  sched: fix task group initialization
  sched: fix wrong utilization accounting when switching to fair class

 kernel/sched/core.c  |  21 ++--
 kernel/sched/fair.c  | 319 ++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h |   2 +
 3 files changed, 264 insertions(+), 78 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/7 v3] sched: factorize attach entity
  2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
@ 2016-09-12  7:47 ` Vincent Guittot
  2016-09-12  7:47 ` [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
into a single function, attach_entity_cfs_rq().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6820b3..dfd9c0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -694,9 +694,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 }
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
-static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
-static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
+static void attach_entity_cfs_rq(struct sched_entity *se);
 
 /*
  * With new tasks being created, their initial util_avgs are extrapolated
@@ -761,9 +759,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		}
 	}
 
-	update_cfs_rq_load_avg(now, cfs_rq, false);
-	attach_entity_load_avg(cfs_rq, se);
-	update_tg_load_avg(cfs_rq, false);
+	attach_entity_cfs_rq(se);
 }
 
 #else /* !CONFIG_SMP */
@@ -8498,9 +8494,8 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	update_tg_load_avg(cfs_rq, false);
 }
 
-static void attach_task_cfs_rq(struct task_struct *p)
+static void attach_entity_cfs_rq(struct sched_entity *se)
 {
-	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
 
@@ -8516,6 +8511,14 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	update_cfs_rq_load_avg(now, cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+}
+
+static void attach_task_cfs_rq(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	attach_entity_cfs_rq(se);
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
  2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
  2016-09-12  7:47 ` [PATCH 1/7 v3] sched: factorize attach entity Vincent Guittot
@ 2016-09-12  7:47 ` Vincent Guittot
  2016-09-21 10:14   ` Dietmar Eggemann
  2016-09-12  7:47 ` [PATCH 3/7 v3] sched: factorize PELT update Vincent Guittot
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
a child will always be called before its parent.

The hierarchical order in the shares update list was introduced by
commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list"),
but with the current implementation a child can still be put after its parent.

Let's take the example of

       root
          \
           b
          / \
         c   d*
             |
             e*

with root -> b -> c already enqueued but not d -> e, so the leaf_cfs_rq_list
looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time they are enqueued, starting
with e and then d.

When e is added, its parent is not yet on the list, so e is put at the
tail: head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the list:
head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called last, whereas it
should be called first.

Because we follow the bottom-up enqueue sequence, we are sure to finish by
adding either a cfs_rq without a parent or a cfs_rq whose parent is already
on the list. We can use this event to detect when we have finished adding a
new branch. For the others, whose parents have not been added yet, we have
to ensure that they will be added after their children that have just been
inserted in the previous steps, and after any potential parents that are
already in the list. The easiest way is to put the cfs_rq just after the
last inserted one and to keep track of it until the branch is fully added.
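
With this patch, the same example goes as follows (an illustration of the
intended behaviour):

- e is enqueued; its parent d is not on the list, so e is inserted at
  rq->tmp_alone_branch (initially the list head) and tmp_alone_branch is
  moved to e's entry:
  head -> e -> c -> b -> root -> tail

- d is enqueued; its parent b is already on the list, so d is inserted just
  before b and tmp_alone_branch is reset to the list head:
  head -> e -> c -> d -> b -> root -> tail

Each child is now placed before its parent.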

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 54 +++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  1 +
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 860070f..3e52d08 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7489,6 +7489,7 @@ void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
 		/*
 		 * How much cpu bandwidth does root_task_group get?
 		 *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dfd9c0c..264119a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -292,19 +292,59 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_rq->on_list) {
+		struct rq *rq = rq_of(cfs_rq);
+		int cpu = cpu_of(rq);
 		/*
 		 * Ensure we either appear before our parent (if already
 		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued.  The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * enqueued. The fact that we always enqueue bottom-up
+		 * reduces this to two cases and a special case for the root
+		 * cfs_rq. Furthermore, it also means that we will always reset
+		 * tmp_alone_branch either when the branch is connected
+		 * to a tree or when we reach the beginning of the tree.
 		 */
 		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
+		    cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
+			/*
+			 * If parent is already on the list, we add the child
+			 * just before. Thanks to circular linked property of
+			 * the list, this means to put the child at the tail
+			 * of the list that starts by parent.
+			 */
+			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
+			/*
+			 * The branch is now connected to its tree so we can
+			 * reset tmp_alone_branch to the beginning of the
+			 * list.
+			 */
+			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+		} else if (!cfs_rq->tg->parent) {
+			/*
+			 * cfs rq without parent should be put
+			 * at the tail of the list.
+			 */
 			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+				&rq->leaf_cfs_rq_list);
+			/*
+			 * We have reached the beginning of the tree so we can reset
+			 * tmp_alone_branch to the beginning of the list.
+			 */
+			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+		} else {
+			/*
+			 * The parent has not been added yet so we want to
+			 * make sure that it will be put after us.
+			 * tmp_alone_branch points to the beginning of the
+			 * branch where we will add the parent.
+			 */
+			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				rq->tmp_alone_branch);
+			/*
+			 * Update tmp_alone_branch to point to the new
+			 * beginning of the branch.
+			 */
+			rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
 		}
 
 		cfs_rq->on_list = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 420c05d..483616a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -616,6 +616,7 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
+	struct list_head *tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/7 v3] sched: factorize PELT update
  2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
  2016-09-12  7:47 ` [PATCH 1/7 v3] sched: factorize attach entity Vincent Guittot
  2016-09-12  7:47 ` [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
@ 2016-09-12  7:47 ` Vincent Guittot
  2016-09-15 13:09   ` Peter Zijlstra
  2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

Every time we modify the load/utilization of a sched_entity, we start by
syncing it with its cfs_rq. This update is currently done in different ways:
-when attaching/detaching a sched_entity, we update the cfs_rq and then we
sync the entity with the cfs_rq.
-when enqueueing/dequeuing the sched_entity, we update both the sched_entity
and the cfs_rq metrics to now.

Use update_load_avg() every time we have to update and sync the cfs_rq and
the sched_entity before changing the state of a sched_entity.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 68 ++++++++++++++---------------------------------------
 1 file changed, 17 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 264119a..0aa1d7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3086,7 +3086,8 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 }
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_load_avg(struct sched_entity *se, int update_tg,
+				   int skip_aging)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -3097,7 +3098,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
+	if (se->avg.last_update_time && !skip_aging)
+		__update_load_avg(now, cpu, &se->avg,
 			  se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
@@ -3115,26 +3117,6 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
  */
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (!sched_feat(ATTACH_AGE_LOAD))
-		goto skip_aging;
-
-	/*
-	 * If we got migrated (either between CPUs or between cgroups) we'll
-	 * have aged the average right before clearing @last_update_time.
-	 *
-	 * Or we're fresh through post_init_entity_util_avg().
-	 */
-	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-				  &se->avg, 0, 0, NULL);
-
-		/*
-		 * XXX: we could have just aged the entire load away if we've been
-		 * absent from the fair class for too long.
-		 */
-	}
-
-skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -3154,9 +3136,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
  */
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
@@ -3171,34 +3150,20 @@ static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	int migrated, decayed;
-
-	migrated = !sa->last_update_time;
-	if (!migrated) {
-		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
-			cfs_rq->curr == se, NULL);
-	}
-
-	decayed = update_cfs_rq_load_avg(now, cfs_rq, !migrated);
 
 	cfs_rq->runnable_load_avg += sa->load_avg;
 	cfs_rq->runnable_load_sum += sa->load_sum;
 
-	if (migrated)
+	if (!sa->last_update_time) {
 		attach_entity_load_avg(cfs_rq, se);
-
-	if (decayed || migrated)
 		update_tg_load_avg(cfs_rq, 0);
+	}
 }
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
-
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
 	cfs_rq->runnable_load_sum =
@@ -3272,7 +3237,8 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	return 0;
 }
 
-static inline void update_load_avg(struct sched_entity *se, int not_used)
+static inline void update_load_avg(struct sched_entity *se,
+				   int not_used1, int not_used2)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct rq *rq = rq_of(cfs_rq);
@@ -3420,6 +3386,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
+	update_load_avg(se, 1, 0);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -3494,6 +3461,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_load_avg(se, 1, 0);
 	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
@@ -3572,7 +3540,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, 0);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3674,7 +3642,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_load_avg(prev, 0, 0);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3690,7 +3658,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_load_avg(curr, 1, 0);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4579,7 +4547,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4638,7 +4606,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -8517,7 +8485,6 @@ static void detach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	if (!vruntime_normalized(p)) {
 		/*
@@ -8529,7 +8496,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	}
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	update_load_avg(se, 0, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }
@@ -8537,7 +8504,6 @@ static void detach_task_cfs_rq(struct task_struct *p)
 static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -8548,7 +8514,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 #endif
 
 	/* Synchronize task with its cfs_rq */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	update_load_avg(se, 0, !sched_feat(ATTACH_AGE_LOAD));
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (2 preceding siblings ...)
  2016-09-12  7:47 ` [PATCH 3/7 v3] sched: factorize PELT update Vincent Guittot
@ 2016-09-12  7:47 ` Vincent Guittot
  2016-09-15 12:55   ` Peter Zijlstra
                     ` (4 more replies)
  2016-09-12  7:47 ` [PATCH 5/7 v3] sched: propagate asynchronous detach Vincent Guittot
                   ` (2 subsequent siblings)
  6 siblings, 5 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during the
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of the child cfs_rq
to the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is less than its shares so that it will
contribute the same load as a task of equal weight.
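
As an illustration (the numbers are made up), consider a group with
tg->shares = 1024 whose only member is a task running 50% of the time on one
CPU, so gcfs_rq->avg.load_avg = 512, gcfs_rq->tg_load_avg_contrib = 512 and
tg->load_avg = 512:

  tg_load = 512 + 1 - 512 + 512 = 513
  load    = 512 * 1024 / 513   ~= 1022
  tg_load (513) < shares (1024), so the correction applies:
  load    = 1022 * 513 / 1024  ~= 512

i.e. the group sched_entity contributes roughly the same load as a task of
equal weight would.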

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 170 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   1 +
 2 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0aa1d7d..e4015f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3017,6 +3017,132 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Take into account change of utilization of a child task group */
+static inline void
+update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct cfs_rq *gcfs_rq =  group_cfs_rq(se);
+	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/* Set new sched_entity's utilization */
+	se->avg.util_avg = gcfs_rq->avg.util_avg;
+	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq utilization */
+	cfs_rq->avg.util_avg =  max_t(long, cfs_rq->avg.util_avg + delta, 0);
+	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
+}
+
+/* Take into account change of load of a child task group */
+static inline void
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+	long delta, load = gcfs_rq->avg.load_avg;
+
+	/* If the load of group cfs_rq is null, the load of the
+	 * sched_entity will also be null so we can skip the formula
+	 */
+	if (load) {
+		long tg_load;
+
+		/* Get tg's load and ensure tg_load > 0 */
+		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
+
+		/* Ensure tg_load >= load and updated with current load*/
+		tg_load -= gcfs_rq->tg_load_avg_contrib;
+		tg_load += load;
+
+		/* scale gcfs_rq's load into tg's shares*/
+		load *= scale_load_down(gcfs_rq->tg->shares);
+		load /= tg_load;
+
+		/*
+		 * we need to compute a correction term in the case that the
+		 * task group is consuming <1 cpu so that we would contribute
+		 * the same load as a task of equal weight.
+		*/
+		if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
+			load *= tg_load;
+			load /= scale_load_down(gcfs_rq->tg->shares);
+		}
+	}
+
+	delta = load - se->avg.load_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/* Set new sched_entity's load */
+	se->avg.load_avg = load;
+	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq load */
+	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg + delta, 0);
+	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+}
+
+static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
+{
+	/* set cfs_rq's flag */
+	cfs_rq->propagate_avg = 1;
+}
+
+static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se)
+{
+	/* Get my cfs_rq */
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
+
+	/* Nothing to propagate */
+	if (!cfs_rq->propagate_avg)
+		return 0;
+
+	/* Clear my cfs_rq's flag */
+	cfs_rq->propagate_avg = 0;
+
+	return 1;
+}
+
+/* Update task and its cfs_rq load average */
+static inline int propagate_entity_load_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	if (entity_is_task(se))
+		return 0;
+
+	if (!test_and_clear_tg_cfs_propagate(se))
+		return 0;
+
+	/* Get parent cfs_rq */
+	cfs_rq = cfs_rq_of(se);
+
+	/* Propagate to parent */
+	set_tg_cfs_propagate(cfs_rq);
+
+	/* Update utilization */
+	update_tg_cfs_util(cfs_rq, se);
+
+	/* Update load */
+	update_tg_cfs_load(cfs_rq, se);
+
+	return 1;
+}
+#else
+static inline int propagate_entity_load_avg(struct sched_entity *se)
+{
+	return 0;
+}
+
+static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {}
+#endif
+
 /*
  * Unsigned subtract and clamp on underflow.
  *
@@ -3093,6 +3219,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg,
 	u64 now = cfs_rq_clock_task(cfs_rq);
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
+	int decayed;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -3103,7 +3230,11 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg,
 			  se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
+	decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
+
+	decayed |= propagate_entity_load_avg(se);
+
+	if (decayed && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
 }
 
@@ -3122,6 +3253,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	cfs_rq->avg.load_sum += se->avg.load_sum;
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	set_tg_cfs_propagate(cfs_rq);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3141,6 +3273,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	set_tg_cfs_propagate(cfs_rq);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -8499,6 +8632,22 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	update_load_avg(se, 0, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * Propagate the detach across the tg tree to make it visible to the
+	 * root
+	 */
+	se = se->parent;
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, 1, 0);
+	}
+#endif
 }
 
 static void attach_entity_cfs_rq(struct sched_entity *se)
@@ -8517,6 +8666,22 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 	update_load_avg(se, 0, !sched_feat(ATTACH_AGE_LOAD));
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * Propagate the attach across the tg tree to make it visible to the
+	 * root
+	 */
+	se = se->parent;
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, 1, 0);
+	}
+#endif
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -8578,6 +8743,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
 #ifdef CONFIG_SMP
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	cfs_rq->propagate_avg = 0;
+#endif
 	atomic_long_set(&cfs_rq->removed_load_avg, 0);
 	atomic_long_set(&cfs_rq->removed_util_avg, 0);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 483616a..0517a9e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,7 @@ struct cfs_rq {
 	unsigned long runnable_load_avg;
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
+	unsigned long propagate_avg;
 #endif
 	atomic_long_t removed_load_avg, removed_util_avg;
 #ifndef CONFIG_64BIT
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/7 v3] sched: propagate asynchronous detach
  2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (3 preceding siblings ...)
  2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
@ 2016-09-12  7:47 ` Vincent Guittot
  2016-09-12  7:47 ` [PATCH 6/7 v3] sched: fix task group initialization Vincent Guittot
  2016-09-12  7:47 ` [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class Vincent Guittot
  6 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

A task can be asynchronously detached from a cfs_rq when migrating between
CPUs. The load of the migrated task is then removed from the source cfs_rq
during its next update. We use this event to set the propagation flag.

During load balance, we take advantage of the update of blocked load to
propagate any pending changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4015f6..25533ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3188,6 +3188,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 		sub_positive(&sa->load_avg, r);
 		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
+		set_tg_cfs_propagate(cfs_rq);
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
@@ -3195,6 +3196,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
+		set_tg_cfs_propagate(cfs_rq);
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -6567,6 +6569,10 @@ static void update_blocked_averages(int cpu)
 
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
+
+		/* Propagate pending load changes to the parent */
+		if (cfs_rq->tg->se[cpu])
+			update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 6/7 v3] sched: fix task group initialization
  2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (4 preceding siblings ...)
  2016-09-12  7:47 ` [PATCH 5/7 v3] sched: propagate asynchronous detach Vincent Guittot
@ 2016-09-12  7:47 ` Vincent Guittot
  2016-09-12  7:47 ` [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class Vincent Guittot
  6 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

Task moves are now propagated down to the root and the utilization of the
cfs_rq reflects reality, so it doesn't need to be estimated at init.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25533ba..ad6ca84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8863,7 +8863,7 @@ void online_fair_sched_group(struct task_group *tg)
 		se = tg->se[i];
 
 		raw_spin_lock_irq(&rq->lock);
-		post_init_entity_util_avg(se);
+		attach_entity_cfs_rq(se);
 		sync_throttle(tg, i);
 		raw_spin_unlock_irq(&rq->lock);
 	}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (5 preceding siblings ...)
  2016-09-12  7:47 ` [PATCH 6/7 v3] sched: fix task group initialization Vincent Guittot
@ 2016-09-12  7:47 ` Vincent Guittot
  2016-09-15 13:18   ` Peter Zijlstra
                     ` (2 more replies)
  6 siblings, 3 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-12  7:47 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

When a task switches to the fair scheduling class, the period between now
and the last update of its utilization is accounted as running time,
whatever actually happened during this period. This wrong accounting
applies to the task and also to the task group branch.

When changing the property of a running task like its list of allowed CPUs
or its scheduling class, we follow the sequence:
-dequeue task
-put task
-change the property
-set task as current task
-enqueue task

The end of the sequence doesn't follow the normal order, which is:
-enqueue a task
-then set the task as current task.

This wrong ordering is the root cause of wrong utilization accounting.
Update the sequence to follow the right one:
-dequeue task
-put task
-change the property
-enqueue task
-set task as current task

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e52d08..7a9c9b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1105,10 +1105,10 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 
 	p->sched_class->set_cpus_allowed(p, new_mask);
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 }
 
 /*
@@ -3687,10 +3687,10 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
 	p->prio = prio;
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -4243,8 +4243,6 @@ static int __sched_setscheduler(struct task_struct *p,
 	prev_class = p->sched_class;
 	__setscheduler(rq, p, attr, pi);
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued) {
 		/*
 		 * We enqueue to tail when the priority of a task is
@@ -4255,6 +4253,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 		enqueue_task(rq, p, queue_flags);
 	}
+	if (running)
+		p->sched_class->set_curr_task(rq);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 	preempt_disable(); /* avoid rq from going away on us */
@@ -5417,10 +5417,10 @@ void sched_setnuma(struct task_struct *p, int nid)
 
 	p->numa_preferred_nid = nid;
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -7868,10 +7868,10 @@ void sched_move_task(struct task_struct *tsk)
 
 	sched_change_group(tsk, TASK_MOVE_GROUP);
 
-	if (unlikely(running))
-		tsk->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE);
+	if (unlikely(running))
+		tsk->sched_class->set_curr_task(rq);
 
 	task_rq_unlock(rq, tsk, &rf);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
@ 2016-09-15 12:55   ` Peter Zijlstra
  2016-09-15 13:01     ` Vincent Guittot
  2016-09-15 12:59   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-15 12:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, yuyang.du, Morten.Rasmussen, linaro-kernel,
	dietmar.eggemann, pjt, bsegall

On Mon, Sep 12, 2016 at 09:47:49AM +0200, Vincent Guittot wrote:
> +/* Take into account change of utilization of a child task group */
> +static inline void
> +update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	struct cfs_rq *gcfs_rq =  group_cfs_rq(se);
> +	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
> +
> +	/* Nothing to update */
> +	if (!delta)
> +		return;
> +
> +	/* Set new sched_entity's utilization */
> +	se->avg.util_avg = gcfs_rq->avg.util_avg;
> +	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
> +
> +	/* Update parent cfs_rq utilization */
> +	cfs_rq->avg.util_avg =  max_t(long, cfs_rq->avg.util_avg + delta, 0);

This..

> +	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
> +}
> +
> +/* Take into account change of load of a child task group */
> +static inline void
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> +	long delta, load = gcfs_rq->avg.load_avg;
> +
> +	/* If the load of group cfs_rq is null, the load of the
> +	 * sched_entity will also be null so we can skip the formula
> +	 */
> +	if (load) {
> +		long tg_load;
> +
> +		/* Get tg's load and ensure tg_load > 0 */
> +		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> +
> +		/* Ensure tg_load >= load and updated with current load*/
> +		tg_load -= gcfs_rq->tg_load_avg_contrib;
> +		tg_load += load;
> +
> +		/* scale gcfs_rq's load into tg's shares*/
> +		load *= scale_load_down(gcfs_rq->tg->shares);
> +		load /= tg_load;
> +
> +		/*
> +		 * we need to compute a correction term in the case that the
> +		 * task group is consuming <1 cpu so that we would contribute
> +		 * the same load as a task of equal weight.
> +		*/
> +		if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
> +			load *= tg_load;
> +			load /= scale_load_down(gcfs_rq->tg->shares);
> +		}
> +	}
> +
> +	delta = load - se->avg.load_avg;
> +
> +	/* Nothing to update */
> +	if (!delta)
> +		return;
> +
> +	/* Set new sched_entity's load */
> +	se->avg.load_avg = load;
> +	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
> +
> +	/* Update parent cfs_rq load */
> +	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg + delta, 0);

And this..

> +	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +}

These re-introduce the issue fixed by commit 897418922215 ("sched/fair: Fix
cfs_rq avg tracking underflow").

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
  2016-09-15 12:55   ` Peter Zijlstra
@ 2016-09-15 12:59   ` Peter Zijlstra
  2016-09-15 13:11     ` Vincent Guittot
  2016-09-15 13:11   ` Dietmar Eggemann
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-15 12:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, yuyang.du, Morten.Rasmussen, linaro-kernel,
	dietmar.eggemann, pjt, bsegall

On Mon, Sep 12, 2016 at 09:47:49AM +0200, Vincent Guittot wrote:
> +	/* If the load of group cfs_rq is null, the load of the
> +	 * sched_entity will also be null so we can skip the formula
> +	 */

https://lkml.kernel.org/r/CA+55aFyQYJerovMsSoSKS7PessZBr4vNp-3QUUwhqk4A4_jcbg@mail.gmail.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 12:55   ` Peter Zijlstra
@ 2016-09-15 13:01     ` Vincent Guittot
  0 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-15 13:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 15 September 2016 at 14:55, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 12, 2016 at 09:47:49AM +0200, Vincent Guittot wrote:
>> +/* Take into account change of utilization of a child task group */
>> +static inline void
>> +update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +{
>> +     struct cfs_rq *gcfs_rq =  group_cfs_rq(se);
>> +     long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
>> +
>> +     /* Nothing to update */
>> +     if (!delta)
>> +             return;
>> +
>> +     /* Set new sched_entity's utilization */
>> +     se->avg.util_avg = gcfs_rq->avg.util_avg;
>> +     se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
>> +
>> +     /* Update parent cfs_rq utilization */
>> +     cfs_rq->avg.util_avg =  max_t(long, cfs_rq->avg.util_avg + delta, 0);
>
> This..
>
>> +     cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
>> +}
>> +
>> +/* Take into account change of load of a child task group */
>> +static inline void
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +{
>> +     struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>> +     long delta, load = gcfs_rq->avg.load_avg;
>> +
>> +     /* If the load of group cfs_rq is null, the load of the
>> +      * sched_entity will also be null so we can skip the formula
>> +      */
>> +     if (load) {
>> +             long tg_load;
>> +
>> +             /* Get tg's load and ensure tg_load > 0 */
>> +             tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
>> +
>> +             /* Ensure tg_load >= load and updated with current load*/
>> +             tg_load -= gcfs_rq->tg_load_avg_contrib;
>> +             tg_load += load;
>> +
>> +             /* scale gcfs_rq's load into tg's shares*/
>> +             load *= scale_load_down(gcfs_rq->tg->shares);
>> +             load /= tg_load;
>> +
>> +             /*
>> +              * we need to compute a correction term in the case that the
>> +              * task group is consuming <1 cpu so that we would contribute
>> +              * the same load as a task of equal weight.
>> +             */
>> +             if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
>> +                     load *= tg_load;
>> +                     load /= scale_load_down(gcfs_rq->tg->shares);
>> +             }
>> +     }
>> +
>> +     delta = load - se->avg.load_avg;
>> +
>> +     /* Nothing to update */
>> +     if (!delta)
>> +             return;
>> +
>> +     /* Set new sched_entity's load */
>> +     se->avg.load_avg = load;
>> +     se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>> +
>> +     /* Update parent cfs_rq load */
>> +     cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg + delta, 0);
>
> And this..
>
>> +     cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> +}
>
> Re-introduce the issue from: 897418922215 ("sched/fair: Fix cfs_rq avg
> tracking underflow").


Yes, sorry, I forgot this point. I will use sub_positive() instead.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/7 v3] sched: factorize PELT update
  2016-09-12  7:47 ` [PATCH 3/7 v3] sched: factorize PELT update Vincent Guittot
@ 2016-09-15 13:09   ` Peter Zijlstra
  2016-09-15 13:30     ` Vincent Guittot
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-15 13:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, yuyang.du, Morten.Rasmussen, linaro-kernel,
	dietmar.eggemann, pjt, bsegall

On Mon, Sep 12, 2016 at 09:47:48AM +0200, Vincent Guittot wrote:
> @@ -3690,7 +3658,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	/*
>  	 * Ensure that runnable average is periodically updated.
>  	 */
> -	update_load_avg(curr, 1);
> +	update_load_avg(curr, 1, 0);
>  	update_cfs_shares(cfs_rq);
>  

I would find something like: update_load_avg(curr, UPDATE_TG), eg, make
the second argument a bitflag instead of two arguments, much more
readable.
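
Something like this, say (a rough sketch, the names are only a suggestion):

#define UPDATE_TG	0x1
#define SKIP_AGE_LOAD	0x2

static inline void update_load_avg(struct sched_entity *se, int flags)
{
	...
	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
		__update_load_avg(now, cpu, &se->avg,
				  se->on_rq * scale_load_down(se->load.weight),
				  cfs_rq->curr == se, NULL);

	if (update_cfs_rq_load_avg(now, cfs_rq, true) && (flags & UPDATE_TG))
		update_tg_load_avg(cfs_rq, 0);
}

with callers then doing update_load_avg(se, UPDATE_TG), update_load_avg(se, 0),
update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD), etc.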

Do however check that it doesn't generate retarded code if you do that.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 12:59   ` Peter Zijlstra
@ 2016-09-15 13:11     ` Vincent Guittot
  0 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-15 13:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 15 September 2016 at 14:59, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 12, 2016 at 09:47:49AM +0200, Vincent Guittot wrote:
>> +     /* If the load of group cfs_rq is null, the load of the
>> +      * sched_entity will also be null so we can skip the formula
>> +      */
>
> https://lkml.kernel.org/r/CA+55aFyQYJerovMsSoSKS7PessZBr4vNp-3QUUwhqk4A4_jcbg@mail.gmail.com
>

I'm going to fix it right now.

I thought that checkpatch.pl would have raised a warning for this
comment style issue

>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
  2016-09-15 12:55   ` Peter Zijlstra
  2016-09-15 12:59   ` Peter Zijlstra
@ 2016-09-15 13:11   ` Dietmar Eggemann
  2016-09-15 14:31     ` Vincent Guittot
  2016-09-15 15:14     ` Peter Zijlstra
  2016-09-15 14:43   ` Peter Zijlstra
  2016-09-19  3:19   ` Wanpeng Li
  4 siblings, 2 replies; 41+ messages in thread
From: Dietmar Eggemann @ 2016-09-15 13:11 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, yuyang.du,
	Morten.Rasmussen
  Cc: linaro-kernel, pjt, bsegall

On 12/09/16 08:47, Vincent Guittot wrote:
> When a task moves from/to a cfs_rq, we set a flag which is then used to
> propagate the change at parent level (sched_entity and cfs_rq) during
> next update. If the cfs_rq is throttled, the flag will stay pending until
> the cfs_rw is unthrottled.
> 
> For propagating the utilization, we copy the utilization of child cfs_rq to

s/child/group ?

> the sched_entity.
> 
> For propagating the load, we have to take into account the load of the
> whole task group in order to evaluate the load of the sched_entity.
> Similarly to what was done before the rewrite of PELT, we add a correction
> factor in case the task group's load is less than its share so it will
> contribute the same load of a task of equal weight.

What about cfs_rq->runnable_load_avg?

[...]

> +/* Take into account change of load of a child task group */
> +static inline void
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> +	long delta, load = gcfs_rq->avg.load_avg;
> +
> +	/* If the load of group cfs_rq is null, the load of the
> +	 * sched_entity will also be null so we can skip the formula
> +	 */
> +	if (load) {
> +		long tg_load;
> +
> +		/* Get tg's load and ensure tg_load > 0 */
> +		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> +
> +		/* Ensure tg_load >= load and updated with current load*/
> +		tg_load -= gcfs_rq->tg_load_avg_contrib;
> +		tg_load += load;
> +
> +		/* scale gcfs_rq's load into tg's shares*/
> +		load *= scale_load_down(gcfs_rq->tg->shares);
> +		load /= tg_load;
> +
> +		/*
> +		 * we need to compute a correction term in the case that the
> +		 * task group is consuming <1 cpu so that we would contribute
> +		 * the same load as a task of equal weight.

Wasn't 'consuming <1' related to 'NICE_0_LOAD' and not
scale_load_down(gcfs_rq->tg->shares) before the rewrite of PELT (v4.2,
__update_group_entity_contrib())?

> +		*/
> +		if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
> +			load *= tg_load;
> +			load /= scale_load_down(gcfs_rq->tg->shares);
> +		}
> +	}

[...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-12  7:47 ` [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class Vincent Guittot
@ 2016-09-15 13:18   ` Peter Zijlstra
  2016-09-15 15:36     ` Vincent Guittot
  2016-09-16 10:51   ` Peter Zijlstra
  2016-09-30 12:01   ` [tip:sched/core] sched/core: Fix incorrect " tip-bot for Vincent Guittot
  2 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-15 13:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, yuyang.du, Morten.Rasmussen, linaro-kernel,
	dietmar.eggemann, pjt, bsegall

On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:
> When a task switches to fair scheduling class, the period between now and
> the last update of its utilization is accounted as running time whatever
> happened during this period. This wrong accounting applies to the task
> and also to the task group branch.
> 
> When changing the property of a running task like its list of allowed CPUs
> or its scheduling class, we follow the sequence:
> -dequeue task
> -put task
> -change the property
> -set task as current task
> -enqueue task
> 
> The end of the sequence doesn't follow the normal sequence which is :
> -enqueue a task
> -then set the task as current task.
> 
> This wrong ordering is the root cause of wrong utilization accounting.
> Update the sequence to follow the right one:
> -dequeue task
> -put task
> -change the property
> -enqueue task
> -set task as current task

But enqueue_entity depends on cfs_rq->curr, which is set by
set_curr_task_fair().

Also, the normalize comment in dequeue_entity() worries me, 'someone'
didn't update that when he moved update_min_vruntime() around.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/7 v3] sched: factorize PELT update
  2016-09-15 13:09   ` Peter Zijlstra
@ 2016-09-15 13:30     ` Vincent Guittot
  0 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-15 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 15 September 2016 at 15:09, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 12, 2016 at 09:47:48AM +0200, Vincent Guittot wrote:
>> @@ -3690,7 +3658,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>       /*
>>        * Ensure that runnable average is periodically updated.
>>        */
>> -     update_load_avg(curr, 1);
>> +     update_load_avg(curr, 1, 0);
>>       update_cfs_shares(cfs_rq);
>>
>
> I would find something like: update_load_avg(curr, UPDATE_TG), eg, make
> the second argument a bitflag instead of two arguments, much more
> readable.

OK, I'm going to use bitflag
>
> Do however check that it doesn't generate retarded code if you do that.

OK

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 13:11   ` Dietmar Eggemann
@ 2016-09-15 14:31     ` Vincent Guittot
  2016-09-15 17:20       ` Dietmar Eggemann
  2016-09-15 15:14     ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Vincent Guittot @ 2016-09-15 14:31 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Paul Turner,
	Benjamin Segall

On 15 September 2016 at 15:11, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> On 12/09/16 08:47, Vincent Guittot wrote:
>> When a task moves from/to a cfs_rq, we set a flag which is then used to
>> propagate the change at parent level (sched_entity and cfs_rq) during
>> next update. If the cfs_rq is throttled, the flag will stay pending until
>> the cfs_rw is unthrottled.
>>
>> For propagating the utilization, we copy the utilization of child cfs_rq to
>
> s/child/group ?
>
>> the sched_entity.
>>
>> For propagating the load, we have to take into account the load of the
>> whole task group in order to evaluate the load of the sched_entity.
>> Similarly to what was done before the rewrite of PELT, we add a correction
>> factor in case the task group's load is less than its share so it will
>> contribute the same load of a task of equal weight.
>
> What about cfs_rq->runnable_load_avg?

The sched_entity's load is updated before being enqueued, so the up-to-date
value will be added to cfs_rq->runnable_load_avg... unless the se is already
enqueued... so cfs_rq->runnable_load_avg should also be updated if the se is
already on_rq. I'm going to add this case.

Thanks for pointing out this case
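
Roughly something like this at the end of update_tg_cfs_load() (just a
sketch, not the final code):

	/*
	 * If the group se is already enqueued, the parent's
	 * runnable_load_* must see the same delta as well.
	 */
	if (se->on_rq) {
		cfs_rq->runnable_load_avg =
			max_t(long, cfs_rq->runnable_load_avg + delta, 0);
		cfs_rq->runnable_load_sum =
			cfs_rq->runnable_load_avg * LOAD_AVG_MAX;
	}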

>
> [...]
>
>> +/* Take into account change of load of a child task group */
>> +static inline void
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +{
>> +     struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>> +     long delta, load = gcfs_rq->avg.load_avg;
>> +
>> +     /* If the load of group cfs_rq is null, the load of the
>> +      * sched_entity will also be null so we can skip the formula
>> +      */
>> +     if (load) {
>> +             long tg_load;
>> +
>> +             /* Get tg's load and ensure tg_load > 0 */
>> +             tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
>> +
>> +             /* Ensure tg_load >= load and updated with current load*/
>> +             tg_load -= gcfs_rq->tg_load_avg_contrib;
>> +             tg_load += load;
>> +
>> +             /* scale gcfs_rq's load into tg's shares*/
>> +             load *= scale_load_down(gcfs_rq->tg->shares);
>> +             load /= tg_load;
>> +
>> +             /*
>> +              * we need to compute a correction term in the case that the
>> +              * task group is consuming <1 cpu so that we would contribute
>> +              * the same load as a task of equal weight.
>
> Wasn't 'consuming <1' related to 'NICE_0_LOAD' and not
> scale_load_down(gcfs_rq->tg->shares) before the rewrite of PELT (v4.2,
> __update_group_entity_contrib())?

Yes before the rewrite, the condition (tg->runnable_avg < NICE_0_LOAD) was used.

I have used the following examples to choose the condition:

A task group with only one always-running task TA, whose weight equals
tg->shares, will have a tg load (cfs_rq->tg->load_avg) equal to TA's
weight == scale_load_down(tg->shares): the load of the CPU on which the
task runs will be scale_load_down(task's weight) ==
scale_load_down(tg->shares) and the load of the other CPUs will be null.
In this case, all shares will be given to the cfs_rq CFS1 on which TA runs
and the load of the sched_entity SB that represents CFS1 at the parent
level will be scale_load_down(SB's weight) = scale_load_down(tg->shares).

If TA is not an always-running task, its load will be less than its weight
and less than scale_load_down(tg->shares), and as a result tg->load_avg
will be less than scale_load_down(tg->shares). Nevertheless, the weight of
SB is still scale_load_down(tg->shares) and its load should be the same as
TA's. But the first part of the calculation gives a load of
scale_load_down(gcfs_rq->tg->shares) because tg_load ==
gcfs_rq->tg_load_avg_contrib == load. So if tg_load <
scale_load_down(gcfs_rq->tg->shares), we have to correct the load that we
set on SB.

>
>> +             */
>> +             if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
>> +                     load *= tg_load;
>> +                     load /= scale_load_down(gcfs_rq->tg->shares);
>> +             }
>> +     }
>
> [...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
                     ` (2 preceding siblings ...)
  2016-09-15 13:11   ` Dietmar Eggemann
@ 2016-09-15 14:43   ` Peter Zijlstra
  2016-09-15 14:51     ` Vincent Guittot
  2016-09-19  3:19   ` Wanpeng Li
  4 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-15 14:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, yuyang.du, Morten.Rasmussen, linaro-kernel,
	dietmar.eggemann, pjt, bsegall

On Mon, Sep 12, 2016 at 09:47:49AM +0200, Vincent Guittot wrote:
> +static inline void
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> +	long delta, load = gcfs_rq->avg.load_avg;
> +
> +	/* If the load of group cfs_rq is null, the load of the
> +	 * sched_entity will also be null so we can skip the formula
> +	 */
> +	if (load) {
> +		long tg_load;
> +
> +		/* Get tg's load and ensure tg_load > 0 */
> +		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> +
> +		/* Ensure tg_load >= load and updated with current load*/
> +		tg_load -= gcfs_rq->tg_load_avg_contrib;
> +		tg_load += load;
> +
> +		/* scale gcfs_rq's load into tg's shares*/
> +		load *= scale_load_down(gcfs_rq->tg->shares);
> +		load /= tg_load;
> +
> +		/*
> +		 * we need to compute a correction term in the case that the
> +		 * task group is consuming <1 cpu so that we would contribute
> +		 * the same load as a task of equal weight.
> +		*/
> +		if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
> +			load *= tg_load;
> +			load /= scale_load_down(gcfs_rq->tg->shares);
> +		}

Note that you're reversing the exact scaling you just applied.

That is:
               shares    tg_load
	load * ------- * ------- == load
               tg_load   shares

> +	}

So something like:

	shares = scale_load_down(gcfs_rq->tg->shares);

	if (tg_load >= shares) {
		load *= shares;
		load /= tg_load;
	}

should be the same as the above and saves a bunch of math, no?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 14:43   ` Peter Zijlstra
@ 2016-09-15 14:51     ` Vincent Guittot
  0 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-15 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 15 September 2016 at 16:43, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 12, 2016 at 09:47:49AM +0200, Vincent Guittot wrote:
>> +static inline void
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +{
>> +     struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>> +     long delta, load = gcfs_rq->avg.load_avg;
>> +
>> +     /* If the load of group cfs_rq is null, the load of the
>> +      * sched_entity will also be null so we can skip the formula
>> +      */
>> +     if (load) {
>> +             long tg_load;
>> +
>> +             /* Get tg's load and ensure tg_load > 0 */
>> +             tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
>> +
>> +             /* Ensure tg_load >= load and updated with current load*/
>> +             tg_load -= gcfs_rq->tg_load_avg_contrib;
>> +             tg_load += load;
>> +
>> +             /* scale gcfs_rq's load into tg's shares*/
>> +             load *= scale_load_down(gcfs_rq->tg->shares);
>> +             load /= tg_load;
>> +
>> +             /*
>> +              * we need to compute a correction term in the case that the
>> +              * task group is consuming <1 cpu so that we would contribute
>> +              * the same load as a task of equal weight.
>> +             */
>> +             if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
>> +                     load *= tg_load;
>> +                     load /= scale_load_down(gcfs_rq->tg->shares);
>> +             }
>
> Note that you're reversing the exact scaling you just applied.

Yes, indeed.

>
> That is:
>                shares    tg_load
>         load * ------- * ------- == load
>                tg_load   shares
>
>> +     }
>
> So something like:
>
>         shares = scale_load_down(gcfs_rq->tg->shares);
>
>         if (tg_load >= shares) {
>                 load *= shares;
>                 load /= tg_load;
>         }
>
> should be the same as the above and saves a bunch of math, no?

Yes

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 13:11   ` Dietmar Eggemann
  2016-09-15 14:31     ` Vincent Guittot
@ 2016-09-15 15:14     ` Peter Zijlstra
  2016-09-15 17:36       ` Dietmar Eggemann
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-15 15:14 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, linux-kernel, yuyang.du,
	Morten.Rasmussen, linaro-kernel, pjt, bsegall

On Thu, Sep 15, 2016 at 02:11:49PM +0100, Dietmar Eggemann wrote:
> On 12/09/16 08:47, Vincent Guittot wrote:

> > +/* Take into account change of load of a child task group */
> > +static inline void
> > +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +{
> > +	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> > +	long delta, load = gcfs_rq->avg.load_avg;
> > +
> > +	/* If the load of group cfs_rq is null, the load of the
> > +	 * sched_entity will also be null so we can skip the formula
> > +	 */
> > +	if (load) {
> > +		long tg_load;
> > +
> > +		/* Get tg's load and ensure tg_load > 0 */
> > +		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> > +
> > +		/* Ensure tg_load >= load and updated with current load*/
> > +		tg_load -= gcfs_rq->tg_load_avg_contrib;
> > +		tg_load += load;
> > +
> > +		/* scale gcfs_rq's load into tg's shares*/
> > +		load *= scale_load_down(gcfs_rq->tg->shares);
> > +		load /= tg_load;
> > +
> > +		/*
> > +		 * we need to compute a correction term in the case that the
> > +		 * task group is consuming <1 cpu so that we would contribute
> > +		 * the same load as a task of equal weight.
> 
> Wasn't 'consuming <1' related to 'NICE_0_LOAD' and not
> scale_load_down(gcfs_rq->tg->shares) before the rewrite of PELT (v4.2,
> __update_group_entity_contrib())?


So the approximation was: min(1, runnable_avg) * shares;

And it just so happened that we tracked runnable_avg in 10 bit fixed
point, which then happened to be NICE_0_LOAD.

But here we have load_avg, which already includes a '* shares' factor.
So that then becomes min(shares, load_avg).

We did however lose a lot on why and how min(1, runnable_avg) is a
sensible thing to do...
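
Put differently, as a rough sketch (not the exact old code):

	old:  contrib ~= shares * min(NICE_0_LOAD, runnable_avg) / NICE_0_LOAD
	new:  se load ~= min(scale_load_down(shares), gcfs_rq->avg.load_avg)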

> > +		*/
> > +		if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
> > +			load *= tg_load;
> > +			load /= scale_load_down(gcfs_rq->tg->shares);
> > +		}
> > +	}
> 
> [...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-15 13:18   ` Peter Zijlstra
@ 2016-09-15 15:36     ` Vincent Guittot
  2016-09-16 12:16       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Vincent Guittot @ 2016-09-15 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 15 September 2016 at 15:18, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:
>> When a task switches to fair scheduling class, the period between now and
>> the last update of its utilization is accounted as running time whatever
>> happened during this period. This wrong accounting applies to the task
>> and also to the task group branch.
>>
>> When changing the property of a running task like its list of allowed CPUs
>> or its scheduling class, we follow the sequence:
>> -dequeue task
>> -put task
>> -change the property
>> -set task as current task
>> -enqueue task
>>
>> The end of the sequence doesn't follow the normal sequence which is :
>> -enqueue a task
>> -then set the task as current task.
>>
>> This wrong ordering is the root cause of wrong utilization accounting.
>> Update the sequence to follow the right one:
>> -dequeue task
>> -put task
>> -change the property
>> -enqueue task
>> -set task as current task
>
> But enqueue_entity depends on cfs_rq->curr, which is set by
> set_curr_task_fair().

With this sequence, cfs_rq->curr is null and the cfs_rq is "idle" as
the entity has been dequeued and put back in the rb tree for the time
it takes to change the properties.

enqueue_entity uses cfs_rq->curr == se for:
- updating current. With this sequence, current is now null so there is
nothing to do
- skipping the enqueue of the se in the rb tree. With this sequence, se
is put in the rb tree during the enqueue and taken back out when it is
set as the current task

I don't see any functional issue, but we are not doing the same steps
with the new sequence.
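
For reference, the two checks in question look roughly like this
(simplified sketch of enqueue_entity(), not the verbatim code):

	bool curr = cfs_rq->curr == se;

	if (renorm && curr)		/* renormalize current before update_curr() */
		se->vruntime += cfs_rq->min_vruntime;
	...
	if (se != cfs_rq->curr)		/* current is kept out of the rb tree */
		__enqueue_entity(cfs_rq, se);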

>
> Also, the normalize comment in dequeue_entity() worries me, 'someone'
> didn't update that when he moved update_min_vruntime() around.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 14:31     ` Vincent Guittot
@ 2016-09-15 17:20       ` Dietmar Eggemann
  0 siblings, 0 replies; 41+ messages in thread
From: Dietmar Eggemann @ 2016-09-15 17:20 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Paul Turner,
	Benjamin Segall

On 15/09/16 15:31, Vincent Guittot wrote:
> On 15 September 2016 at 15:11, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:

[...]

>> Wasn't 'consuming <1' related to 'NICE_0_LOAD' and not
>> scale_load_down(gcfs_rq->tg->shares) before the rewrite of PELT (v4.2,
>> __update_group_entity_contrib())?
> 
> Yes before the rewrite, the condition (tg->runnable_avg < NICE_0_LOAD) was used.
> 
> I have used the following examples to choose the condition:
> 
> A task group with only one always running task TA with a weight equals
> to tg->shares, will have a tg's load (cfs_rq->tg->load_avg) equals to
> TA's weight == scale_load_down(tg->shares): The load of the CPU on
> which the task runs, will be scale_load_down(task's weight) ==
> scale_load_down(tg->shares) and the load of others CPUs will be null.
> In this case, all shares will be given to cfs_rq CFS1 on which TA runs
> and the load of the sched_entity SB that represents CFS1 at parent
> level will be scale_load_down(SB's weight) =
> scale_load_down(tg->shares).
> 
> If the TA is not an always running task, its load will be less than
> its weight and less than scale_load_down(tg->shares) and as a result
> tg->load_avg will be less than scale_load_down(tg->shares).
> Nevertheless, the weight of SB is still scale_load_down(tg->shares)
> and its load should be the same as TA. But the 1st part of the
> calculation gives a load of scale_load_down(gcfs_rq->tg->shares)
> because tg_load == gcfs_rq->tg_load_avg_contrib == load. So if tg_load
> < scale_load_down(gcfs_rq->tg->shares), we have to correct the load
> that we set to SEB

Makes sense to me now. Thanks. Peter already pointed out that this math
can be made easier, so you will probably 'scale gcfs_rq's load into tg's
shares' only if 'tg_load >= shares'.

[...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 15:14     ` Peter Zijlstra
@ 2016-09-15 17:36       ` Dietmar Eggemann
  2016-09-15 17:54         ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Dietmar Eggemann @ 2016-09-15 17:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, yuyang.du,
	Morten.Rasmussen, linaro-kernel, pjt, bsegall

On 15/09/16 16:14, Peter Zijlstra wrote:
> On Thu, Sep 15, 2016 at 02:11:49PM +0100, Dietmar Eggemann wrote:
>> On 12/09/16 08:47, Vincent Guittot wrote:
> 
>>> +/* Take into account change of load of a child task group */
>>> +static inline void
>>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> +{
>>> +	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>> +	long delta, load = gcfs_rq->avg.load_avg;
>>> +
>>> +	/* If the load of group cfs_rq is null, the load of the
>>> +	 * sched_entity will also be null so we can skip the formula
>>> +	 */
>>> +	if (load) {
>>> +		long tg_load;
>>> +
>>> +		/* Get tg's load and ensure tg_load > 0 */
>>> +		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
>>> +
>>> +		/* Ensure tg_load >= load and updated with current load*/
>>> +		tg_load -= gcfs_rq->tg_load_avg_contrib;
>>> +		tg_load += load;
>>> +
>>> +		/* scale gcfs_rq's load into tg's shares*/
>>> +		load *= scale_load_down(gcfs_rq->tg->shares);
>>> +		load /= tg_load;
>>> +
>>> +		/*
>>> +		 * we need to compute a correction term in the case that the
>>> +		 * task group is consuming <1 cpu so that we would contribute
>>> +		 * the same load as a task of equal weight.
>>
>> Wasn't 'consuming <1' related to 'NICE_0_LOAD' and not
>> scale_load_down(gcfs_rq->tg->shares) before the rewrite of PELT (v4.2,
>> __update_group_entity_contrib())?
> 
> 
> So the approximation was: min(1, runnable_avg) * shares;
> 
> And it just so happened that we tracked runnable_avg in 10 bit fixed
> point, which then happened to be NICE_0_LOAD.
> 
> But here we have load_avg, which already includes a '* shares' factor.
> So that then becomes min(shares, load_avg).

Makes sense, understand it now.

> We did however loose a lot on why and how min(1, runnable_avg) is a
> sensible thing to do...

Do you refer to the big comment on top of this if condition in the old
code in __update_group_entity_contrib()? The last two subsections of it
I never understood ...

[...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-15 17:36       ` Dietmar Eggemann
@ 2016-09-15 17:54         ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-15 17:54 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, linux-kernel, yuyang.du,
	Morten.Rasmussen, linaro-kernel, pjt, bsegall

On Thu, Sep 15, 2016 at 06:36:53PM +0100, Dietmar Eggemann wrote:

> > We did however loose a lot on why and how min(1, runnable_avg) is a
> > sensible thing to do...
> 
> Do you refer to the big comment on top of this if condition in the old
> code in __update_group_entity_contrib()? The last two subsections of it
> I never understood ...
> 
> [...]

Read the thread here:

  http://www.gossamer-threads.com/lists/linux/kernel/1558210

But Vince's explanation earlier is also good.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-12  7:47 ` [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class Vincent Guittot
  2016-09-15 13:18   ` Peter Zijlstra
@ 2016-09-16 10:51   ` Peter Zijlstra
  2016-09-16 12:45     ` Vincent Guittot
  2016-09-30 12:01   ` [tip:sched/core] sched/core: Fix incorrect " tip-bot for Vincent Guittot
  2 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-16 10:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, yuyang.du, Morten.Rasmussen, linaro-kernel,
	dietmar.eggemann, pjt, bsegall

On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:

> -dequeue task
> -put task
> -change the property
> -enqueue task
> -set task as current task

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e52d08..7a9c9b9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1105,10 +1105,10 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>  
>  	p->sched_class->set_cpus_allowed(p, new_mask);
>  
> -	if (running)
> -		p->sched_class->set_curr_task(rq);
>  	if (queued)
>  		enqueue_task(rq, p, ENQUEUE_RESTORE);
> +	if (running)
> +		p->sched_class->set_curr_task(rq);
>  }
>  
>  /*

So one thing that I've wanted to do for a while, but never managed to
come up with a sensible way to do is encapsulate this pattern.

The two options I came up with are:

#define FOO(p, stmt)
({
	struct rq *rq = task_rq(p);
	bool queued = task_on_rq_queued(p);
	bool running = task_current(rq);
	int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */

	if (queued)
		dequeue_task(rq, p, queue_flags);
	if (running)
		put_prev_task(rq, p);

	stmt;

	if (queued)
		enqueue_task(rq, p, queue_flags);
	if (running)
		set_curr_task(rq, p);
})

and

void foo(struct task_struct *p, void (*func)(struct task_struct *, int *))
{
	struct rq *rq = task_rq(p);
	bool queued = task_on_rq_queued(p);
	bool running = task_current(rq);
	int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */

	if (queued)
		dequeue_task(rq, p, queue_flags);
	if (running)
		put_prev_task(rq, p);

	func(p, &queue_flags);

	if (queued)
		enqueue_task(rq, p, queue_flags);
	if (running)
		set_curr_task(rq, p);
}

Neither results in particularly pretty code. Although I suppose if I'd
have to pick one I'd go for the macro variant.

Opinions? I'm fine with leaving the code as is, just wanted to throw
this out there.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-15 15:36     ` Vincent Guittot
@ 2016-09-16 12:16       ` Peter Zijlstra
  2016-09-16 14:23         ` Vincent Guittot
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-16 12:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On Thu, Sep 15, 2016 at 05:36:58PM +0200, Vincent Guittot wrote:
> On 15 September 2016 at 15:18, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:

> >> Update the sequence to follow the right one:
> >> -dequeue task
> >> -put task
> >> -change the property
> >> -enqueue task
> >> -set task as current task
> >
> > But enqueue_entity depends on cfs_rq->curr, which is set by
> > set_curr_task_fair().
> 
> With this sequence, cfs_rq->curr is null and the cfs_rq is "idle" as
> the entity has been dequeued and put back in the rb tree the time to
> change the properties.
> 
> enqueue_entity use cfs_rq->cur == se for:
> - updating current. With this sequence, current is now null so nothing to do
> - to skip the enqueue of the se in rb tree. With this sequence, se is
> put in the rb tree during the enqueue and take back during the set
> task as current task
> 
> I don't see any functional issue but we are not doing the same step
> with the new sequence

So I think you're right in that it should work.

I also think we can then simplify enqueue_entity() in that it will never
be possible to enqueue current with your change.

But my brain just isn't working today, so who knows.

> > Also, the normalize comment in dequeue_entity() worries me, 'someone'
> > didn't update that when he moved update_min_vruntime() around.

I now worry more, so we do:

	dequeue_task := dequeue_task_fair (p == current)
	  dequeue_entity
	    update_curr()
	      update_min_vruntime()
	    vruntime -= min_vruntime
	    update_min_vruntime()
	      // use cfs_rq->curr, which we just normalized !

	put_prev_task := put_prev_task_fair
	  put_prev_entity
	    cfs_rq->curr = NULL;


Now the point of the latter update_min_vruntime() is to advance
min_vruntime when the task we removed was the one holding it back.

However, it means that if we do dequeue+enqueue, we're further in the
future (ie. we get penalized).

So I'm inclined to simply remove the (2nd) update_min_vruntime() call.
But as said above, my brain isn't co-operating much today.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-16 10:51   ` Peter Zijlstra
@ 2016-09-16 12:45     ` Vincent Guittot
  0 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-16 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 16 September 2016 at 12:51, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:
>
> > -dequeue task
> > -put task
> > -change the property
> > -enqueue task
> > -set task as current task
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3e52d08..7a9c9b9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1105,10 +1105,10 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
> >
> >       p->sched_class->set_cpus_allowed(p, new_mask);
> >
> > -     if (running)
> > -             p->sched_class->set_curr_task(rq);
> >       if (queued)
> >               enqueue_task(rq, p, ENQUEUE_RESTORE);
> > +     if (running)
> > +             p->sched_class->set_curr_task(rq);
> >  }
> >
> >  /*
>
> So one thing that I've wanted to do for a while, but never managed to
> come up with a sensible way to do is encapsulate this pattern.
>
> The two options I came up with are:
>
> #define FOO(p, stmt)
> ({
>         struct rq *rq = task_rq(p);
>         bool queued = task_on_rq_queued(p);
>         bool running = task_current(rq);
>         int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */
>
>         if (queued)
>                 dequeue_task(rq, p, queue_flags);
>         if (running)
>                 put_prev_task(rq, p);
>
>         stmt;
>
>         if (queued)
>                 enqueue_task(rq, p, queue_flags);
>         if (running)
>                 set_curr_task(rq, p);
> })
>
> and
>
> void foo(struct task_struct *p, void (*func)(struct task_struct *, int *))
> {
>         struct rq *rq = task_rq(p);
>         bool queued = task_on_rq_queued(p);
>         bool running = task_current(rq);
>         int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */
>
>         if (queued)
>                 dequeue_task(rq, p, queue_flags);
>         if (running)
>                 put_prev_task(rq, p);
>
>         func(p, &queue_flags);
>
>         if (queued)
>                 enqueue_task(rq, p, queue_flags);
>         if (running)
>                 set_curr_task(rq, p);
> }
>
> Neither results in particularly pretty code. Although I suppose if I'd
> have to pick one I'd go for the macro variant.
>
> Opinions? I'm fine with leaving the code as is, just wanted to throw
> this out there.

I'm not convinced by using such encapsulation, as it adds the
constraint of having a function to pass, which is not always the case,
and it hides a bit what happens with this function.
What about creating a task_FOO_save and a task_FOO_restore macro?
Something like:

#define task_FOO_save(p, rq, flags)
({
        bool queued = task_on_rq_queued(p);
        bool running = task_current(rq);
        int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */

        if (queued)
                dequeue_task(rq, p, queue_flags);
        if (running)
                put_prev_task(rq, p);
        flags = queued | running << 1;
})

#define task_FOO_restore(p, rq, flags)
({
        bool queued = flags & 0x1;
        bool running = flags & 0x2;
        int queue_flags = ENQUEUE_RESTORE;

        if (queued)
                enqueue_task(rq, p, queue_flags);
        if (running)
                set_curr_task(rq, p);
})
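
For example (hypothetical, just to show the shape it would give to
do_set_cpus_allowed()):

	int flags;

	task_FOO_save(p, rq, flags);

	p->sched_class->set_cpus_allowed(p, new_mask);

	task_FOO_restore(p, rq, flags);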

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-16 12:16       ` Peter Zijlstra
@ 2016-09-16 14:23         ` Vincent Guittot
  2016-09-20 11:54           ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Vincent Guittot @ 2016-09-16 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 16 September 2016 at 14:16, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 15, 2016 at 05:36:58PM +0200, Vincent Guittot wrote:
>> On 15 September 2016 at 15:18, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:
>
>> >> Update the sequence to follow the right one:
>> >> -dequeue task
>> >> -put task
>> >> -change the property
>> >> -enqueue task
>> >> -set task as current task
>> >
>> > But enqueue_entity depends on cfs_rq->curr, which is set by
>> > set_curr_task_fair().
>>
>> With this sequence, cfs_rq->curr is null and the cfs_rq is "idle" as
>> the entity has been dequeued and put back in the rb tree the time to
>> change the properties.
>>
>> enqueue_entity use cfs_rq->cur == se for:
>> - updating current. With this sequence, current is now null so nothing to do
>> - to skip the enqueue of the se in rb tree. With this sequence, se is
>> put in the rb tree during the enqueue and take back during the set
>> task as current task
>>
>> I don't see any functional issue but we are not doing the same step
>> with the new sequence
>
> So I think you're right in that it should work.
>
> I also think we can then simplify enqueue_entity() in that it will never
> be possible to enqueue current with your change.
>
> But my brain just isn't working today, so who knows.
>
>> > Also, the normalize comment in dequeue_entity() worries me, 'someone'
>> > didn't update that when he moved update_min_vruntime() around.
>
> I now worry more, so we do:
>
>         dequeue_task := dequeue_task_fair (p == current)
>           dequeue_entity
>             update_curr()
>               update_min_vruntime()
>             vruntime -= min_vruntime
>             update_min_vruntime()
>               // use cfs_rq->curr, which we just normalized !

Yes, but does it really change cfs_rq->min_vruntime in this case?

If curr is the task with the smallest vruntime of the cfs_rq,
cfs_rq->min_vruntime has been aligned with curr->vruntime during
update_curr(). So vruntime -= min_vruntime will certainly be less than
cfs_rq->min_vruntime, and cfs_rq->min_vruntime stays unchanged.

If curr is not the task with the smallest vruntime of the cfs_rq,
cfs_rq->min_vruntime has been aligned with the leftmost entity. And
vruntime -= min_vruntime will not change anything during the 2nd
update_min_vruntime, as it will be either greater than
leftmost->vruntime or less than cfs_rq->min_vruntime.
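
For instance, with small (non-wrapping) hypothetical numbers for the 1st
case: update_curr() has just aligned cfs_rq->min_vruntime to 102,
curr->vruntime is then normalized to 0, and the leftmost entity sits at
150:

	vruntime     = curr->vruntime          = 0
	vruntime     = min_vruntime(0, 150)    = 0
	min_vruntime = max_vruntime(102, 0)    = 102	/* unchanged */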

>
>         put_prev_task := put_prev_task_fair
>           put_prev_entity
>             cfs_rq->curr = NULL;
>
>
> Now the point of the latter update_min_vruntime() is to advance
> min_vruntime when the task we removed was the one holding it back.
>
> However, it means that if we do dequeue+enqueue, we're further in the
> future (ie. we get penalized).
>
> So I'm inclined to simply remove the (2nd) update_min_vruntime() call.
> But as said above, my brain isn't co-operating much today.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach
  2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
                     ` (3 preceding siblings ...)
  2016-09-15 14:43   ` Peter Zijlstra
@ 2016-09-19  3:19   ` Wanpeng Li
  4 siblings, 0 replies; 41+ messages in thread
From: Wanpeng Li @ 2016-09-19  3:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, linaro-kernel, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

2016-09-12 15:47 GMT+08:00 Vincent Guittot <vincent.guittot@linaro.org>:
> When a task moves from/to a cfs_rq, we set a flag which is then used to
> propagate the change at parent level (sched_entity and cfs_rq) during
> next update. If the cfs_rq is throttled, the flag will stay pending until
> the cfs_rw is unthrottled.
>
> For propagating the utilization, we copy the utilization of child cfs_rq to
> the sched_entity.
>
> For propagating the load, we have to take into account the load of the
> whole task group in order to evaluate the load of the sched_entity.
> Similarly to what was done before the rewrite of PELT, we add a correction
> factor in case the task group's load is less than its share so it will
> contribute the same load of a task of equal weight.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c  | 170 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h |   1 +
>  2 files changed, 170 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0aa1d7d..e4015f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3017,6 +3017,132 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>         }
>  }
>
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +/* Take into account change of utilization of a child task group */
> +static inline void
> +update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       struct cfs_rq *gcfs_rq =  group_cfs_rq(se);
> +       long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
> +
> +       /* Nothing to update */
> +       if (!delta)
> +               return;
> +
> +       /* Set new sched_entity's utilizaton */

s/utilizaton/utilization

> +       se->avg.util_avg = gcfs_rq->avg.util_avg;
> +       se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
> +
> +       /* Update parent cfs_rq utilization */
> +       cfs_rq->avg.util_avg =  max_t(long, cfs_rq->avg.util_avg + delta, 0);
> +       cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
> +}
> +
> +/* Take into account change of load of a child task group */
> +static inline void
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> +       long delta, load = gcfs_rq->avg.load_avg;
> +
> +       /* If the load of group cfs_rq is null, the load of the
> +        * sched_entity will also be null so we can skip the formula
> +        */
> +       if (load) {
> +               long tg_load;
> +
> +               /* Get tg's load and ensure tg_load > 0 */
> +               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> +
> +               /* Ensure tg_load >= load and updated with current load*/
> +               tg_load -= gcfs_rq->tg_load_avg_contrib;
> +               tg_load += load;
> +
> +               /* scale gcfs_rq's load into tg's shares*/
> +               load *= scale_load_down(gcfs_rq->tg->shares);
> +               load /= tg_load;
> +
> +               /*
> +                * we need to compute a correction term in the case that the
> +                * task group is consuming <1 cpu so that we would contribute
> +                * the same load as a task of equal weight.
> +               */
> +               if (tg_load < scale_load_down(gcfs_rq->tg->shares)) {
> +                       load *= tg_load;
> +                       load /= scale_load_down(gcfs_rq->tg->shares);
> +               }
> +       }
> +
> +       delta = load - se->avg.load_avg;
> +
> +       /* Nothing to update */
> +       if (!delta)
> +               return;
> +
> +       /* Set new sched_entity's load */
> +       se->avg.load_avg = load;
> +       se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
> +
> +       /* Update parent cfs_rq load */
> +       cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg + delta, 0);
> +       cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +}
> +
> +static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
> +{
> +       /* set cfs_rq's flag */
> +       cfs_rq->propagate_avg = 1;
> +}
> +
> +static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se)
> +{
> +       /* Get my cfs_rq */
> +       struct cfs_rq *cfs_rq = group_cfs_rq(se);
> +
> +       /* Nothing to propagate */
> +       if (!cfs_rq->propagate_avg)
> +               return 0;
> +
> +       /* Clear my cfs_rq's flag */
> +       cfs_rq->propagate_avg = 0;
> +
> +       return 1;
> +}
> +
> +/* Update task and its cfs_rq load average */
> +static inline int propagate_entity_load_avg(struct sched_entity *se)
> +{
> +       struct cfs_rq *cfs_rq;
> +
> +       if (entity_is_task(se))
> +               return 0;
> +
> +       if (!test_and_clear_tg_cfs_propagate(se))
> +               return 0;
> +
> +       /* Get parent cfs_rq */
> +       cfs_rq = cfs_rq_of(se);
> +
> +       /* Propagate to parent */
> +       set_tg_cfs_propagate(cfs_rq);
> +
> +       /* Update utilization */
> +       update_tg_cfs_util(cfs_rq, se);
> +
> +       /* Update load */
> +       update_tg_cfs_load(cfs_rq, se);
> +
> +       return 1;
> +}
> +#else
> +static inline int propagate_entity_load_avg(struct sched_entity *se)
> +{
> +       return 0;
> +}
> +
> +static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {}
> +#endif
> +
>  /*
>   * Unsigned subtract and clamp on underflow.
>   *
> @@ -3093,6 +3219,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg,
>         u64 now = cfs_rq_clock_task(cfs_rq);
>         struct rq *rq = rq_of(cfs_rq);
>         int cpu = cpu_of(rq);
> +       int decayed;
>
>         /*
>          * Track task load average for carrying it to new CPU after migrated, and
> @@ -3103,7 +3230,11 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg,
>                           se->on_rq * scale_load_down(se->load.weight),
>                           cfs_rq->curr == se, NULL);
>
> -       if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
> +       decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
> +
> +       decayed |= propagate_entity_load_avg(se);
> +
> +       if (decayed && update_tg)
>                 update_tg_load_avg(cfs_rq, 0);
>  }
>
> @@ -3122,6 +3253,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         cfs_rq->avg.load_sum += se->avg.load_sum;
>         cfs_rq->avg.util_avg += se->avg.util_avg;
>         cfs_rq->avg.util_sum += se->avg.util_sum;
> +       set_tg_cfs_propagate(cfs_rq);
>
>         cfs_rq_util_change(cfs_rq);
>  }
> @@ -3141,6 +3273,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
>         sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
>         sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> +       set_tg_cfs_propagate(cfs_rq);
>
>         cfs_rq_util_change(cfs_rq);
>  }
> @@ -8499,6 +8632,22 @@ static void detach_task_cfs_rq(struct task_struct *p)
>         update_load_avg(se, 0, 0);
>         detach_entity_load_avg(cfs_rq, se);
>         update_tg_load_avg(cfs_rq, false);
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       /*
> +        * Propagate the detach across the tg tree to make it visible to the
> +        * root
> +        */
> +       se = se->parent;
> +       for_each_sched_entity(se) {
> +               cfs_rq = cfs_rq_of(se);
> +
> +               if (cfs_rq_throttled(cfs_rq))
> +                       break;
> +
> +               update_load_avg(se, 1, 0);
> +       }
> +#endif
>  }
>
>  static void attach_entity_cfs_rq(struct sched_entity *se)
> @@ -8517,6 +8666,22 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
>         update_load_avg(se, 0, !sched_feat(ATTACH_AGE_LOAD));
>         attach_entity_load_avg(cfs_rq, se);
>         update_tg_load_avg(cfs_rq, false);
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       /*
> +        * Propagate the attach across the tg tree to make it visible to the
> +        * root
> +        */
> +       se = se->parent;
> +       for_each_sched_entity(se) {
> +               cfs_rq = cfs_rq_of(se);
> +
> +               if (cfs_rq_throttled(cfs_rq))
> +                       break;
> +
> +               update_load_avg(se, 1, 0);
> +       }
> +#endif
>  }
>
>  static void attach_task_cfs_rq(struct task_struct *p)
> @@ -8578,6 +8743,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>         cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>  #endif
>  #ifdef CONFIG_SMP
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       cfs_rq->propagate_avg = 0;
> +#endif
>         atomic_long_set(&cfs_rq->removed_load_avg, 0);
>         atomic_long_set(&cfs_rq->removed_util_avg, 0);
>  #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 483616a..0517a9e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -397,6 +397,7 @@ struct cfs_rq {
>         unsigned long runnable_load_avg;
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         unsigned long tg_load_avg_contrib;
> +       unsigned long propagate_avg;
>  #endif
>         atomic_long_t removed_load_avg, removed_util_avg;
>  #ifndef CONFIG_64BIT
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-16 14:23         ` Vincent Guittot
@ 2016-09-20 11:54           ` Peter Zijlstra
  2016-09-20 13:06             ` Vincent Guittot
  2016-09-20 16:59             ` bsegall
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-20 11:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On Fri, Sep 16, 2016 at 04:23:16PM +0200, Vincent Guittot wrote:
> On 16 September 2016 at 14:16, Peter Zijlstra <peterz@infradead.org> wrote:

> >> > Also, the normalize comment in dequeue_entity() worries me, 'someone'
> >> > didn't update that when he moved update_min_vruntime() around.
> >
> > I now worry more, so we do:
> >
> >         dequeue_task := dequeue_task_fair (p == current)
> >           dequeue_entity
> >             update_curr()
> >               update_min_vruntime()
> >             vruntime -= min_vruntime
> >             update_min_vruntime()
> >               // use cfs_rq->curr, which we just normalized !
> 
> yes but does it really change the cfs_rq->min_vruntime in this case ?

So let me see; it does:

	vruntime = cfs_rq->min_vruntime;

	if (curr) // true
	  vruntime = curr->vruntime; // == vruntime - min_vruntime

	if (leftmost) // possible
	  if (curr) // true
	    vruntime = min_vruntime(vruntime, se->vruntime);
	      if (se->vruntime - (curr->vruntime - min_vruntime)) < 0 // false

	min_vruntime = max_vruntime(min_vruntime, vruntime);
	  if ((curr->vruntime - min_vruntime) - min_vruntime) > 0)


The problem is that double subtraction of min_vruntime can wrap.
The thing is, min_vruntime is the 0-point in our modular space; it
normalizes vruntime (ideally min_vruntime would be our 0-lag point,
resulting in vruntime - min_vruntime being the lag).

The moment min_vruntime grows past S64_MAX/2, -2*min_vruntime wraps into
positive space again, the test above becomes true, and we'll select the
normalized @curr vruntime as the new min_vruntime, and weird stuff will
happen.
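
For reference, the helper behind that last test uses the usual
signed-difference trick (sketch from memory, modulo the exact kernel
version):

	static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
	{
		s64 delta = (s64)(vruntime - max_vruntime);

		if (delta > 0)
			max_vruntime = vruntime;

		return max_vruntime;
	}

The (s64) cast only gives a meaningful ordering while both arguments
live on the same clock; comparing a normalized vruntime against an
absolute min_vruntime is exactly what goes wrong once min_vruntime gets
large enough.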


Also, even if things magically worked out, it's still very icky to mix
the normalized vruntime into things.

> >         put_prev_task := put_prev_task_fair
> >           put_prev_entity
> >             cfs_rq->curr = NULL;
> >
> >
> > Now the point of the latter update_min_vruntime() is to advance
> > min_vruntime when the task we removed was the one holding it back.
> >
> > However, it means that if we do dequeue+enqueue, we're further in the
> > future (ie. we get penalized).
> >
> > So I'm inclined to simply remove the (2nd) update_min_vruntime() call.
> > But as said above, my brain isn't co-operating much today.

OK, so I'm not sure we can actually remove it; we do want it to move
min_vruntime forward (sometimes). We just don't want it to do so when
DEQUEUE_SAVE -- we want to get back to where we left off, nor do we want
to muck about with touching normalized values.

Another fun corner case is DEQUEUE_SLEEP; in that case we do not
normalize, but we still want advance min_vruntime if this was the one
holding it back.

I ended up with the below, but I'm not sure I like it much. Let me prod
a wee bit more to see if there's not something else we can do.

Google has this patch-set replacing min_vruntime with an actual global
0-lag, which greatly simplifies things. If only they'd post it sometime
:/ /me prods pjt and ben with a sharp stick :-)

---
 kernel/sched/fair.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 986c10c25176..77566a340cbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -462,17 +462,23 @@ static inline int entity_before(struct sched_entity *a,
 
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
+	struct sched_entity *curr = cfs_rq->curr;
+
 	u64 vruntime = cfs_rq->min_vruntime;
 
-	if (cfs_rq->curr)
-		vruntime = cfs_rq->curr->vruntime;
+	if (curr) {
+		if (curr->on_rq)
+			vruntime = curr->vruntime;
+		else
+			curr = NULL;
+	}
 
 	if (cfs_rq->rb_leftmost) {
 		struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
 						   struct sched_entity,
 						   run_node);
 
-		if (!cfs_rq->curr)
+		if (!curr)
 			vruntime = se->vruntime;
 		else
 			vruntime = min_vruntime(vruntime, se->vruntime);
@@ -3483,8 +3489,16 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
+
+	/*
+	 * Now advance min_vruntime if @se was the entity holding it back,
+	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
+	 * put back on, and if we advance min_vruntime, we'll be placed back
+	 * further than we started -- ie. we'll be penalized.
+	 */
+	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
+		update_min_vruntime(cfs_rq);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-20 11:54           ` Peter Zijlstra
@ 2016-09-20 13:06             ` Vincent Guittot
  2016-09-22 12:25               ` Peter Zijlstra
  2016-09-20 16:59             ` bsegall
  1 sibling, 1 reply; 41+ messages in thread
From: Vincent Guittot @ 2016-09-20 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On 20 September 2016 at 13:54, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 04:23:16PM +0200, Vincent Guittot wrote:
>> On 16 September 2016 at 14:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> >> > Also, the normalize comment in dequeue_entity() worries me, 'someone'
>> >> > didn't update that when he moved update_min_vruntime() around.
>> >
>> > I now worry more, so we do:
>> >
>> >         dequeue_task := dequeue_task_fair (p == current)
>> >           dequeue_entity
>> >             update_curr()
>> >               update_min_vruntime()
>> >             vruntime -= min_vruntime
>> >             update_min_vruntime()
>> >               // use cfs_rq->curr, which we just normalized !
>>
>> yes but does it really change the cfs_rq->min_vruntime in this case ?
>
> So let me see; it does:
>
>         vruntime = cfs_rq->min_vruntime;
>
>         if (curr) // true
>           vruntime = curr->vruntime; // == vruntime - min_vruntime
>
>         if (leftmost) // possible
>           if (curr) // true
>             vruntime = min_vruntime(vruntime, se->vruntime);
>               if (se->vruntime - (curr->vruntime - min_vruntime)) < 0 // false
>
>         min_vruntime = max_vruntime(min_vruntime, vruntime);
>           if ((curr->vruntime - min_vruntime) - min_vruntime) > 0)
>
>
> The problem is that double subtraction of min_vruntime can wrap.
> The thing is, min_vruntime is the 0-point in our modular space, it
> normalizes vruntime (ideally min_vruntime would be our 0-lag point,
> resulting in vruntime - min_vruntime being the lag).
>
> The moment min_vruntime grows past S64_MAX/2 -2*min_vruntime wraps into

fair enough

> positive space again and the test above becomes true and we'll select
> the normalized @curr vruntime as new min_vruntime and weird stuff will
> happen.
>
>
> Also, even it things magically worked out, its still very icky to mix
> the normalized vruntime into things.

I agree

>
>> >         put_prev_task := put_prev_task_fair
>> >           put_prev_entity
>> >             cfs_rq->curr = NULL;
>> >
>> >
>> > Now the point of the latter update_min_vruntime() is to advance
>> > min_vruntime when the task we removed was the one holding it back.
>> >
>> > However, it means that if we do dequeue+enqueue, we're further in the
>> > future (ie. we get penalized).
>> >
>> > So I'm inclined to simply remove the (2nd) update_min_vruntime() call.
>> > But as said above, my brain isn't co-operating much today.
>
> OK, so not sure we can actually remove it, we do want it to move
> min_vruntime forward (sometimes). We just don't want it to do so when
> DEQUEUE_SAVE -- we want to get back where we left off, nor do we want to
> muck about with touching normalized values.
>
> Another fun corner case is DEQUEUE_SLEEP; in that case we do not
> normalize, but we still want advance min_vruntime if this was the one
> holding it back.
>
> I ended up with the below, but I'm not sure I like it much. Let me prod
> a wee bit more to see if there's not something else we can do.
>
> Google has this patch-set replacing min_vruntime with an actual global
> 0-lag, which greatly simplifies things. If only they'd post it sometime
> :/ /me prods pjt and ben with a sharp stick :-)
>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 986c10c25176..77566a340cbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -462,17 +462,23 @@ static inline int entity_before(struct sched_entity *a,
>
>  static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  {
> +       struct sched_entity *curr = cfs_rq->curr;
> +
>         u64 vruntime = cfs_rq->min_vruntime;
>
> -       if (cfs_rq->curr)
> -               vruntime = cfs_rq->curr->vruntime;
> +       if (curr) {
> +               if (curr->on_rq)
> +                       vruntime = curr->vruntime;
> +               else
> +                       curr = NULL;
> +       }
>
>         if (cfs_rq->rb_leftmost) {
>                 struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
>                                                    struct sched_entity,
>                                                    run_node);
>
> -               if (!cfs_rq->curr)
> +               if (!curr)
>                         vruntime = se->vruntime;
>                 else
>                         vruntime = min_vruntime(vruntime, se->vruntime);
> @@ -3483,8 +3489,16 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>         /* return excess runtime on last dequeue */
>         return_cfs_rq_runtime(cfs_rq);
>
> -       update_min_vruntime(cfs_rq);
>         update_cfs_shares(cfs_rq);
> +
> +       /*
> +        * Now advance min_vruntime if @se was the entity holding it back,
> +        * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
> +        * put back on, and if we advance min_vruntime, we'll be placed back
> +        * further than we started -- ie. we'll be penalized.
> +        */
> +       if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
> +               update_min_vruntime(cfs_rq);
>  }
>
>  /*

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-20 11:54           ` Peter Zijlstra
  2016-09-20 13:06             ` Vincent Guittot
@ 2016-09-20 16:59             ` bsegall
  2016-09-22  8:33               ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: bsegall @ 2016-09-20 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Dietmar Eggemann,
	Paul Turner

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Sep 16, 2016 at 04:23:16PM +0200, Vincent Guittot wrote:
>> On 16 September 2016 at 14:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> >> > Also, the normalize comment in dequeue_entity() worries me, 'someone'
>> >> > didn't update that when he moved update_min_vruntime() around.
>> >
>> > I now worry more, so we do:
>> >
>> >         dequeue_task := dequeue_task_fair (p == current)
>> >           dequeue_entity
>> >             update_curr()
>> >               update_min_vruntime()
>> >             vruntime -= min_vruntime
>> >             update_min_vruntime()
>> >               // use cfs_rq->curr, which we just normalized !
>> 
>> yes but does it really change the cfs_rq->min_vruntime in this case ?
>
> So let me see; it does:
>
> 	vruntime = cfs_rq->min_vruntime;
>
> 	if (curr) // true
> 	  vruntime = curr->vruntime; // == vruntime - min_vruntime
>
> 	if (leftmost) // possible
> 	  if (curr) // true
> 	    vruntime = min_vruntime(vruntime, se->vruntime);
> 	      if (se->vruntime - (curr->vruntime - min_vruntime)) < 0 // false
>
> 	min_vruntime = max_vruntime(min_vruntime, vruntime);
> 	  if ((curr->vruntime - min_vruntime) - min_vruntime) > 0)
>
>
> The problem is that double subtraction of min_vruntime can wrap.
> The thing is, min_vruntime is the 0-point in our modular space, it
> normalizes vruntime (ideally min_vruntime would be our 0-lag point,
> resulting in vruntime - min_vruntime being the lag).
>
> The moment min_vruntime grows past S64_MAX/2 -2*min_vruntime wraps into
> positive space again and the test above becomes true and we'll select
> the normalized @curr vruntime as new min_vruntime and weird stuff will
> happen.
>
>
> Also, even it things magically worked out, its still very icky to mix
> the normalized vruntime into things.
>
>> >         put_prev_task := put_prev_task_fair
>> >           put_prev_entity
>> >             cfs_rq->curr = NULL;
>> >
>> >
>> > Now the point of the latter update_min_vruntime() is to advance
>> > min_vruntime when the task we removed was the one holding it back.
>> >
>> > However, it means that if we do dequeue+enqueue, we're further in the
>> > future (ie. we get penalized).
>> >
>> > So I'm inclined to simply remove the (2nd) update_min_vruntime() call.
>> > But as said above, my brain isn't co-operating much today.
>
> OK, so not sure we can actually remove it, we do want it to move
> min_vruntime forward (sometimes). We just don't want it to do so when
> DEQUEUE_SAVE -- we want to get back where we left off, nor do we want to
> muck about with touching normalized values.
>
> Another fun corner case is DEQUEUE_SLEEP; in that case we do not
> normalize, but we still want advance min_vruntime if this was the one
> holding it back.
>
> I ended up with the below, but I'm not sure I like it much. Let me prod
> a wee bit more to see if there's not something else we can do.
>
> Google has this patch-set replacing min_vruntime with an actual global
> 0-lag, which greatly simplifies things. If only they'd post it sometime
> :/ /me prods pjt and ben with a sharp stick :-)
>

No, we don't have any patches like that. I wish, we've screwed up
vruntime a couple of times too.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
  2016-09-12  7:47 ` [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
@ 2016-09-21 10:14   ` Dietmar Eggemann
  2016-09-21 12:34     ` Vincent Guittot
  0 siblings, 1 reply; 41+ messages in thread
From: Dietmar Eggemann @ 2016-09-21 10:14 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, yuyang.du,
	Morten.Rasmussen
  Cc: linaro-kernel, pjt, bsegall

Hi Vincent,

On 12/09/16 08:47, Vincent Guittot wrote:
> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
> a child will always be called before its parent.
> 
> The hierarchical order in shares update list has been introduced by
> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list")
> 
> With the current implementation a child can be still put after its parent.
> 
> Lets take the example of
>        root
>         \
>          b
>          /\
>          c d*
>            |
>            e*
> 
> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list
> looks like: head -> c -> b -> root -> tail
> 
> The branch d -> e will be added the first time that they are enqueued,
> starting with e then d.
> 
> When e is added, its parents is not already on the list so e is put at the
> tail : head -> c -> b -> root -> e -> tail
> 
> Then, d is added at the head because its parent is already on the list:
> head -> d -> c -> b -> root -> e -> tail
> 
> e is not placed at the right position and will be called the last whereas
> it should be called at the beginning.
> 
> Because it follows the bottom-up enqueue sequence, we are sure that we
> will finished to add either a cfs_rq without parent or a cfs_rq with a parent
> that is already on the list. We can use this event to detect when we have
> finished to add a new branch. For the others, whose parents are not already
> added, we have to ensure that they will be added after their children that
> have just been inserted the steps before, and after any potential parents that
> are already in the list. The easiest way is to put the cfs_rq just after the
> last inserted one and to keep track of it untl the branch is fully added.

[...]

I tested it with a multi-level task group hierarchy (tg_css_id=2,4,6)
and it does the right thing for this testcase: task t runs alternately
on Cpu0 and Cpu1 in the tg w/ tg_css_id=6.

I wonder if this patch is related to the comment "TODO: fix up out-of-order
children on enqueue." in update_shares_cpu() of commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation")?

I guess in the meantime we lost the functionality to remove a cfs_rq from the
leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
owned by Cpu1 stay in the leaf_cfs_rq list.

Shouldn't we reintegrate it? The following patch goes on top of this patch:

-- >8 --

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Tue, 20 Sep 2016 17:30:09 +0100
Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
 update_blocked_averages()

There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
list in case there are no se's enqueued on it. 

The functionality had been initially introduced by commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:
Rewrite runnable load and utilization average tracking").

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd530359ef84..951c83337e2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)
                        update_tg_load_avg(cfs_rq, 0); 
 
                /* Propagate pending load changes to the parent */
-               if (cfs_rq->tg->se[cpu])
+               if (cfs_rq->tg->se[cpu]) {
                        update_load_avg(cfs_rq->tg->se[cpu], 0, 0); 
+
+                       if (!cfs_rq->nr_running)
+                               list_del_leaf_cfs_rq(cfs_rq);
+               }
        }
        raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
  2016-09-21 10:14   ` Dietmar Eggemann
@ 2016-09-21 12:34     ` Vincent Guittot
  2016-09-21 17:25       ` Dietmar Eggemann
  0 siblings, 1 reply; 41+ messages in thread
From: Vincent Guittot @ 2016-09-21 12:34 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Paul Turner,
	Benjamin Segall

Hi Dietmar,

On 21 September 2016 at 12:14, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> Hi Vincent,
>
> On 12/09/16 08:47, Vincent Guittot wrote:
>> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
>> a child will always be called before its parent.
>>
>> The hierarchical order in shares update list has been introduced by
>> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list")
>>
>> With the current implementation a child can be still put after its parent.
>>
>> Lets take the example of
>>        root
>>         \
>>          b
>>          /\
>>          c d*
>>            |
>>            e*
>>
>> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list
>> looks like: head -> c -> b -> root -> tail
>>
>> The branch d -> e will be added the first time that they are enqueued,
>> starting with e then d.
>>
>> When e is added, its parent is not already on the list, so e is put at the
>> tail : head -> c -> b -> root -> e -> tail
>>
>> Then, d is added at the head because its parent is already on the list:
>> head -> d -> c -> b -> root -> e -> tail
>>
>> e is not placed at the right position and will be called last, whereas
>> it should be called at the beginning.
>>
>> Because it follows the bottom-up enqueue sequence, we are sure that we
>> will finish by adding either a cfs_rq without a parent or a cfs_rq whose
>> parent is already on the list. We can use this event to detect when we have
>> finished adding a new branch. For the others, whose parents are not already
>> added, we have to ensure that they will be added after their children that
>> have just been inserted in the previous steps, and after any potential
>> parents that are already in the list. The easiest way is to put the cfs_rq
>> just after the last inserted one and to keep track of it until the branch
>> is fully added.
>
> [...]
>
> I tested it with a multi-level task group hierarchy and it does the
> right thing for this testcase (task t runs alternately on Cpu0 and Cpu1
> in tg w/ tg_css_id=6) in a multi-level task group hierarchy (tg_css_id=2,4,6).

Thanks for testing

>
> I wonder if this patch is related to the comment "TODO: fix up out-of-order
> children on enqueue." in update_shares_cpu() of commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation")?

I didn't have that comment in mind when I wrote this patch; my only
goal was to ensure that the propagation of load and utilization in
children is done before their parents are processed.
That being said, I agree that the comment probably points to the same
ordering issue as the one this patch solves.

>
> I guess in the meantime we lost the functionality to remove a cfs_rq from the
> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
> owned by Cpu1 stay in the leaf_cfs_rq list.
>
> Shouldn't we reintegrate it? Following patch goes on top of this patch:

I see one problem: once a cfs_rq has been removed, it will not be
periodically updated anymore until the next enqueue. A sleeping task
is only attached, not enqueued, when it moves into a cfs_rq, so its
load is added to the cfs_rq's load, which has to be periodically
updated/decayed. So adding a cfs_rq to the list only during an enqueue
is not enough in this case.
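
To illustrate, for a sleeping task moved between task groups the path
is roughly (simplified call chain, from memory):

	sched_move_task(p)                  /* p is sleeping: queued == 0 */
	  sched_change_group(p, TASK_MOVE_GROUP)
	    task_move_group_fair(p)
	      detach_task_cfs_rq(p)         /* load detached from the old cfs_rq */
	      set_task_rq(p, task_cpu(p))
	      attach_task_cfs_rq(p)         /* load attached to the new cfs_rq... */
	  /* ...but no enqueue_task(), hence no list_add_leaf_cfs_rq() either */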

Then, only the deepest child level of the task group hierarchy will be
removed, because cfs_rq->nr_running will be > 0 as soon as a child task
group is created and enqueued into a task group. Or you could use
cfs_rq->h_nr_running instead, but I don't know all the implications.
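
To illustrate the difference, the check would look like this (sketch of
the condition only, untested):

	/* nr_running: entities (tasks and group se's) queued directly on this cfs_rq */
	if (!cfs_rq->nr_running)
		list_del_leaf_cfs_rq(cfs_rq);

	/* h_nr_running: tasks queued anywhere in the hierarchy below this cfs_rq */
	if (!cfs_rq->h_nr_running)
		list_del_leaf_cfs_rq(cfs_rq);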

Regards,
Vincent

>
> -- >8 --
>
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Tue, 20 Sep 2016 17:30:09 +0100
> Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
>  update_blocked_averages()
>
> There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
> list in case there are no se's enqueued on it.
>
> The functionality had been initially introduced by commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:
> Rewrite runnable load and utilization average tracking").
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/fair.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd530359ef84..951c83337e2b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)
>                         update_tg_load_avg(cfs_rq, 0);
>
>                 /* Propagate pending load changes to the parent */
> -               if (cfs_rq->tg->se[cpu])
> +               if (cfs_rq->tg->se[cpu]) {
>                         update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +
> +                       if (!cfs_rq->nr_running)
> +                               list_del_leaf_cfs_rq(cfs_rq);
> +               }
>         }
>         raw_spin_unlock_irqrestore(&rq->lock, flags);
>  }
> --
> 1.9.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
  2016-09-21 12:34     ` Vincent Guittot
@ 2016-09-21 17:25       ` Dietmar Eggemann
  2016-09-21 18:02         ` Vincent Guittot
  0 siblings, 1 reply; 41+ messages in thread
From: Dietmar Eggemann @ 2016-09-21 17:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Paul Turner,
	Benjamin Segall

On 21/09/16 13:34, Vincent Guittot wrote:
> Hi Dietmar,
> 
> On 21 September 2016 at 12:14, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>> Hi Vincent,
>>
>> On 12/09/16 08:47, Vincent Guittot wrote:

[...]

>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
>> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
>> owned by Cpu1 stay in the leaf_cfs_rq list.
>>
>> Shouldn't we reintegrate it? Following patch goes on top of this patch:
> 
> I see one problem: once a cfs_rq has been removed, it will not be
> periodically updated anymore until the next enqueue. A sleeping task
> is only attached, not enqueued, when it moves into a cfs_rq, so its
> load is added to the cfs_rq's load, which has to be periodically
> updated/decayed. So adding a cfs_rq to the list only during an enqueue
> is not enough in this case.

There was more in the original patch (commit 82958366cfea): the removal of the
cfs_rq from the list was only done once se->avg.runnable_avg_sum had
decayed to 0. Couldn't we use something similar with se->avg.load_avg instead,
to make sure that these blocked contributions have been decayed before we do
the removal?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ac1718560e2..3595b0623b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
         * list_add_leaf_cfs_rq() for details.
         */
        for_each_leaf_cfs_rq(rq, cfs_rq) {
+               struct sched_entity *se = cfs_rq->tg->se[cpu];
+
                /* throttled entities do not contribute to load */
                if (throttled_hierarchy(cfs_rq))
                        continue;
@@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
                        update_tg_load_avg(cfs_rq, 0); 
 
                /* Propagate pending load changes to the parent */
-               if (cfs_rq->tg->se[cpu])
-                       update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
+               if (se) {
+                       update_load_avg(se, 0, 0);
+
+                       if (!se->avg.load_avg && !cfs_rq->nr_running)
+                               list_del_leaf_cfs_rq(cfs_rq);
+               }
        }
        raw_spin_unlock_irqrestore(&rq->lock, flags);
 }


> Then, only the deepest child level of the task group hierarchy will be
> removed, because cfs_rq->nr_running will be > 0 as soon as a child task
> group is created and enqueued into a task group. Or you could use
> cfs_rq->h_nr_running instead, but I don't know all the implications.

In my tests all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) get removed. Am I missing something?

    migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 prio=120 orig_cpu=1 dest_cpu=2
    migration/1-16 [001] 67.235295: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097550d700 tg_id=6 on_list=0
    migration/1-16 [001] 67.235298: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800975903700 tg_id=4 on_list=0
    migration/1-16 [001] 67.235300: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800974e0fe00 tg_id=2 on_list=0
    migration/1-16 [001] 67.235302: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097fecb6e0 tg_id=1 on_list=1
    migration/1-16 [001] 67.235309: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
 -> migration/1-16 [001] 67.235312: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
    migration/1-16 [001] 67.235314: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
 -> migration/1-16 [001] 67.235315: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
    migration/1-16 [001] 67.235316: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
 -> migration/1-16 [001] 67.235318: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
    migration/1-16 [001] 67.235319: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097feb56e0 tg_id=1 on_list=1

If we don't remove these cfs_rq's, IMHO they stay in the list as long as the tg exists.

[...]

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
  2016-09-21 17:25       ` Dietmar Eggemann
@ 2016-09-21 18:02         ` Vincent Guittot
  0 siblings, 0 replies; 41+ messages in thread
From: Vincent Guittot @ 2016-09-21 18:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Paul Turner,
	Benjamin Segall

On 21 September 2016 at 19:25, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> On 21/09/16 13:34, Vincent Guittot wrote:
>> Hi Dietmar,
>>
>> On 21 September 2016 at 12:14, Dietmar Eggemann
>> <dietmar.eggemann@arm.com> wrote:
>>> Hi Vincent,
>>>
>>> On 12/09/16 08:47, Vincent Guittot wrote:
>
> [...]
>
>>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
>>> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
>>> owned by Cpu1 stay in the leaf_cfs_rq list.
>>>
>>> Shouldn't we reintegrate it? Following patch goes on top of this patch:
>>
>> I see one problem: once a cfs_rq has been removed, it will not be
>> periodically updated anymore until the next enqueue. A sleeping task
>> is only attached, not enqueued, when it moves into a cfs_rq, so its
>> load is added to the cfs_rq's load, which has to be periodically
>> updated/decayed. So adding a cfs_rq to the list only during an enqueue
>> is not enough in this case.
>
> There was more in the original patch (commit 82958366cfea): the removal of the
> cfs_rq from the list was only done once se->avg.runnable_avg_sum had
> decayed to 0. Couldn't we use something similar with se->avg.load_avg instead,
> to make sure that these blocked contributions have been decayed before we do
> the removal?

Yes, we can use se->avg.load_avg, but that's not enough. Once
se->avg.load_avg becomes null and the group cfs_rq->nr_running == 0, we
remove the cfs_rq from the list and this cfs_rq will not be updated
until a new sched_entity is enqueued. But when you move a sleeping
task between 2 task groups, the load of the task is attached to the
destination cfs_rq but not enqueued, so the cfs_rq->avg.load_avg is no
longer null but it will not be updated during update_blocked_averages()
because the cfs_rq is no longer in the list. So waiting for the enqueue
of the sched_entity to add the cfs_rq back to the list is no longer
enough. You will have to do that during attach too.
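
Something along these lines on top of this series (untested sketch;
attach_entity_cfs_rq() is the helper introduced by patch 1/7):

	static void attach_entity_cfs_rq(struct sched_entity *se)
	{
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		/* ... existing code attaching se's load/util to cfs_rq ... */
		attach_entity_load_avg(cfs_rq, se);

		/*
		 * Make sure the cfs_rq is (back) on the leaf list, so that
		 * the load we have just attached keeps being decayed by
		 * update_blocked_averages() even if no enqueue follows.
		 */
		list_add_leaf_cfs_rq(cfs_rq);
	}

And the parent cfs_rqs, which will receive the propagated load, would
probably need the same treatment.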

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4ac1718560e2..3595b0623b37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
>          * list_add_leaf_cfs_rq() for details.
>          */
>         for_each_leaf_cfs_rq(rq, cfs_rq) {
> +               struct sched_entity *se = cfs_rq->tg->se[cpu];
> +
>                 /* throttled entities do not contribute to load */
>                 if (throttled_hierarchy(cfs_rq))
>                         continue;
> @@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
>                         update_tg_load_avg(cfs_rq, 0);
>
>                 /* Propagate pending load changes to the parent */
> -               if (cfs_rq->tg->se[cpu])
> -                       update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +               if (se) {
> +                       update_load_avg(se, 0, 0);
> +
> +                       if (!se->avg.load_avg && !cfs_rq->nr_running)
> +                               list_del_leaf_cfs_rq(cfs_rq);
> +               }
>         }
>         raw_spin_unlock_irqrestore(&rq->lock, flags);
>  }
>
>
>> Then, only the deepest child level of the task group hierarchy will be
>> removed, because cfs_rq->nr_running will be > 0 as soon as a child task
>> group is created and enqueued into a task group. Or you could use
>> cfs_rq->h_nr_running instead, but I don't know all the implications.
>
> In my tests all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) get removed. Am I missing something?

No, you don't miss anything. I replied a bit too quickly on that
point; the sched_entity of a task group is dequeued when its group
cfs_rq becomes idle, so the parent cfs_rq->nr_running can be null.

>
>     migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 prio=120 orig_cpu=1 dest_cpu=2
>     migration/1-16 [001] 67.235295: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097550d700 tg_id=6 on_list=0
>     migration/1-16 [001] 67.235298: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800975903700 tg_id=4 on_list=0
>     migration/1-16 [001] 67.235300: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff800974e0fe00 tg_id=2 on_list=0
>     migration/1-16 [001] 67.235302: bprint:             enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0xffff80097fecb6e0 tg_id=1 on_list=1
>     migration/1-16 [001] 67.235309: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
>  -> migration/1-16 [001] 67.235312: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff80097550d800 tg_id=6 on_list=1
>     migration/1-16 [001] 67.235314: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
>  -> migration/1-16 [001] 67.235315: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800975903600 tg_id=4 on_list=1
>     migration/1-16 [001] 67.235316: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
>  -> migration/1-16 [001] 67.235318: bprint:             update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0xffff800974e0f600 tg_id=2 on_list=1
>     migration/1-16 [001] 67.235319: bprint:             update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0xffff80097feb56e0 tg_id=1 on_list=1
>
> If we don't remove these cfs_rq's, IMHO they stay in the list as long as the tg exists.

Yes, I agree. The main drawback is probably that
update_blocked_averages() continues to loop over these cfs_rqs during
periodic load balance. We have to compare the cost of adding a branch
of cfs_rqs to the list during the enqueue/attach of a sched_entity
with the cost of uselessly looping over these load-less cfs_rqs during
periodic load balance.

>
> [...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-20 16:59             ` bsegall
@ 2016-09-22  8:33               ` Peter Zijlstra
  2016-09-22 17:10                 ` bsegall
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-22  8:33 UTC (permalink / raw)
  To: bsegall
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Dietmar Eggemann,
	Paul Turner

On Tue, Sep 20, 2016 at 09:59:08AM -0700, bsegall@google.com wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > Google has this patch-set replacing min_vruntime with an actual global
> > 0-lag, which greatly simplifies things. If only they'd post it sometime
> > :/ /me prods pjt and ben with a sharp stick :-)
> >
> 
> No, we don't have any patches like that. I wish, we've screwed up
> vruntime a couple of times too.

Hurm, I was sure you guys were doing something entirely different from
what mainline does.

IIRC Andrew Hunter was 'working' on getting that posted. See also:
lkml.kernel.org/r/CADroS=6Ncpao6WWXBxboB6quqUP96EnjbsDUZqNgASd_PzDGaA@mail.gmail.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-20 13:06             ` Vincent Guittot
@ 2016-09-22 12:25               ` Peter Zijlstra
  2016-09-26 14:53                 ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-22 12:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On Tue, Sep 20, 2016 at 03:06:04PM +0200, Vincent Guittot wrote:
> > Also, even it things magically worked out, its still very icky to mix
> > the normalized vruntime into things.
> 
> I agree

In any case, I pushed out a bunch of patches to:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/tmp

they appear to be able to build a kernel, but that's not much testing.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-22  8:33               ` Peter Zijlstra
@ 2016-09-22 17:10                 ` bsegall
  0 siblings, 0 replies; 41+ messages in thread
From: bsegall @ 2016-09-22 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Linaro Kernel Mailman List, Dietmar Eggemann,
	Paul Turner

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 20, 2016 at 09:59:08AM -0700, bsegall@google.com wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> > Google has this patch-set replacing min_vruntime with an actual global
>> > 0-lag, which greatly simplifies things. If only they'd post it sometime
>> > :/ /me prods pjt and ben with a sharp stick :-)
>> >
>> 
>> No, we don't have any patches like that. I wish, we've screwed up
>> vruntime a couple of times too.
>
> Hurm, I was sure you guys were doing something entirely different from
> what mainline does.
>
> IIRC Andrew Hunter was 'working' on getting that posted. See also:
> lkml.kernel.org/r/CADroS=6Ncpao6WWXBxboB6quqUP96EnjbsDUZqNgASd_PzDGaA@mail.gmail.com

Oh, I was looking at the wrong branches. Sigh.

Yeah, we just do an unlocked "se->vruntime -= old->min_vruntime -
new->min_vruntime;" in migrate, and remove/add it in
switched_from/switched_to_fair.

This still isn't replacing min_vruntime with anything cool; perhaps you
and Andrew were using opposite definitions of "relative" here - we have
fair tasks always having a vruntime that is relative to min_vruntime,
rather than temporarily having the 0-based one when doing
non-SLEEP/WAKEUP.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class
  2016-09-22 12:25               ` Peter Zijlstra
@ 2016-09-26 14:53                 ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-26 14:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Morten Rasmussen,
	Linaro Kernel Mailman List, Dietmar Eggemann, Paul Turner,
	Benjamin Segall

On Thu, Sep 22, 2016 at 02:25:18PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 20, 2016 at 03:06:04PM +0200, Vincent Guittot wrote:
> > > Also, even it things magically worked out, its still very icky to mix
> > > the normalized vruntime into things.
> > 
> > I agree
> 
> In any case, I pushed out a bunch of patches to:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/tmp
> 
> they appear to be able to build a kernel, but that's not much testing.

Hurmph, it's buggered for cgroups. When we get the newidle lockbreak
between dequeue and put_prev_task, the load-balance pull can add a task
to the 'current' cgroup.

Which suggests we cannot do that enqueue_entity() optimization, unless
we fix that dequeue+put atomicity muck.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [tip:sched/core] sched/core: Fix incorrect utilization accounting when switching to fair class
  2016-09-12  7:47 ` [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class Vincent Guittot
  2016-09-15 13:18   ` Peter Zijlstra
  2016-09-16 10:51   ` Peter Zijlstra
@ 2016-09-30 12:01   ` tip-bot for Vincent Guittot
  2 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-09-30 12:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, vincent.guittot, efault, peterz, linux-kernel,
	mingo, tglx

Commit-ID:  a399d233078edbba7cf7902a6d080100cdf75636
Gitweb:     http://git.kernel.org/tip/a399d233078edbba7cf7902a6d080100cdf75636
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Mon, 12 Sep 2016 09:47:52 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 30 Sep 2016 11:03:27 +0200

sched/core: Fix incorrect utilization accounting when switching to fair class

When a task switches to the fair scheduling class, the period between now
and the last update of its utilization is accounted as running time,
regardless of what happened during this period. This incorrect accounting
applies to the task and also to the task group branch.

When changing the property of a running task like its list of allowed
CPUs or its scheduling class, we follow the sequence:

 - dequeue task
 - put task
 - change the property
 - set task as current task
 - enqueue task

The end of the sequence doesn't follow the normal sequence (as per
__schedule()) which is:

 - enqueue a task
 - then set the task as current task.

This incorrect ordering is the root cause of the incorrect utilization accounting.
Update the sequence to follow the right one:

 - dequeue task
 - put task
 - change the property
 - enqueue task
 - set task as current task

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: linaro-kernel@lists.linaro.org
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1473666472-13749-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9411545..2b5e150 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1109,10 +1109,10 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 
 	p->sched_class->set_cpus_allowed(p, new_mask);
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 }
 
 /*
@@ -3707,10 +3707,10 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
 	p->prio = prio;
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -4263,8 +4263,6 @@ change:
 	prev_class = p->sched_class;
 	__setscheduler(rq, p, attr, pi);
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued) {
 		/*
 		 * We enqueue to tail when the priority of a task is
@@ -4275,6 +4273,8 @@ change:
 
 		enqueue_task(rq, p, queue_flags);
 	}
+	if (running)
+		p->sched_class->set_curr_task(rq);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 	preempt_disable(); /* avoid rq from going away on us */
@@ -5439,10 +5439,10 @@ void sched_setnuma(struct task_struct *p, int nid)
 
 	p->numa_preferred_nid = nid;
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -7949,10 +7949,10 @@ void sched_move_task(struct task_struct *tsk)
 
 	sched_change_group(tsk, TASK_MOVE_GROUP);
 
-	if (unlikely(running))
-		tsk->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE);
+	if (unlikely(running))
+		tsk->sched_class->set_curr_task(rq);
 
 	task_rq_unlock(rq, tsk, &rf);
 }

^ permalink raw reply related	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2016-09-30 12:02 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-12  7:47 [PATCH 0/7 v3] sched: reflect sched_entity move into task_group's load Vincent Guittot
2016-09-12  7:47 ` [PATCH 1/7 v3] sched: factorize attach entity Vincent Guittot
2016-09-12  7:47 ` [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
2016-09-21 10:14   ` Dietmar Eggemann
2016-09-21 12:34     ` Vincent Guittot
2016-09-21 17:25       ` Dietmar Eggemann
2016-09-21 18:02         ` Vincent Guittot
2016-09-12  7:47 ` [PATCH 3/7 v3] sched: factorize PELT update Vincent Guittot
2016-09-15 13:09   ` Peter Zijlstra
2016-09-15 13:30     ` Vincent Guittot
2016-09-12  7:47 ` [PATCH 4/7 v3] sched: propagate load during synchronous attach/detach Vincent Guittot
2016-09-15 12:55   ` Peter Zijlstra
2016-09-15 13:01     ` Vincent Guittot
2016-09-15 12:59   ` Peter Zijlstra
2016-09-15 13:11     ` Vincent Guittot
2016-09-15 13:11   ` Dietmar Eggemann
2016-09-15 14:31     ` Vincent Guittot
2016-09-15 17:20       ` Dietmar Eggemann
2016-09-15 15:14     ` Peter Zijlstra
2016-09-15 17:36       ` Dietmar Eggemann
2016-09-15 17:54         ` Peter Zijlstra
2016-09-15 14:43   ` Peter Zijlstra
2016-09-15 14:51     ` Vincent Guittot
2016-09-19  3:19   ` Wanpeng Li
2016-09-12  7:47 ` [PATCH 5/7 v3] sched: propagate asynchrous detach Vincent Guittot
2016-09-12  7:47 ` [PATCH 6/7 v3] sched: fix task group initialization Vincent Guittot
2016-09-12  7:47 ` [PATCH 7/7 v3] sched: fix wrong utilization accounting when switching to fair class Vincent Guittot
2016-09-15 13:18   ` Peter Zijlstra
2016-09-15 15:36     ` Vincent Guittot
2016-09-16 12:16       ` Peter Zijlstra
2016-09-16 14:23         ` Vincent Guittot
2016-09-20 11:54           ` Peter Zijlstra
2016-09-20 13:06             ` Vincent Guittot
2016-09-22 12:25               ` Peter Zijlstra
2016-09-26 14:53                 ` Peter Zijlstra
2016-09-20 16:59             ` bsegall
2016-09-22  8:33               ` Peter Zijlstra
2016-09-22 17:10                 ` bsegall
2016-09-16 10:51   ` Peter Zijlstra
2016-09-16 12:45     ` Vincent Guittot
2016-09-30 12:01   ` [tip:sched/core] sched/core: Fix incorrect " tip-bot for Vincent Guittot
