linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load
@ 2016-11-08  9:53 Vincent Guittot
  2016-11-08  9:53 ` [PATCH 1/6 v7] sched: factorize attach/detach entity Vincent Guittot
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Vincent Guittot @ 2016-11-08  9:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, dietmar.eggemann
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp, Vincent Guittot

Ensure that the move of a sched_entity will be reflected in load and
utilization of the task_group hierarchy.

When a sched_entity moves between groups or CPUs, load and utilization
of cfs_rq don't reflect the changes immediately but converge to new values.
As a result, the metrics are no longer aligned with the new balance of the
load in the system and next decisions will have a biased view.

This patchset synchronizes the load/utilization of a sched_entity with its
child cfs_rq (se->my_q) only when tasks move to/from the child cfs_rq:
- move between task groups
- migration between CPUs
Otherwise, PELT is updated as usual.
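
To make this concrete, here is a minimal user-space model (purely
illustrative: struct cfs_rq_model and propagate_delta() are made-up names,
and it ignores shares scaling, PELT decay and the propagate_avg flag
introduced later in the series) of how a load delta applied to a child
cfs_rq is meant to become immediately visible at every parent level:

#include <stdio.h>

struct cfs_rq_model {
	long load_avg;			/* stands in for cfs_rq->avg.load_avg */
	struct cfs_rq_model *parent;	/* parent level, NULL for the root cfs_rq */
};

/*
 * Walk up the hierarchy and apply the delta at each level, which is the
 * end result the series achieves through the propagation machinery.
 */
static void propagate_delta(struct cfs_rq_model *cfs_rq, long delta)
{
	for (; cfs_rq; cfs_rq = cfs_rq->parent)
		cfs_rq->load_avg += delta;
}

int main(void)
{
	struct cfs_rq_model root = { 0, NULL };
	struct cfs_rq_model tg1  = { 0, &root };
	struct cfs_rq_model tg11 = { 0, &tg1 };

	/* a task with load_avg 1024 is attached to tg11 on this CPU */
	propagate_delta(&tg11, 1024);

	printf("tg11=%ld tg1=%ld root=%ld\n",
	       tg11.load_avg, tg1.load_avg, root.load_avg);
	return 0;
}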

This version doesn't include any changes related to the discussions that
started during the review of the previous version about:
- encapsulating the sequence for changing the properties of a task
- removing a cfs_rq from the list during update_blocked_averages
These topics don't gain anything from being added to this patchset as they
are fairly independent and deserve separate patches.

Changes since v6:
-fix warning and error raised by lkp

Changes since v5:
- factorize detach entity like for attach
- fix add_positive
- Fixed a few coding style issues

Changes since v4:
- minor typo and commit message changes
- move call to cfs_rq_clock_task(cfs_rq) in post_init_entity_util_avg

Changes since v3:
- Replaced the 2 arguments of update_load_avg by 1 flags argument
- Propagated move in runnable_load_avg when sched_entity is already on_rq
- Ensure that intermediate value will not reach memory when updating load and
  utilization
- Optimize the calculation of load_avg of the sched_entity
- Fixed some typos

Changes since v2:
- Propagate both utilization and load
- Synced sched_entity and se->my_q instead of adding the delta

Changes since v1:
- This patch needs the patch that fixes the issue with rq->leaf_cfs_rq_list
  "sched: fix hierarchical order in rq->leaf_cfs_rq_list" in order to work
  correctly. I haven't sent them as a single patchset because the fix is
  independent of this one
- Merge some functions that are always used together
- During update of blocked load, ensure that the sched_entity is synced
  with the cfs_rq applying changes
- Fix an issue when a task changes its cpu affinity

Vincent Guittot (6):
  sched: factorize attach/detach entity
  sched: fix hierarchical order in rq->leaf_cfs_rq_list
  sched: factorize PELT update
  sched: propagate load during synchronous attach/detach
  sched: propagate asynchronous detach
  sched: fix task group initialization

 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 395 ++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h |   2 +
 3 files changed, 318 insertions(+), 80 deletions(-)

-- 
2.7.4


* [PATCH 1/6 v7] sched: factorize attach/detach entity
  2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
@ 2016-11-08  9:53 ` Vincent Guittot
  2016-11-16 12:15   ` [tip:sched/core] sched/fair: Factorize " tip-bot for Vincent Guittot
  2016-11-08  9:53 ` [PATCH 2/6 v7] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Vincent Guittot @ 2016-11-08  9:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, dietmar.eggemann
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp, Vincent Guittot

Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
in one function, attach_entity_cfs_rq().
Create a symmetric detach_entity_cfs_rq() function.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c242944..b27cac0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -708,9 +708,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 }
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
-static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
-static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
+static void attach_entity_cfs_rq(struct sched_entity *se);
 
 /*
  * With new tasks being created, their initial util_avgs are extrapolated
@@ -742,7 +740,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
 	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -770,14 +767,12 @@ void post_init_entity_util_avg(struct sched_entity *se)
 			 * such that the next switched_to_fair() has the
 			 * expected state.
 			 */
-			se->avg.last_update_time = now;
+			se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
 			return;
 		}
 	}
 
-	update_cfs_rq_load_avg(now, cfs_rq, false);
-	attach_entity_load_avg(cfs_rq, se);
-	update_tg_load_avg(cfs_rq, false);
+	attach_entity_cfs_rq(se);
 }
 
 #else /* !CONFIG_SMP */
@@ -8687,30 +8682,19 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	return false;
 }
 
-static void detach_task_cfs_rq(struct task_struct *p)
+static void detach_entity_cfs_rq(struct sched_entity *se)
 {
-	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
 
-	if (!vruntime_normalized(p)) {
-		/*
-		 * Fix up our vruntime so that the current sleep doesn't
-		 * cause 'unlimited' sleep bonus.
-		 */
-		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
-	}
-
 	/* Catch up with the cfs_rq and remove our load when we leave */
 	update_cfs_rq_load_avg(now, cfs_rq, false);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }
 
-static void attach_task_cfs_rq(struct task_struct *p)
+static void attach_entity_cfs_rq(struct sched_entity *se)
 {
-	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
 
@@ -8722,10 +8706,35 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	se->depth = se->parent ? se->parent->depth + 1 : 0;
 #endif
 
-	/* Synchronize task with its cfs_rq */
+	/* Synchronize entity with its cfs_rq */
 	update_cfs_rq_load_avg(now, cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+}
+
+static void detach_task_cfs_rq(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (!vruntime_normalized(p)) {
+		/*
+		 * Fix up our vruntime so that the current sleep doesn't
+		 * cause 'unlimited' sleep bonus.
+		 */
+		place_entity(cfs_rq, se, 0);
+		se->vruntime -= cfs_rq->min_vruntime;
+	}
+
+	detach_entity_cfs_rq(se);
+}
+
+static void attach_task_cfs_rq(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	attach_entity_cfs_rq(se);
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
-- 
2.7.4


* [PATCH 2/6 v7] sched: fix hierarchical order in rq->leaf_cfs_rq_list
  2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
  2016-11-08  9:53 ` [PATCH 1/6 v7] sched: factorize attach/detach entity Vincent Guittot
@ 2016-11-08  9:53 ` Vincent Guittot
  2016-11-16 12:15   ` [tip:sched/core] sched/fair: Fix " tip-bot for Vincent Guittot
  2016-11-08  9:53 ` [PATCH 3/6 v7] sched: factorize PELT update Vincent Guittot
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Vincent Guittot @ 2016-11-08  9:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, dietmar.eggemann
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp, Vincent Guittot

Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
a child will always be called before its parent.

The hierarchical order of the shares update list was introduced by
commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list").

With the current implementation, a child can still be put after its parent.

Let's take the example of
       root
         \
          b
         / \
        c   d*
            |
            e*

with root -> b -> c already enqueued but not d -> e, so the leaf_cfs_rq_list
looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parent is not yet on the list, so e is put at the
tail: head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the list:
head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called last, whereas
it should be called first.

Because the enqueue sequence is bottom-up, we are sure that a branch always
ends with either a cfs_rq without a parent or a cfs_rq whose parent is
already on the list. We can use this event to detect when we have finished
adding a new branch. For the others, whose parents have not been added yet,
we have to ensure that they will be added after the children that have just
been inserted in the previous steps, and after any parents that are already
on the list. The easiest way is to put each cfs_rq just after the last
inserted one and to keep track of it until the branch is fully added.
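
These insertion rules can be checked with the small stand-alone model below
(illustrative only: the simplified list_add()/list_add_tail() helpers mimic
the insertion semantics of list_add_rcu()/list_add_tail_rcu(), and
list_add_leaf_cfs_rq() here is a user-space rewrite of the kernel function).
It replays the example above and prints the resulting order, with every
child ahead of its parent:

#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* simplified non-RCU versions of the kernel list helpers */
static void __list_add(struct list_head *new, struct list_head *prev,
		       struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/* insert 'new' right after 'head', like list_add_rcu() */
static void list_add(struct list_head *new, struct list_head *head)
{
	__list_add(new, head, head->next);
}

/* insert 'new' right before 'head', like list_add_tail_rcu() */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
	__list_add(new, head->prev, head);
}

struct cfs_rq {
	const char *name;
	struct cfs_rq *parent;
	int on_list;
	struct list_head leaf;
};

static struct list_head leaf_cfs_rq_list = { &leaf_cfs_rq_list, &leaf_cfs_rq_list };
static struct list_head *tmp_alone_branch = &leaf_cfs_rq_list;

static void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
	if (cfs_rq->on_list)
		return;

	if (cfs_rq->parent && cfs_rq->parent->on_list) {
		/* parent already on the list: child goes just before it */
		list_add_tail(&cfs_rq->leaf, &cfs_rq->parent->leaf);
		tmp_alone_branch = &leaf_cfs_rq_list;
	} else if (!cfs_rq->parent) {
		/* cfs_rq without parent goes at the tail of the list */
		list_add_tail(&cfs_rq->leaf, &leaf_cfs_rq_list);
		tmp_alone_branch = &leaf_cfs_rq_list;
	} else {
		/* parent not added yet: keep the whole branch grouped */
		list_add(&cfs_rq->leaf, tmp_alone_branch);
		tmp_alone_branch = &cfs_rq->leaf;
	}
	cfs_rq->on_list = 1;
}

int main(void)
{
	struct cfs_rq root = { "root", NULL }, b = { "b", &root },
		      c = { "c", &b }, d = { "d", &b }, e = { "e", &d };
	struct list_head *pos;

	/* root -> b -> c already enqueued (always bottom-up) ... */
	list_add_leaf_cfs_rq(&c);
	list_add_leaf_cfs_rq(&b);
	list_add_leaf_cfs_rq(&root);
	/* ... then the d -> e branch is enqueued: e first, then d */
	list_add_leaf_cfs_rq(&e);
	list_add_leaf_cfs_rq(&d);

	printf("head");
	for (pos = leaf_cfs_rq_list.next; pos != &leaf_cfs_rq_list; pos = pos->next) {
		struct cfs_rq *it = (void *)((char *)pos - offsetof(struct cfs_rq, leaf));
		printf(" -> %s", it->name);
	}
	printf(" -> tail\n");	/* head -> e -> c -> d -> b -> root -> tail */
	return 0;
}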

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 54 +++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  1 +
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 154fd68..5c9d59b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7602,6 +7602,7 @@ void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
 		/*
 		 * How much cpu bandwidth does root_task_group get?
 		 *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b27cac0..bc5949d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -290,19 +290,59 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_rq->on_list) {
+		struct rq *rq = rq_of(cfs_rq);
+		int cpu = cpu_of(rq);
 		/*
 		 * Ensure we either appear before our parent (if already
 		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued.  The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * enqueued. The fact that we always enqueue bottom-up
+		 * reduces this to two cases and a special case for the root
+		 * cfs_rq. Furthermore, it also means that we will always reset
+		 * tmp_alone_branch either when the branch is connected
+		 * to a tree or when we reach the beg of the tree
 		 */
 		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
+		    cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
+			/*
+			 * If parent is already on the list, we add the child
+			 * just before. Thanks to circular linked property of
+			 * the list, this means to put the child at the tail
+			 * of the list that starts by parent.
+			 */
+			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
+			/*
+			 * The branch is now connected to its tree so we can
+			 * reset tmp_alone_branch to the beginning of the
+			 * list.
+			 */
+			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+		} else if (!cfs_rq->tg->parent) {
+			/*
+			 * cfs rq without parent should be put
+			 * at the tail of the list.
+			 */
 			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+				&rq->leaf_cfs_rq_list);
+			/*
+			 * We have reach the beg of a tree so we can reset
+			 * tmp_alone_branch to the beginning of the list.
+			 */
+			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+		} else {
+			/*
+			 * The parent has not already been added so we want to
+			 * make sure that it will be put after us.
+			 * tmp_alone_branch points to the beg of the branch
+			 * where we will add parent.
+			 */
+			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				rq->tmp_alone_branch);
+			/*
+			 * update tmp_alone_branch to points to the new beg
+			 * of the branch
+			 */
+			rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
 		}
 
 		cfs_rq->on_list = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..2646244 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -623,6 +623,7 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
+	struct list_head *tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
-- 
2.7.4


* [PATCH 3/6 v7] sched: factorize PELT update
  2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
  2016-11-08  9:53 ` [PATCH 1/6 v7] sched: factorize attach/detach entity Vincent Guittot
  2016-11-08  9:53 ` [PATCH 2/6 v7] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
@ 2016-11-08  9:53 ` Vincent Guittot
  2016-11-16 12:16   ` [tip:sched/core] sched/fair: Factorize " tip-bot for Vincent Guittot
  2016-11-08  9:53 ` [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach Vincent Guittot
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Vincent Guittot @ 2016-11-08  9:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, dietmar.eggemann
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp, Vincent Guittot

Every time we modify the load/utilization of a sched_entity, we start by
syncing it with its cfs_rq. This update is done in different ways:
- when attaching/detaching a sched_entity, we update the cfs_rq and then we
sync the entity with the cfs_rq.
- when enqueueing/dequeuing the sched_entity, we update both the sched_entity
and the cfs_rq metrics to now.

Use update_load_avg() every time we have to update and sync the cfs_rq and
the sched_entity before changing the state of a sched_entity.
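
As a rough illustration of the flag semantics (update_load_avg_mock() below
is a hypothetical user-space stand-in, not the kernel function; in the real
code the task group update additionally depends on whether the cfs_rq signal
actually changed):

#include <stdio.h>

#define UPDATE_TG	0x1
#define SKIP_AGE_LOAD	0x2

/* made-up mock that only prints which steps would run for a given call */
static void update_load_avg_mock(const char *site, int last_update_time, int flags)
{
	printf("%s:\n", site);
	if (last_update_time && !(flags & SKIP_AGE_LOAD))
		printf("  age the sched_entity signal\n");
	printf("  update the cfs_rq signal\n");
	if (flags & UPDATE_TG)
		printf("  update the task group contribution (if the cfs_rq changed)\n");
}

int main(void)
{
	update_load_avg_mock("enqueue/dequeue/tick", 1, UPDATE_TG);
	update_load_avg_mock("detach_entity_cfs_rq", 1, 0);
	update_load_avg_mock("attach, !ATTACH_AGE_LOAD", 1, SKIP_AGE_LOAD);
	return 0;
}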

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 76 ++++++++++++++++++-----------------------------------
 1 file changed, 25 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc5949d..f18e42e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3099,8 +3099,14 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	return decayed || removed_load;
 }
 
+/*
+ * Optional action to be done while updating the load average
+ */
+#define UPDATE_TG	0x1
+#define SKIP_AGE_LOAD	0x2
+
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_load_avg(struct sched_entity *se, int flags)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -3111,11 +3117,13 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
+	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
+		__update_load_avg(now, cpu, &se->avg,
 			  se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
+	}
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
+	if (update_cfs_rq_load_avg(now, cfs_rq, true) && (flags & UPDATE_TG))
 		update_tg_load_avg(cfs_rq, 0);
 }
 
@@ -3129,26 +3137,6 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
  */
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (!sched_feat(ATTACH_AGE_LOAD))
-		goto skip_aging;
-
-	/*
-	 * If we got migrated (either between CPUs or between cgroups) we'll
-	 * have aged the average right before clearing @last_update_time.
-	 *
-	 * Or we're fresh through post_init_entity_util_avg().
-	 */
-	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-				  &se->avg, 0, 0, NULL);
-
-		/*
-		 * XXX: we could have just aged the entire load away if we've been
-		 * absent from the fair class for too long.
-		 */
-	}
-
-skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -3168,9 +3156,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
  */
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
@@ -3185,34 +3170,20 @@ static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	int migrated, decayed;
-
-	migrated = !sa->last_update_time;
-	if (!migrated) {
-		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
-			cfs_rq->curr == se, NULL);
-	}
-
-	decayed = update_cfs_rq_load_avg(now, cfs_rq, !migrated);
 
 	cfs_rq->runnable_load_avg += sa->load_avg;
 	cfs_rq->runnable_load_sum += sa->load_sum;
 
-	if (migrated)
+	if (!sa->last_update_time) {
 		attach_entity_load_avg(cfs_rq, se);
-
-	if (decayed || migrated)
 		update_tg_load_avg(cfs_rq, 0);
+	}
 }
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
-
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
 	cfs_rq->runnable_load_sum =
@@ -3286,7 +3257,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	return 0;
 }
 
-static inline void update_load_avg(struct sched_entity *se, int not_used)
+#define UPDATE_TG	0x0
+#define SKIP_AGE_LOAD	0x0
+
+static inline void update_load_avg(struct sched_entity *se, int not_used1)
 {
 	cpufreq_update_util(rq_of(cfs_rq_of(se)), 0);
 }
@@ -3431,6 +3405,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
+	update_load_avg(se, UPDATE_TG);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -3505,6 +3480,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_load_avg(se, UPDATE_TG);
 	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
@@ -3592,7 +3568,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_load_avg(se, UPDATE_TG);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3710,7 +3686,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_load_avg(curr, UPDATE_TG);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4607,7 +4583,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, UPDATE_TG);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4666,7 +4642,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, UPDATE_TG);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -8725,10 +8701,9 @@ static inline bool vruntime_normalized(struct task_struct *p)
 static void detach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	update_load_avg(se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }
@@ -8736,7 +8711,6 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
 static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -8747,7 +8721,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 #endif
 
 	/* Synchronize entity with its cfs_rq */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }
-- 
2.7.4


* [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach
  2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (2 preceding siblings ...)
  2016-11-08  9:53 ` [PATCH 3/6 v7] sched: factorize PELT update Vincent Guittot
@ 2016-11-08  9:53 ` Vincent Guittot
  2016-11-09 15:03   ` Peter Zijlstra
  2016-11-16 12:16   ` [tip:sched/core] sched/fair: Propagate " tip-bot for Vincent Guittot
  2016-11-08  9:53 ` [PATCH 5/6 v7] sched: propagate asynchronous detach Vincent Guittot
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Vincent Guittot @ 2016-11-08  9:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, dietmar.eggemann
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp, Vincent Guittot

When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at the parent level (sched_entity and cfs_rq) during
the next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its shares, so that it
will contribute the same load as a task of equal weight.
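
As a made-up numeric illustration of this correction factor (the values
below are hypothetical; the real computation lives in update_tg_cfs_load()
in the hunk below): a group with tg->shares = 1024 that is fully busy on two
CPUs has a total load of about 2048, so the group sched_entity of each CPU
should only contribute about half of the shares:

#include <stdio.h>

int main(void)
{
	/* hypothetical numbers, not taken from the patch */
	long shares  = 1024;	/* scale_load_down(tg->shares) */
	long tg_load = 2048;	/* whole task group, busy on two CPUs */
	long load    = 1024;	/* this CPU's group cfs_rq load_avg */

	/*
	 * Same correction as update_tg_cfs_load(): the group sched_entity
	 * cannot weigh more than its portion of tg->shares.
	 */
	if (tg_load > shares) {
		load *= shares;
		load /= tg_load;
	}

	printf("group se load_avg = %ld\n", load);	/* 512 */
	return 0;
}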

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   1 +
 2 files changed, 208 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f18e42e..e47bb046 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3032,6 +3032,165 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 }
 
 /*
+ * Signed add and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define add_positive(_ptr, _val) do {                           \
+	typeof(_ptr) ptr = (_ptr);                              \
+	typeof(_val) val = (_val);                              \
+	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
+								\
+	res = var + val;                                        \
+								\
+	if (val < 0 && res > var)                               \
+		res = 0;                                        \
+								\
+	WRITE_ONCE(*ptr, res);                                  \
+} while (0)
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Take into account change of utilization of a child task group */
+static inline void
+update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct cfs_rq *gcfs_rq =  group_cfs_rq(se);
+	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/* Set new sched_entity's utilization */
+	se->avg.util_avg = gcfs_rq->avg.util_avg;
+	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq utilization */
+	add_positive(&cfs_rq->avg.util_avg, delta);
+	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
+}
+
+/* Take into account change of load of a child task group */
+static inline void
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+	long delta, load = gcfs_rq->avg.load_avg;
+
+	/*
+	 * If the load of group cfs_rq is null, the load of the
+	 * sched_entity will also be null so we can skip the formula
+	 */
+	if (load) {
+		long tg_load;
+
+		/* Get tg's load and ensure tg_load > 0 */
+		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
+
+		/* Ensure tg_load >= load and updated with current load*/
+		tg_load -= gcfs_rq->tg_load_avg_contrib;
+		tg_load += load;
+
+		/*
+		 * We need to compute a correction term in the case that the
+		 * task group is consuming more cpu than a task of equal
+		 * weight. A task with a weight equals to tg->shares will have
+		 * a load less or equal to scale_load_down(tg->shares).
+		 * Similarly, the sched_entities that represent the task group
+		 * at parent level, can't have a load higher than
+		 * scale_load_down(tg->shares). And the Sum of sched_entities'
+		 * load must be <= scale_load_down(tg->shares).
+		 */
+		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
+			/* scale gcfs_rq's load into tg's shares*/
+			load *= scale_load_down(gcfs_rq->tg->shares);
+			load /= tg_load;
+		}
+	}
+
+	delta = load - se->avg.load_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/* Set new sched_entity's load */
+	se->avg.load_avg = load;
+	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq load */
+	add_positive(&cfs_rq->avg.load_avg, delta);
+	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+
+	/*
+	 * If the sched_entity is already enqueued, we also have to update the
+	 * runnable load avg.
+	 */
+	if (se->on_rq) {
+		/* Update parent cfs_rq runnable_load_avg */
+		add_positive(&cfs_rq->runnable_load_avg, delta);
+		cfs_rq->runnable_load_sum = cfs_rq->runnable_load_avg * LOAD_AVG_MAX;
+	}
+}
+
+static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
+{
+	/* set cfs_rq's flag */
+	cfs_rq->propagate_avg = 1;
+}
+
+static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se)
+{
+	/* Get my cfs_rq */
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
+
+	/* Nothing to propagate */
+	if (!cfs_rq->propagate_avg)
+		return 0;
+
+	/* Clear my cfs_rq's flag */
+	cfs_rq->propagate_avg = 0;
+
+	return 1;
+}
+
+/* Update task and its cfs_rq load average */
+static inline int propagate_entity_load_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	if (entity_is_task(se))
+		return 0;
+
+	if (!test_and_clear_tg_cfs_propagate(se))
+		return 0;
+
+	/* Get parent cfs_rq */
+	cfs_rq = cfs_rq_of(se);
+
+	/* Propagate to parent */
+	set_tg_cfs_propagate(cfs_rq);
+
+	/* Update utilization */
+	update_tg_cfs_util(cfs_rq, se);
+
+	/* Update load */
+	update_tg_cfs_load(cfs_rq, se);
+
+	return 1;
+}
+#else
+static inline int propagate_entity_load_avg(struct sched_entity *se)
+{
+	return 0;
+}
+
+static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {}
+#endif
+
+/*
  * Unsigned subtract and clamp on underflow.
  *
  * Explicitly do a load-store to ensure the intermediate value never hits
@@ -3112,6 +3271,7 @@ static inline void update_load_avg(struct sched_entity *se, int flags)
 	u64 now = cfs_rq_clock_task(cfs_rq);
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
+	int decayed;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -3123,7 +3283,11 @@ static inline void update_load_avg(struct sched_entity *se, int flags)
 			  cfs_rq->curr == se, NULL);
 	}
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && (flags & UPDATE_TG))
+	decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
+
+	decayed |= propagate_entity_load_avg(se);
+
+	if (decayed && (flags & UPDATE_TG))
 		update_tg_load_avg(cfs_rq, 0);
 }
 
@@ -3142,6 +3306,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	cfs_rq->avg.load_sum += se->avg.load_sum;
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	set_tg_cfs_propagate(cfs_rq);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3161,6 +3326,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	set_tg_cfs_propagate(cfs_rq);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -8698,6 +8864,31 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	return false;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * Propagate the changes of the sched_entity across the tg tree to make it
+ * visible to the root
+ */
+static void propagate_entity_cfs_rq(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	/* Start to propagate at parent */
+	se = se->parent;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, UPDATE_TG);
+	}
+}
+#else
+static void propagate_entity_cfs_rq(struct sched_entity *se) { }
+#endif
+
 static void detach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -8706,6 +8897,12 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
 	update_load_avg(se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+
+	/*
+	 * Propagate the detach across the tg tree to make it visible to the
+	 * root
+	 */
+	propagate_entity_cfs_rq(se);
 }
 
 static void attach_entity_cfs_rq(struct sched_entity *se)
@@ -8724,6 +8921,12 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+
+	/*
+	 * Propagate the attach across the tg tree to make it visible to the
+	 * root
+	 */
+	propagate_entity_cfs_rq(se);
 }
 
 static void detach_task_cfs_rq(struct task_struct *p)
@@ -8802,6 +9005,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
 #ifdef CONFIG_SMP
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	cfs_rq->propagate_avg = 0;
+#endif
 	atomic_long_set(&cfs_rq->removed_load_avg, 0);
 	atomic_long_set(&cfs_rq->removed_util_avg, 0);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2646244..a9c7527 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -404,6 +404,7 @@ struct cfs_rq {
 	unsigned long runnable_load_avg;
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
+	unsigned long propagate_avg;
 #endif
 	atomic_long_t removed_load_avg, removed_util_avg;
 #ifndef CONFIG_64BIT
-- 
2.7.4


* [PATCH 5/6 v7] sched: propagate asynchronous detach
  2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (3 preceding siblings ...)
  2016-11-08  9:53 ` [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach Vincent Guittot
@ 2016-11-08  9:53 ` Vincent Guittot
  2016-11-16 12:17   ` [tip:sched/core] sched/fair: Propagate " tip-bot for Vincent Guittot
  2016-11-08  9:53 ` [PATCH 6/6 v7] sched: fix task group initialization Vincent Guittot
  2016-11-10 17:04 ` [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Dietmar Eggemann
  6 siblings, 1 reply; 17+ messages in thread
From: Vincent Guittot @ 2016-11-08  9:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, dietmar.eggemann
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp, Vincent Guittot

A task can be asynchronously detached from a cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from the
source cfs_rq during its next update. We use this event to set the
propagation flag.

During load balance, we take advantage of the update of blocked load
to propagate any pending changes. The propagation relies on the patch
"sched: fix hierarchical order in rq->leaf_cfs_rq_list", which orders
children before parents, to ensure that it is done in one pass.
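
A minimal user-space sketch of why a single pass is enough (hypothetical
model: struct grp and update_blocked_averages_model() are made-up stand-ins
for the real walk of rq->leaf_cfs_rq_list in update_blocked_averages()):

#include <stdio.h>

struct grp {
	const char *name;
	long load_avg;
	long removed;		/* load removed asynchronously, still pending */
	struct grp *parent;
};

static void update_blocked_averages_model(struct grp **leaf_order, int nr)
{
	int i;

	/* leaf_cfs_rq_list is ordered child first, parent last (patch 2) */
	for (i = 0; i < nr; i++) {
		struct grp *g = leaf_order[i];

		if (!g->removed)
			continue;
		g->load_avg -= g->removed;
		/* the parent is visited later in the same pass */
		if (g->parent)
			g->parent->removed += g->removed;
		g->removed = 0;
	}
}

int main(void)
{
	struct grp root = { "root", 2048, 0, NULL };
	struct grp tg1  = { "tg1",  2048, 0, &root };
	struct grp tg11 = { "tg11", 1024, 1024, &tg1 };	/* task of load 1024 migrated away */
	struct grp *order[] = { &tg11, &tg1, &root };

	update_blocked_averages_model(order, 3);

	printf("tg11=%ld tg1=%ld root=%ld\n",
	       tg11.load_avg, tg1.load_avg, root.load_avg);	/* 0 1024 1024 */
	return 0;
}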

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e47bb046..8abed16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3235,6 +3235,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 		sub_positive(&sa->load_avg, r);
 		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
+		set_tg_cfs_propagate(cfs_rq);
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
@@ -3242,6 +3243,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
+		set_tg_cfs_propagate(cfs_rq);
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -6818,6 +6820,10 @@ static void update_blocked_averages(int cpu)
 
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
+
+		/* Propagate pending load changes to the parent */
+		if (cfs_rq->tg->se[cpu])
+			update_load_avg(cfs_rq->tg->se[cpu], 0);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
-- 
2.7.4


* [PATCH 6/6 v7] sched: fix task group initialization
  2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (4 preceding siblings ...)
  2016-11-08  9:53 ` [PATCH 5/6 v7] sched: propagate asynchronous detach Vincent Guittot
@ 2016-11-08  9:53 ` Vincent Guittot
  2016-11-16 12:17   ` [tip:sched/core] sched/fair: Fix " tip-bot for Vincent Guittot
  2016-11-10 17:04 ` [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Dietmar Eggemann
  6 siblings, 1 reply; 17+ messages in thread
From: Vincent Guittot @ 2016-11-08  9:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, dietmar.eggemann
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp, Vincent Guittot

The moves of tasks are now propagated down to the root and the utilization
of the cfs_rq reflects reality, so it doesn't need to be estimated at init.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8abed16..89539d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9122,7 +9122,7 @@ void online_fair_sched_group(struct task_group *tg)
 		se = tg->se[i];
 
 		raw_spin_lock_irq(&rq->lock);
-		post_init_entity_util_avg(se);
+		attach_entity_cfs_rq(se);
 		sync_throttle(tg, i);
 		raw_spin_unlock_irq(&rq->lock);
 	}
-- 
2.7.4


* Re: [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach
  2016-11-08  9:53 ` [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach Vincent Guittot
@ 2016-11-09 15:03   ` Peter Zijlstra
  2016-11-09 15:23     ` Vincent Guittot
  2016-11-16 12:16   ` [tip:sched/core] sched/fair: Propagate " tip-bot for Vincent Guittot
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2016-11-09 15:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, dietmar.eggemann, yuyang.du,
	Morten.Rasmussen, pjt, bsegall, kernellwp

On Tue, Nov 08, 2016 at 10:53:45AM +0100, Vincent Guittot wrote:
> When a task moves from/to a cfs_rq, we set a flag which is then used to
> propagate the change at parent level (sched_entity and cfs_rq) during
> next update. If the cfs_rq is throttled, the flag will stay pending until
> the cfs_rq is unthrottled.
> 
> For propagating the utilization, we copy the utilization of group cfs_rq to
> the sched_entity.
> 
> For propagating the load, we have to take into account the load of the
> whole task group in order to evaluate the load of the sched_entity.
> Similarly to what was done before the rewrite of PELT, we add a correction
> factor in case the task group's load is greater than its share so it will
> contribute the same load of a task of equal weight.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---


I did the below on top, that basically moves code about a bit to reduce
some #ifdef and kills a few comments that I thought were of the:

	i++; /* increment by one */

quality.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2918,6 +2918,26 @@ __update_load_avg(u64 now, int cpu, stru
 	return decayed;
 }
 
+/*
+ * Signed add and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define add_positive(_ptr, _val) do {                           \
+	typeof(_ptr) ptr = (_ptr);                              \
+	typeof(_val) val = (_val);                              \
+	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
+								\
+	res = var + val;                                        \
+								\
+	if (val < 0 && res > var)                               \
+		res = 0;                                        \
+								\
+	WRITE_ONCE(*ptr, res);                                  \
+} while (0)
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /**
  * update_tg_load_avg - update the tg's load avg
@@ -2997,59 +3017,12 @@ void set_task_rq_fair(struct sched_entit
 		se->avg.last_update_time = n_last_update_time;
 	}
 }
-#else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
-#endif /* CONFIG_FAIR_GROUP_SCHED */
 
-static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
-{
-	if (&this_rq()->cfs == cfs_rq) {
-		/*
-		 * There are a few boundary cases this might miss but it should
-		 * get called often enough that that should (hopefully) not be
-		 * a real problem -- added to that it only calls on the local
-		 * CPU, so if we enqueue remotely we'll miss an update, but
-		 * the next tick/schedule should update.
-		 *
-		 * It will not get called when we go idle, because the idle
-		 * thread is a different class (!fair), nor will the utilization
-		 * number include things like RT tasks.
-		 *
-		 * As is, the util number is not freq-invariant (we'd have to
-		 * implement arch_scale_freq_capacity() for that).
-		 *
-		 * See cpu_util().
-		 */
-		cpufreq_update_util(rq_of(cfs_rq), 0);
-	}
-}
-
-/*
- * Signed add and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define add_positive(_ptr, _val) do {                           \
-	typeof(_ptr) ptr = (_ptr);                              \
-	typeof(_val) val = (_val);                              \
-	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
-								\
-	res = var + val;                                        \
-								\
-	if (val < 0 && res > var)                               \
-		res = 0;                                        \
-								\
-	WRITE_ONCE(*ptr, res);                                  \
-} while (0)
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
 /* Take into account change of utilization of a child task group */
 static inline void
 update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct cfs_rq *gcfs_rq =  group_cfs_rq(se);
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
 
 	/* Nothing to update */
@@ -3130,22 +3103,17 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
 
 static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
 {
-	/* set cfs_rq's flag */
 	cfs_rq->propagate_avg = 1;
 }
 
 static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se)
 {
-	/* Get my cfs_rq */
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 
-	/* Nothing to propagate */
 	if (!cfs_rq->propagate_avg)
 		return 0;
 
-	/* Clear my cfs_rq's flag */
 	cfs_rq->propagate_avg = 0;
-
 	return 1;
 }
 
@@ -3160,28 +3128,51 @@ static inline int propagate_entity_load_
 	if (!test_and_clear_tg_cfs_propagate(se))
 		return 0;
 
-	/* Get parent cfs_rq */
 	cfs_rq = cfs_rq_of(se);
 
-	/* Propagate to parent */
 	set_tg_cfs_propagate(cfs_rq);
 
-	/* Update utilization */
 	update_tg_cfs_util(cfs_rq, se);
-
-	/* Update load */
 	update_tg_cfs_load(cfs_rq, se);
 
 	return 1;
 }
-#else
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
+
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
 	return 0;
 }
 
 static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {}
-#endif
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
+{
+	if (&this_rq()->cfs == cfs_rq) {
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+		 * CPU, so if we enqueue remotely we'll miss an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq-invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(rq_of(cfs_rq), 0);
+	}
+}
 
 /*
  * Unsigned subtract and clamp on underflow.
@@ -3276,8 +3267,7 @@ static inline void update_load_avg(struc
 			  cfs_rq->curr == se, NULL);
 	}
 
-	decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
-
+	decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
 	decayed |= propagate_entity_load_avg(se);
 
 	if (decayed && (flags & UPDATE_TG))
@@ -8993,11 +8983,6 @@ static void detach_entity_cfs_rq(struct
 	update_load_avg(se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
-
-	/*
-	 * Propagate the detach across the tg tree to make it visible to the
-	 * root
-	 */
 	propagate_entity_cfs_rq(se);
 }
 
@@ -9017,11 +9002,6 @@ static void attach_entity_cfs_rq(struct
 	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
-
-	/*
-	 * Propagate the attach across the tg tree to make it visible to the
-	 * root
-	 */
 	propagate_entity_cfs_rq(se);
 }
 


* Re: [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach
  2016-11-09 15:03   ` Peter Zijlstra
@ 2016-11-09 15:23     ` Vincent Guittot
  0 siblings, 0 replies; 17+ messages in thread
From: Vincent Guittot @ 2016-11-09 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Yuyang Du,
	Morten Rasmussen, Paul Turner, Ben Segall, Wanpeng Li

On 9 November 2016 at 16:03, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Nov 08, 2016 at 10:53:45AM +0100, Vincent Guittot wrote:
>> When a task moves from/to a cfs_rq, we set a flag which is then used to
>> propagate the change at parent level (sched_entity and cfs_rq) during
>> next update. If the cfs_rq is throttled, the flag will stay pending until
>> the cfs_rq is unthrottled.
>>
>> For propagating the utilization, we copy the utilization of group cfs_rq to
>> the sched_entity.
>>
>> For propagating the load, we have to take into account the load of the
>> whole task group in order to evaluate the load of the sched_entity.
>> Similarly to what was done before the rewrite of PELT, we add a correction
>> factor in case the task group's load is greater than its share so it will
>> contribute the same load of a task of equal weight.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>
>
> I did the below on top, that basically moves code about a bit to reduce
> some #ifdef and kills a few comments that I thought were of the:
>
>         i++; /* increment by one */
>
> quality.

OK. The changes look fine to me

>
[snip]
>


* Re: [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load
  2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
                   ` (5 preceding siblings ...)
  2016-11-08  9:53 ` [PATCH 6/6 v7] sched: fix task group initialization Vincent Guittot
@ 2016-11-10 17:04 ` Dietmar Eggemann
  2016-11-12  9:27   ` Vincent Guittot
  6 siblings, 1 reply; 17+ messages in thread
From: Dietmar Eggemann @ 2016-11-10 17:04 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel
  Cc: yuyang.du, Morten.Rasmussen, pjt, bsegall, kernellwp

On 08/11/16 09:53, Vincent Guittot wrote:
> Ensure that the move of a sched_entity will be reflected in load and
> utilization of the task_group hierarchy.
> 
> When a sched_entity moves between groups or CPUs, load and utilization
> of cfs_rq don't reflect the changes immediately but converge to new values.
> As a result, the metrics are no longer aligned with the new balance of the
> load in the system and next decisions will have a biased view.
> 
> This patchset synchronizes load/utilization of sched_entity with its child
> cfs_rq (se->my_q) only when tasks move to/from child cfs_rq:
> -move between task group
> -migration between CPUs
> Otherwise, PELT is updated as usual.
> 
> This version doesn't include any changes related to discussion that have
> started during the review of the previous version about:
> - encapsulate the sequence for changing the property of a task
> - remove a cfs_rq from list during update_blocked_averages  
> These topics don't gain anything from being added in this patchset as they
> are fairly independent and deserve a separate patch. 

Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

I tested your v7 against tip with a synthetic workload (one task,
run/period = 8ms/16ms) running in a third level task group:
tg_root/tgX/tgXX/tgXXX and switching every 160ms between cpus (cpu1 <->
cpu2) or sched classes (fair<->rt) or task groups
(tg_root/tg1/tg11/tg111 <-> tg_root/tg2/tg21/tg211).

I shared the results under:

https://drive.google.com/drive/folders/0B2f-ZAwV_YnmYUM4X0NOOXZxdkk

The html files contain diagrams showing the value of the individual PELT
signals (util_avg, load_avg, runnable_load_avg, tg_load_avg_contrib) on
cfs_rq's and se's as well as tg's involved. The green signals are tip,
the blue signals are v7.

The diagrams show the intended behaviour of propagating PELT changes down
the tg hierarchy.
The only small issue is the load_avg, runnable_load_avg,
tg_load_avg_contrib signals in the 'switching between cpus' case
'20161110_reflect_v7_3rd_level_pelt_switch_between_cpus.html'.

The cfs_rq:load_avg signal [cell [11]] is ~500 -> ~500 -> ~350 -> ~300
for tg111 (tg_css_id = 4) -> tg11 (tg_css_id = 3) -> tg1 (tg_css_id = 2)
-> tg_root (tg_css_id = 1) where I expected it to be ~500 on all tg levels.
Since this problem only occurs in the 'switching cpu' test case and only
for load (not utilization), and since the value for the initial run on a
cpu is ~500 all the way down (it only drops once we have switched at least
once), it probably has something to do with how we calculate shares and/or
with missing PELT updates on idle cpus. But IMHO we can investigate this later.

> Changes since v6:
> -fix warning and error raised by lkp
> 
> Changes since v5:
> - factorize detach entity like for attach
> - fix add_positive
> - Fixed few coding style
> 
> Changes since v4:
> - minor typo and commit message changes
> - move call to cfs_rq_clock_task(cfs_rq) in post_init_entity_util_avg
> 
> Changes since v3:
> - Replaced the 2 arguments of update_load_avg by 1 flags argument
> - Propagated move in runnable_load_avg when sched_entity is already on_rq
> - Ensure that intermediate value will not reach memory when updating load and
>   utilization
> - Optimize the the calculation of load_avg of the sched_entity
> - Fixed some typo
> 
> Changes since v2:
> - Propagate both utilization and load
> - Synced sched_entity and se->my_q instead of adding the delta
> 
> Changes since v1:
> - This patch needs the patch that fixes issue with rq->leaf_cfs_rq_list
>   "sched: fix hierarchical order in rq->leaf_cfs_rq_list" in order to work
>   correctly. I haven't sent them as a single patchset because the fix is
>   independent of this one
> - Merge some functions that are always used together
> - During update of blocked load, ensure that the sched_entity is synced
>   with the cfs_rq applying changes
> - Fix an issue when task changes its cpu affinity
> 
> Vincent Guittot (6):
>   sched: factorize attach/detach entity
>   sched: fix hierarchical order in rq->leaf_cfs_rq_list
>   sched: factorize PELT update
>   sched: propagate load during synchronous attach/detach
>   sched: propagate asynchronous detach
>   sched: fix task group initialization
> 
>  kernel/sched/core.c  |   1 +
>  kernel/sched/fair.c  | 395 ++++++++++++++++++++++++++++++++++++++++-----------
>  kernel/sched/sched.h |   2 +
>  3 files changed, 318 insertions(+), 80 deletions(-)
> 


* Re: [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load
  2016-11-10 17:04 ` [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Dietmar Eggemann
@ 2016-11-12  9:27   ` Vincent Guittot
  0 siblings, 0 replies; 17+ messages in thread
From: Vincent Guittot @ 2016-11-12  9:27 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Yuyang Du,
	Morten Rasmussen, Paul Turner, Ben Segall, Wanpeng Li

On 10 November 2016 at 18:04, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 08/11/16 09:53, Vincent Guittot wrote:
>> Ensure that the move of a sched_entity will be reflected in load and
>> utilization of the task_group hierarchy.
>>
>> When a sched_entity moves between groups or CPUs, load and utilization
>> of cfs_rq don't reflect the changes immediately but converge to new values.
>> As a result, the metrics are no longer aligned with the new balance of the
>> load in the system and next decisions will have a biased view.
>>
>> This patchset synchronizes load/utilization of sched_entity with its child
>> cfs_rq (se->my_q) only when tasks move to/from child cfs_rq:
>> -move between task group
>> -migration between CPUs
>> Otherwise, PELT is updated as usual.
>>
>> This version doesn't include any changes related to discussion that have
>> started during the review of the previous version about:
>> - encapsulate the sequence for changing the property of a task
>> - remove a cfs_rq from list during update_blocked_averages
>> These topics don't gain anything from being added in this patchset as they
>> are fairly independent and deserve a separate patch.
>
> Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks

>
> I tested your v7 against tip with a synthetic workload (one task,
> run/period = 8ms/16ms) running in a third level task group:
> tg_root/tgX/tgXX/tgXXX and switching every 160ms between cpus (cpu1 <->
> cpu2) or sched classes (fair<->rt) or task groups
> (tg_root/tg1/tg11/tg111 <-> tg_root/tg2/tg21/tg211).
>
> I shared the results under:
>
> https://drive.google.com/drive/folders/0B2f-ZAwV_YnmYUM4X0NOOXZxdkk
>
> The html files contain diagrams showing the value of the individual PELT
> signals (util_avg, load_avg, runnable_load_avg, tg_load_avg_contrib) on
> cfs_rq's and se's as well as tg's involved. The green signals are tip,
> the blue signals are v7.
>
> The diagrams show the aimed behaviour of propagating PELT changes down
> the tg hierarchy.
> The only small issue is the load_avg, runnable_load_avg,
> tg_load_avg_contrib signals in the 'switching between cpus' case
> '20161110_reflect_v7_3rd_level_pelt_switch_between_cpus.html'.
>
> The cfs_rq:load_avg signal [cell [11]] is ~500 -> ~500 -> ~350 -> ~300
> for tg111 (tg_css_id = 4) -> tg11 (tg_css_id = 3) -> tg1 (tg_css_id = 2)
> -> tg_root (tg_css_id = 1) where I expected it to be ~500 on all tg levels.
> Since this problem only occurs in the 'switching cpu' test case and only
> for load (not utilization) and since the value for the initial run on a
> cpu is ~500 all the way down (it only drops once we switched at least
> one time) it probably has something to do how we calculate shares and/or
> missing PELT updates on idle cpus. But IMHO we can investigate this later.

Yes, it can happen because the migration between CPUs is asynchronous,
which means that the load of the migrated task can still be present on the
previous CPU when the new shares are computed.

>
>> Changes since v6:
>> -fix warning and error raised by lkp
>>
>> Changes since v5:
>> - factorize detach entity like for attach
>> - fix add_positive
>> - Fixed few coding style
>>
>> Changes since v4:
>> - minor typo and commit message changes
>> - move call to cfs_rq_clock_task(cfs_rq) in post_init_entity_util_avg
>>
>> Changes since v3:
>> - Replaced the 2 arguments of update_load_avg by 1 flags argument
>> - Propagated move in runnable_load_avg when sched_entity is already on_rq
>> - Ensure that intermediate value will not reach memory when updating load and
>>   utilization
>> - Optimize the the calculation of load_avg of the sched_entity
>> - Fixed some typo
>>
>> Changes since v2:
>> - Propagate both utilization and load
>> - Synced sched_entity and se->my_q instead of adding the delta
>>
>> Changes since v1:
>> - This patch needs the patch that fixes issue with rq->leaf_cfs_rq_list
>>   "sched: fix hierarchical order in rq->leaf_cfs_rq_list" in order to work
>>   correctly. I haven't sent them as a single patchset because the fix is
>>   independent of this one
>> - Merge some functions that are always used together
>> - During update of blocked load, ensure that the sched_entity is synced
>>   with the cfs_rq applying changes
>> - Fix an issue when task changes its cpu affinity
>>
>> Vincent Guittot (6):
>>   sched: factorize attach/detach entity
>>   sched: fix hierarchical order in rq->leaf_cfs_rq_list
>>   sched: factorize PELT update
>>   sched: propagate load during synchronous attach/detach
>>   sched: propagate asynchronous detach
>>   sched: fix task group initialization
>>
>>  kernel/sched/core.c  |   1 +
>>  kernel/sched/fair.c  | 395 ++++++++++++++++++++++++++++++++++++++++-----------
>>  kernel/sched/sched.h |   2 +
>>  3 files changed, 318 insertions(+), 80 deletions(-)
>>


* [tip:sched/core] sched/fair: Factorize attach/detach entity
  2016-11-08  9:53 ` [PATCH 1/6 v7] sched: factorize attach/detach entity Vincent Guittot
@ 2016-11-16 12:15   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-11-16 12:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, hpa, torvalds, peterz, vincent.guittot, linux-kernel,
	tglx, dietmar.eggemann

Commit-ID:  df217913e72ec7e603d8b68cc4c70646cf7000db
Gitweb:     http://git.kernel.org/tip/df217913e72ec7e603d8b68cc4c70646cf7000db
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 8 Nov 2016 10:53:42 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Nov 2016 10:29:08 +0100

sched/fair: Factorize attach/detach entity

Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
into one function, attach_entity_cfs_rq().

Create a symmetric detach_entity_cfs_rq() function.
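
In outline, the resulting structure (abridged from the diff below) is:

  detach_task_cfs_rq(p):  vruntime fixup (if needed) -> detach_entity_cfs_rq(&p->se)
  attach_task_cfs_rq(p):  attach_entity_cfs_rq(&p->se) -> vruntime fixup (if needed)
  post_init_entity_util_avg(se):  ... -> attach_entity_cfs_rq(se)

with both attach_entity_cfs_rq() and detach_entity_cfs_rq() doing the same
three-step synchronization: update_cfs_rq_load_avg(), then
attach/detach_entity_load_avg(), then update_tg_load_avg().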

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e6c00a..0731aff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,9 +701,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 }
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
-static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq);
-static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se);
+static void attach_entity_cfs_rq(struct sched_entity *se);
 
 /*
  * With new tasks being created, their initial util_avgs are extrapolated
@@ -735,7 +733,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	struct sched_avg *sa = &se->avg;
 	long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
@@ -763,14 +760,12 @@ void post_init_entity_util_avg(struct sched_entity *se)
 			 * such that the next switched_to_fair() has the
 			 * expected state.
 			 */
-			se->avg.last_update_time = now;
+			se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
 			return;
 		}
 	}
 
-	update_cfs_rq_load_avg(now, cfs_rq, false);
-	attach_entity_load_avg(cfs_rq, se);
-	update_tg_load_avg(cfs_rq, false);
+	attach_entity_cfs_rq(se);
 }
 
 #else /* !CONFIG_SMP */
@@ -8783,30 +8778,19 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	return false;
 }
 
-static void detach_task_cfs_rq(struct task_struct *p)
+static void detach_entity_cfs_rq(struct sched_entity *se)
 {
-	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
 
-	if (!vruntime_normalized(p)) {
-		/*
-		 * Fix up our vruntime so that the current sleep doesn't
-		 * cause 'unlimited' sleep bonus.
-		 */
-		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
-	}
-
 	/* Catch up with the cfs_rq and remove our load when we leave */
 	update_cfs_rq_load_avg(now, cfs_rq, false);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }
 
-static void attach_task_cfs_rq(struct task_struct *p)
+static void attach_entity_cfs_rq(struct sched_entity *se)
 {
-	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
 
@@ -8818,10 +8802,35 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	se->depth = se->parent ? se->parent->depth + 1 : 0;
 #endif
 
-	/* Synchronize task with its cfs_rq */
+	/* Synchronize entity with its cfs_rq */
 	update_cfs_rq_load_avg(now, cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+}
+
+static void detach_task_cfs_rq(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (!vruntime_normalized(p)) {
+		/*
+		 * Fix up our vruntime so that the current sleep doesn't
+		 * cause 'unlimited' sleep bonus.
+		 */
+		place_entity(cfs_rq, se, 0);
+		se->vruntime -= cfs_rq->min_vruntime;
+	}
+
+	detach_entity_cfs_rq(se);
+}
+
+static void attach_task_cfs_rq(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	attach_entity_cfs_rq(se);
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [tip:sched/core] sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
  2016-11-08  9:53 ` [PATCH 2/6 v7] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
@ 2016-11-16 12:15   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-11-16 12:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, dietmar.eggemann, peterz, tglx, vincent.guittot,
	mingo, linux-kernel

Commit-ID:  9c2791f936ef5fd04a118b5c284f2c9a95f4a647
Gitweb:     http://git.kernel.org/tip/9c2791f936ef5fd04a118b5c284f2c9a95f4a647
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 8 Nov 2016 10:53:43 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Nov 2016 10:29:08 +0100

sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list

Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.

The hierarchical order in shares update list has been introduced by
commit:

  67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

With the current implementation, a child can still be put after its
parent.

Let's take the example of:

       root
        \
         b
         /\
         c d*
           |
           e*

with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parent is not already on the list so e is put at
the tail: head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called last, whereas
it should be called at the beginning.

Because it follows the bottom-up enqueue sequence, we are sure that we
will finish by adding either a cfs_rq without a parent or a cfs_rq whose
parent is already on the list. We can use this event to detect when we
have finished adding a new branch. For the others, whose parents have
not been added yet, we have to ensure that they will be added after
their children that have just been inserted in the previous steps, and
after any potential parents that are already in the list. The easiest
way is to put the cfs_rq just after the last inserted one and to keep
track of it until the branch is fully added.
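
For reference, a hand-traced walk of the example above with the new
insertion rules (as implemented in the diff below), starting with
tmp_alone_branch pointing at the list head:

  enqueue e (its parent d is not yet on the list):
      e is inserted right after tmp_alone_branch:
          head -> e -> c -> b -> root -> tail
      tmp_alone_branch now points to e
  enqueue d (its parent b is already on the list):
      d is inserted just before its parent b:
          head -> e -> c -> d -> b -> root -> tail
      tmp_alone_branch is reset to the list head

so every child now appears before its parent (e before d, d before b,
c and b before root).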

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 54 +++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  1 +
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6bf1fd3..dc64bd7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7604,6 +7604,7 @@ void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
 		/*
 		 * How much cpu bandwidth does root_task_group get?
 		 *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0731aff..4a67026 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -283,19 +283,59 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_rq->on_list) {
+		struct rq *rq = rq_of(cfs_rq);
+		int cpu = cpu_of(rq);
 		/*
 		 * Ensure we either appear before our parent (if already
 		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued.  The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * enqueued. The fact that we always enqueue bottom-up
+		 * reduces this to two cases and a special case for the root
+		 * cfs_rq. Furthermore, it also means that we will always reset
+		 * tmp_alone_branch either when the branch is connected
+		 * to a tree or when we reach the beginning of the tree.
 		 */
 		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
+		    cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
+			/*
+			 * If parent is already on the list, we add the child
+			 * just before. Thanks to circular linked property of
+			 * the list, this means to put the child at the tail
+			 * of the list that starts by parent.
+			 */
+			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
+			/*
+			 * The branch is now connected to its tree so we can
+			 * reset tmp_alone_branch to the beginning of the
+			 * list.
+			 */
+			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+		} else if (!cfs_rq->tg->parent) {
+			/*
+			 * cfs rq without parent should be put
+			 * at the tail of the list.
+			 */
 			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+				&rq->leaf_cfs_rq_list);
+			/*
+			 * We have reached the beginning of a tree so we can reset
+			 * tmp_alone_branch to the beginning of the list.
+			 */
+			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+		} else {
+			/*
+			 * The parent has not already been added so we want to
+			 * make sure that it will be put after us.
+			 * tmp_alone_branch points to the beginning of the branch
+			 * where we will add parent.
+			 */
+			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				rq->tmp_alone_branch);
+			/*
+			 * update tmp_alone_branch to points to the new beg
+			 * Update tmp_alone_branch to point to the new
+			 * beginning of the branch.
+			rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
 		}
 
 		cfs_rq->on_list = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 345c1cc..36f30e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -623,6 +623,7 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
+	struct list_head *tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [tip:sched/core] sched/fair: Factorize PELT update
  2016-11-08  9:53 ` [PATCH 3/6 v7] sched: factorize PELT update Vincent Guittot
@ 2016-11-16 12:16   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-11-16 12:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, hpa, dietmar.eggemann, torvalds, linux-kernel,
	tglx, vincent.guittot

Commit-ID:  d31b1a66cbe0931733583ad9d9e8c6cfd710907d
Gitweb:     http://git.kernel.org/tip/d31b1a66cbe0931733583ad9d9e8c6cfd710907d
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 8 Nov 2016 10:53:44 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Nov 2016 10:29:09 +0100

sched/fair: Factorize PELT update

Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:

 - when attaching/detaching a sched_entity, we update cfs_rq and then
   we sync the entity with the cfs_rq.

 - when enqueueing/dequeuing the sched_entity, we update both
   sched_entity and cfs_rq metrics to now.

Use update_load_avg() every time we have to update and sync the cfs_rq
and the sched_entity before changing the state of a sched_entity.
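
The call sites in the diff below then reduce to a small set of patterns
(summarized here for reference, not new code):

	update_load_avg(se, UPDATE_TG);	/* enqueue/dequeue/tick: age se, sync cfs_rq, update tg if it decayed */
	update_load_avg(se, 0);		/* detach path: sync se and cfs_rq without forcing a tg update */
	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
					/* attach path: optionally skip aging the freshly attached se */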

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 76 ++++++++++++++++++-----------------------------------
 1 file changed, 25 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a67026..d707ad0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3092,8 +3092,14 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	return decayed || removed_load;
 }
 
+/*
+ * Optional action to be done while updating the load average
+ */
+#define UPDATE_TG	0x1
+#define SKIP_AGE_LOAD	0x2
+
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_load_avg(struct sched_entity *se, int flags)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -3104,11 +3110,13 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	__update_load_avg(now, cpu, &se->avg,
+	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
+		__update_load_avg(now, cpu, &se->avg,
 			  se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
+	}
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
+	if (update_cfs_rq_load_avg(now, cfs_rq, true) && (flags & UPDATE_TG))
 		update_tg_load_avg(cfs_rq, 0);
 }
 
@@ -3122,26 +3130,6 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
  */
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (!sched_feat(ATTACH_AGE_LOAD))
-		goto skip_aging;
-
-	/*
-	 * If we got migrated (either between CPUs or between cgroups) we'll
-	 * have aged the average right before clearing @last_update_time.
-	 *
-	 * Or we're fresh through post_init_entity_util_avg().
-	 */
-	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-				  &se->avg, 0, 0, NULL);
-
-		/*
-		 * XXX: we could have just aged the entire load away if we've been
-		 * absent from the fair class for too long.
-		 */
-	}
-
-skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -3161,9 +3149,6 @@ skip_aging:
  */
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
-			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
 
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
@@ -3178,34 +3163,20 @@ static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	int migrated, decayed;
-
-	migrated = !sa->last_update_time;
-	if (!migrated) {
-		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-			se->on_rq * scale_load_down(se->load.weight),
-			cfs_rq->curr == se, NULL);
-	}
-
-	decayed = update_cfs_rq_load_avg(now, cfs_rq, !migrated);
 
 	cfs_rq->runnable_load_avg += sa->load_avg;
 	cfs_rq->runnable_load_sum += sa->load_sum;
 
-	if (migrated)
+	if (!sa->last_update_time) {
 		attach_entity_load_avg(cfs_rq, se);
-
-	if (decayed || migrated)
 		update_tg_load_avg(cfs_rq, 0);
+	}
 }
 
 /* Remove the runnable load generated by se from cfs_rq's runnable load average */
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
-
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
 	cfs_rq->runnable_load_sum =
@@ -3289,7 +3260,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	return 0;
 }
 
-static inline void update_load_avg(struct sched_entity *se, int not_used)
+#define UPDATE_TG	0x0
+#define SKIP_AGE_LOAD	0x0
+
+static inline void update_load_avg(struct sched_entity *se, int not_used1)
 {
 	cpufreq_update_util(rq_of(cfs_rq_of(se)), 0);
 }
@@ -3434,6 +3408,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
+	update_load_avg(se, UPDATE_TG);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -3508,6 +3483,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_load_avg(se, UPDATE_TG);
 	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
@@ -3595,7 +3571,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_load_avg(se, UPDATE_TG);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3713,7 +3689,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_load_avg(curr, UPDATE_TG);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4610,7 +4586,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, UPDATE_TG);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4669,7 +4645,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, UPDATE_TG);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -8821,10 +8797,9 @@ static inline bool vruntime_normalized(struct task_struct *p)
 static void detach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	update_load_avg(se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }
@@ -8832,7 +8807,6 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
 static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -8843,7 +8817,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 #endif
 
 	/* Synchronize entity with its cfs_rq */
-	update_cfs_rq_load_avg(now, cfs_rq, false);
+	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
 }

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [tip:sched/core] sched/fair: Propagate load during synchronous attach/detach
  2016-11-08  9:53 ` [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach Vincent Guittot
  2016-11-09 15:03   ` Peter Zijlstra
@ 2016-11-16 12:16   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-11-16 12:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, mingo, peterz, dietmar.eggemann, vincent.guittot, hpa,
	linux-kernel, tglx

Commit-ID:  09a43ace1f986b003c118fdf6ddf1fd685692d49
Gitweb:     http://git.kernel.org/tip/09a43ace1f986b003c118fdf6ddf1fd685692d49
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 8 Nov 2016 10:53:45 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Nov 2016 10:29:10 +0100

sched/fair: Propagate load during synchronous attach/detach

When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at the parent level (sched_entity and cfs_rq) during
the next update. If the cfs_rq is throttled, the flag will stay pending
until the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of the group
cfs_rq to the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its shares, so that
it contributes the same load as a task of equal weight.
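
As a small worked example of that correction (numbers chosen purely for
illustration): with scale_load_down(tg->shares) = 1024, a group cfs_rq
load_avg of 2048 and a corrected tg_load of 2048, the load propagated to
the sched_entity is 2048 * 1024 / 2048 = 1024, i.e. no more than a
single runnable task of weight tg->shares would contribute; when tg_load
does not exceed the shares, the group cfs_rq load is propagated unchanged.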

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  | 188 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   1 +
 2 files changed, 188 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d707ad0..8cf26fd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2918,6 +2918,26 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	return decayed;
 }
 
+/*
+ * Signed add and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define add_positive(_ptr, _val) do {                           \
+	typeof(_ptr) ptr = (_ptr);                              \
+	typeof(_val) val = (_val);                              \
+	typeof(*ptr) res, var = READ_ONCE(*ptr);                \
+								\
+	res = var + val;                                        \
+								\
+	if (val < 0 && res > var)                               \
+		res = 0;                                        \
+								\
+	WRITE_ONCE(*ptr, res);                                  \
+} while (0)
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /**
  * update_tg_load_avg - update the tg's load avg
@@ -2997,8 +3017,138 @@ void set_task_rq_fair(struct sched_entity *se,
 		se->avg.last_update_time = n_last_update_time;
 	}
 }
+
+/* Take into account change of utilization of a child task group */
+static inline void
+update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+	long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/* Set new sched_entity's utilization */
+	se->avg.util_avg = gcfs_rq->avg.util_avg;
+	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq utilization */
+	add_positive(&cfs_rq->avg.util_avg, delta);
+	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
+}
+
+/* Take into account change of load of a child task group */
+static inline void
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+	long delta, load = gcfs_rq->avg.load_avg;
+
+	/*
+	 * If the load of group cfs_rq is null, the load of the
+	 * sched_entity will also be null so we can skip the formula
+	 */
+	if (load) {
+		long tg_load;
+
+		/* Get tg's load and ensure tg_load > 0 */
+		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
+
+		/* Ensure tg_load >= load and updated with current load */
+		tg_load -= gcfs_rq->tg_load_avg_contrib;
+		tg_load += load;
+
+		/*
+		 * We need to compute a correction term in the case that the
+		 * task group is consuming more CPU than a task of equal
+		 * weight. A task with a weight equal to tg->shares will have
+		 * a load less or equal to scale_load_down(tg->shares).
+		 * Similarly, the sched_entities that represent the task group
+		 * at parent level, can't have a load higher than
+		 * scale_load_down(tg->shares). And the Sum of sched_entities'
+		 * load must be <= scale_load_down(tg->shares).
+		 */
+		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
+			/* scale gcfs_rq's load into tg's shares */
+			load *= scale_load_down(gcfs_rq->tg->shares);
+			load /= tg_load;
+		}
+	}
+
+	delta = load - se->avg.load_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/* Set new sched_entity's load */
+	se->avg.load_avg = load;
+	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq load */
+	add_positive(&cfs_rq->avg.load_avg, delta);
+	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+
+	/*
+	 * If the sched_entity is already enqueued, we also have to update the
+	 * runnable load avg.
+	 */
+	if (se->on_rq) {
+		/* Update parent cfs_rq runnable_load_avg */
+		add_positive(&cfs_rq->runnable_load_avg, delta);
+		cfs_rq->runnable_load_sum = cfs_rq->runnable_load_avg * LOAD_AVG_MAX;
+	}
+}
+
+static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->propagate_avg = 1;
+}
+
+static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
+
+	if (!cfs_rq->propagate_avg)
+		return 0;
+
+	cfs_rq->propagate_avg = 0;
+	return 1;
+}
+
+/* Update task and its cfs_rq load average */
+static inline int propagate_entity_load_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	if (entity_is_task(se))
+		return 0;
+
+	if (!test_and_clear_tg_cfs_propagate(se))
+		return 0;
+
+	cfs_rq = cfs_rq_of(se);
+
+	set_tg_cfs_propagate(cfs_rq);
+
+	update_tg_cfs_util(cfs_rq, se);
+	update_tg_cfs_load(cfs_rq, se);
+
+	return 1;
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
+
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
+
+static inline int propagate_entity_load_avg(struct sched_entity *se)
+{
+	return 0;
+}
+
+static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
@@ -3105,6 +3255,7 @@ static inline void update_load_avg(struct sched_entity *se, int flags)
 	u64 now = cfs_rq_clock_task(cfs_rq);
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
+	int decayed;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -3116,7 +3267,10 @@ static inline void update_load_avg(struct sched_entity *se, int flags)
 			  cfs_rq->curr == se, NULL);
 	}
 
-	if (update_cfs_rq_load_avg(now, cfs_rq, true) && (flags & UPDATE_TG))
+	decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
+	decayed |= propagate_entity_load_avg(se);
+
+	if (decayed && (flags & UPDATE_TG))
 		update_tg_load_avg(cfs_rq, 0);
 }
 
@@ -3135,6 +3289,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	cfs_rq->avg.load_sum += se->avg.load_sum;
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	set_tg_cfs_propagate(cfs_rq);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -3154,6 +3309,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	set_tg_cfs_propagate(cfs_rq);
 
 	cfs_rq_util_change(cfs_rq);
 }
@@ -8794,6 +8950,31 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	return false;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * Propagate the changes of the sched_entity across the tg tree to make it
+ * visible to the root
+ */
+static void propagate_entity_cfs_rq(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	/* Start to propagate at parent */
+	se = se->parent;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, UPDATE_TG);
+	}
+}
+#else
+static void propagate_entity_cfs_rq(struct sched_entity *se) { }
+#endif
+
 static void detach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -8802,6 +8983,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
 	update_load_avg(se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+	propagate_entity_cfs_rq(se);
 }
 
 static void attach_entity_cfs_rq(struct sched_entity *se)
@@ -8820,6 +9002,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 	update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
+	propagate_entity_cfs_rq(se);
 }
 
 static void detach_task_cfs_rq(struct task_struct *p)
@@ -8898,6 +9081,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
 #ifdef CONFIG_SMP
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	cfs_rq->propagate_avg = 0;
+#endif
 	atomic_long_set(&cfs_rq->removed_load_avg, 0);
 	atomic_long_set(&cfs_rq->removed_util_avg, 0);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 36f30e0..d7e3931 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -404,6 +404,7 @@ struct cfs_rq {
 	unsigned long runnable_load_avg;
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
+	unsigned long propagate_avg;
 #endif
 	atomic_long_t removed_load_avg, removed_util_avg;
 #ifndef CONFIG_64BIT

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [tip:sched/core] sched/fair: Propagate asynchrous detach
  2016-11-08  9:53 ` [PATCH 5/6 v7] sched: propagate asynchrous detach Vincent Guittot
@ 2016-11-16 12:17   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-11-16 12:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, torvalds, dietmar.eggemann, tglx, peterz, mingo,
	vincent.guittot, hpa

Commit-ID:  4e5160766fcc9f41bbd38bac11f92dce993644aa
Gitweb:     http://git.kernel.org/tip/4e5160766fcc9f41bbd38bac11f92dce993644aa
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 8 Nov 2016 10:53:46 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Nov 2016 10:29:10 +0100

sched/fair: Propagate asynchrous detach

A task can be asynchronously detached from its cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from the
source cfs_rq during its next update. We use this event to set the
propagation flag.

During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.

The propagation relies on patch:

  "sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

... which orders children and parents, to ensure that it's done in one pass.
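
Roughly, the resulting sequence (function names taken from this series
and the existing fair.c paths; a simplified view, not a complete call
chain) is:

 1. migrate_task_rq_fair() -> remove_entity_load_avg(): the old CPU's
    cfs_rq records the load to remove in removed_load_avg/removed_util_avg.
 2. update_cfs_rq_load_avg() on the old CPU later subtracts that load
    and, with this patch, calls set_tg_cfs_propagate() on the cfs_rq.
 3. update_blocked_averages() walks rq->leaf_cfs_rq_list (children before
    parents) and calls update_load_avg(cfs_rq->tg->se[cpu], 0), letting
    propagate_entity_load_avg() push the pending delta one level up at
    each step, so the change reaches the root in a single pass.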

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cf26fd..090a9bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3219,6 +3219,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 		sub_positive(&sa->load_avg, r);
 		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
+		set_tg_cfs_propagate(cfs_rq);
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
@@ -3226,6 +3227,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
+		set_tg_cfs_propagate(cfs_rq);
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -6872,6 +6874,10 @@ static void update_blocked_averages(int cpu)
 
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
+
+		/* Propagate pending load changes to the parent */
+		if (cfs_rq->tg->se[cpu])
+			update_load_avg(cfs_rq->tg->se[cpu], 0);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [tip:sched/core] sched/fair: Fix task group initialization
  2016-11-08  9:53 ` [PATCH 6/6 v7] sched: fix task group initialization Vincent Guittot
@ 2016-11-16 12:17   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-11-16 12:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, vincent.guittot, torvalds, linux-kernel, dietmar.eggemann,
	hpa, peterz, mingo

Commit-ID:  d03266910a533d874c01ef2ca8dc73009f2925fa
Gitweb:     http://git.kernel.org/tip/d03266910a533d874c01ef2ca8dc73009f2925fa
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 8 Nov 2016 10:53:47 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Nov 2016 10:29:11 +0100

sched/fair: Fix task group initialization

Task moves are now propagated down to the root, and the utilization of
the cfs_rq reflects reality, so it doesn't need to be estimated at init.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 090a9bb..02605f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9198,7 +9198,7 @@ void online_fair_sched_group(struct task_group *tg)
 		se = tg->se[i];
 
 		raw_spin_lock_irq(&rq->lock);
-		post_init_entity_util_avg(se);
+		attach_entity_cfs_rq(se);
 		sync_throttle(tg, i);
 		raw_spin_unlock_irq(&rq->lock);
 	}

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-11-16 12:18 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-08  9:53 [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Vincent Guittot
2016-11-08  9:53 ` [PATCH 1/6 v7] sched: factorize attach/detach entity Vincent Guittot
2016-11-16 12:15   ` [tip:sched/core] sched/fair: Factorize " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 2/6 v7] sched: fix hierarchical order in rq->leaf_cfs_rq_list Vincent Guittot
2016-11-16 12:15   ` [tip:sched/core] sched/fair: Fix " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 3/6 v7] sched: factorize PELT update Vincent Guittot
2016-11-16 12:16   ` [tip:sched/core] sched/fair: Factorize " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 4/6 v7] sched: propagate load during synchronous attach/detach Vincent Guittot
2016-11-09 15:03   ` Peter Zijlstra
2016-11-09 15:23     ` Vincent Guittot
2016-11-16 12:16   ` [tip:sched/core] sched/fair: Propagate " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 5/6 v7] sched: propagate asynchrous detach Vincent Guittot
2016-11-16 12:17   ` [tip:sched/core] sched/fair: Propagate " tip-bot for Vincent Guittot
2016-11-08  9:53 ` [PATCH 6/6 v7] sched: fix task group initialization Vincent Guittot
2016-11-16 12:17   ` [tip:sched/core] sched/fair: Fix " tip-bot for Vincent Guittot
2016-11-10 17:04 ` [PATCH 0/6 v7] sched: reflect sched_entity move into task_group's load Dietmar Eggemann
2016-11-12  9:27   ` Vincent Guittot
