From: Yuyang Du
To: peterz@infradead.org, mingo@kernel.org, linux-kernel@vger.kernel.org
Cc: bsegall@google.com, pjt@google.com, morten.rasmussen@arm.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	matt@codeblueprint.co.uk, Yuyang Du
Subject: [PATCH v5 2/5] sched/fair: Fix attaching task sched avgs twice when
	switching to fair or changing task group
Date: Thu, 9 Jun 2016 07:15:51 +0800
Message-Id: <1465427754-28897-3-git-send-email-yuyang.du@intel.com>
X-Mailer: git-send-email 1.7.9.5
In-Reply-To: <1465427754-28897-1-git-send-email-yuyang.du@intel.com>
References: <1465427754-28897-1-git-send-email-yuyang.du@intel.com>

Vincent reported that the first task of a new task group's cfs_rq is
attached in attach_task_cfs_rq() and then attached once more when it is
enqueued (see https://lkml.org/lkml/2016/5/25/388).

Actually, it is worse: the sched avgs can sometimes be attached twice
not only when we change task groups but also when we switch to the fair
class. The two scenarios are described below.

(1) Switch to fair class:

The sched class change is done like this:

	if (queued)
		enqueue_task();
	check_class_changed()
		switched_from()
		switched_to()

If the task is on_rq, it has already been enqueued before switched_to()
is called, and that enqueue attached the sched avgs to the cfs_rq if the
task's last_update_time was 0, which can happen if the task was never
fair class. In that case, we should not attach the sched avgs again in
switched_to(); otherwise, we attach them twice.

To handle both the on_rq and !on_rq cases, as well as the cases where
the task was previously fair class or not, the simplest solution is to
reset the task's last_update_time to 0 when the task is switched from
the fair class, and then let task enqueue do the sched avgs attachment
uniformly, only once.

(2) Change between fair task groups:

The task groups are changed like this:

	if (queued)
		dequeue_task()
	task_move_group()
	if (queued)
		enqueue_task()

Unlike the switch-to-fair-class case, if the task is on_rq it is
enqueued right after we move task groups; if not, it will be enqueued
later when it becomes runnable. The attach-twice problem can happen if
the cfs_rq and the task are both new, as Vincent discovered. The
simplest solution is to only reset the task's last_update_time in
task_move_group(), and then let enqueue_task() do the sched avgs
attachment.
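The attach-once idea can be illustrated with a minimal userspace sketch
(illustrative only, not kernel code; the toy_* structs and helpers are
invented stand-ins for the sched_entity/cfs_rq averages and for
attach_entity_load_avg() and the enqueue-time attachment): enqueue
attaches the averages only when last_update_time is 0, so the old
"attach again after enqueue" path double-counts the load, while "reset
last_update_time and let enqueue attach" counts it exactly once.

#include <stdio.h>

struct toy_se {				/* stand-in for sched_entity::avg */
	unsigned long load_avg;
	unsigned long long last_update_time;
};

struct toy_cfs_rq {			/* stand-in for cfs_rq::avg */
	unsigned long load_avg;
	unsigned long long clock;
};

/* Models attach_entity_load_avg(): fold the entity's avg into the cfs_rq. */
static void toy_attach(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	se->last_update_time = cfs_rq->clock;
	cfs_rq->load_avg += se->load_avg;
}

/* Enqueue attaches only when last_update_time == 0 (never attached yet). */
static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	if (!se->last_update_time)
		toy_attach(cfs_rq, se);
}

int main(void)
{
	struct toy_cfs_rq cfs_rq = { .load_avg = 0, .clock = 100 };
	struct toy_se se = { .load_avg = 512, .last_update_time = 0 };

	/*
	 * Old flow for an on_rq task switching to fair: enqueue_task()
	 * attaches (last_update_time was 0), then switched_to_fair() ->
	 * attach_task_cfs_rq() attached a second time.
	 */
	toy_enqueue(&cfs_rq, &se);	/* first attach            */
	toy_attach(&cfs_rq, &se);	/* the buggy second attach */
	printf("old flow: cfs_rq load_avg = %lu (double-counted)\n",
	       cfs_rq.load_avg);

	/*
	 * New flow: switched_from_fair()/task_move_group_fair() only reset
	 * last_update_time to 0; the one and only attach happens at enqueue.
	 */
	cfs_rq.load_avg = 0;
	se.last_update_time = 0;	/* reset_task_last_update_time() */
	toy_enqueue(&cfs_rq, &se);	/* attaches exactly once   */
	toy_enqueue(&cfs_rq, &se);	/* no-op: already attached */
	printf("new flow: cfs_rq load_avg = %lu (counted once)\n",
	       cfs_rq.load_avg);
	return 0;
}

With the task's load_avg set to 512 in the sketch, the old flow ends up
with 1024 on the cfs_rq while the new flow ends up with 512.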
Reported-by: Vincent Guittot
Signed-off-by: Yuyang Du
---
 kernel/sched/fair.c | 71 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 46 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34e658b..68fcd2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2933,7 +2933,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 		update_tg_load_avg(cfs_rq, 0);
 }
 
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/* Synchronize task with its cfs_rq */
+static inline void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
@@ -2944,19 +2945,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	cfs_rq_util_change(cfs_rq);
 }
 
-static inline void attach_age_load_task(struct rq *rq, struct task_struct *p)
-{
-	struct sched_entity *se = &p->se;
-
-	if (!sched_feat(ATTACH_AGE_LOAD))
-		return;
-
-	if (se->avg.last_update_time) {
-		__update_load_avg(cfs_rq_of(se)->avg.last_update_time, cpu_of(rq),
-				  &se->avg, 0, 0, NULL);
-	}
-}
-
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
@@ -3031,6 +3019,11 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
 }
 #endif
 
+static inline void reset_task_last_update_time(struct task_struct *p)
+{
+	p->se.avg.last_update_time = 0;
+}
+
 /*
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
@@ -3083,10 +3076,8 @@ dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
 
 static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void
 detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void attach_age_load_task(struct rq *rq, struct task_struct *p) {}
+static inline void reset_task_last_update_time(struct task_struct *p) {}
 
 static inline int idle_balance(struct rq *rq)
 {
@@ -8372,8 +8363,19 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	se->depth = se->parent ? se->parent->depth + 1 : 0;
 #endif
 
-	/* Synchronize task with its cfs_rq */
-	attach_entity_load_avg(cfs_rq, se);
+	/*
+	 * At this point, we don't do sched avgs attachment. Instead,
+	 * we uniformly do the attachment at enqueue time in all
+	 * scenarios:
+	 *
+	 * (1) We may have already done it (e.g., at enqueue_task()
+	 *     in __sched_setscheduler() when a task switches to fair class).
+	 *
+	 * (2) We will do it right away (e.g., in sched_move_task() when
+	 *     an on_rq task moves between groups).
+	 *
+	 * (3) In the future when this currently !on_rq task is enqueued.
+	 */
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
@@ -8382,16 +8384,30 @@ static void attach_task_cfs_rq(struct task_struct *p)
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
+	/*
+	 * If the task changes back to the fair class, we will attach the
+	 * task's sched avgs when it is enqueued, as the task's last_update_time
+	 * is reset to 0.
+	 *
+	 * This simplifies sched avgs attachment when a task is switched
+	 * to the fair class, as it unifies the case where the task was
+	 * never fair class and the case where it previously was.
+	 *
+	 * Having lost last_update_time, we can't age the task's sched avgs
+	 * before attaching them, so the last updated sched avgs (in the
+	 * above detach_task_cfs_rq()) will be used.
+	 *
+	 * The time between switched_from_fair() and switched_to_fair()
+	 * could well have aged the entire load/util away anyway, but
+	 * there is no better way to account for the lost record of
+	 * sched avgs during that time; therefore, we simply lean toward
+	 * no aging at all.
+	 */
+	reset_task_last_update_time(p);
 }
 
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
-	/*
-	 * If we change between classes, age the averages before attaching them.
-	 * XXX: we could have just aged the entire load away if we've been
-	 * absent from the fair class for too long.
-	 */
-	attach_age_load_task(rq, p);
 	attach_task_cfs_rq(p);
 
 	if (task_on_rq_queued(p)) {
@@ -8444,6 +8460,11 @@ static void task_move_group_fair(struct task_struct *p)
 	detach_task_cfs_rq(p);
 	set_task_rq(p, task_cpu(p));
 	attach_task_cfs_rq(p);
+	/*
+	 * This ensures we will attach the sched avgs when the task is
+	 * enqueued, as the task's last_update_time is reset to 0.
+	 */
+	reset_task_last_update_time(p);
 }
 
 void free_fair_sched_group(struct task_group *tg)
-- 
1.7.9.5
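The same toy model extends to scenario (2), the task group change:
detach folds the averages out of the old group's cfs_rq,
task_move_group_fair() only resets last_update_time, and the single
attach happens when the task is enqueued on the new group's cfs_rq.
Again a sketch with invented toy_* names, not kernel code:

#include <stdio.h>

struct toy_se {				/* stand-in for sched_entity::avg */
	unsigned long load_avg;
	unsigned long long last_update_time;
};

struct toy_cfs_rq {			/* stand-in for cfs_rq::avg */
	unsigned long load_avg;
	unsigned long long clock;
};

/* Models attach_entity_load_avg(): fold the entity's avg into the cfs_rq. */
static void toy_attach(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	se->last_update_time = cfs_rq->clock;
	cfs_rq->load_avg += se->load_avg;
}

/* Models detach_entity_load_avg(): fold the entity's avg back out. */
static void toy_detach(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	cfs_rq->load_avg -= se->load_avg;
}

/* Enqueue is the only place that attaches, and only if not attached yet. */
static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	if (!se->last_update_time)
		toy_attach(cfs_rq, se);
}

int main(void)
{
	struct toy_cfs_rq old_grp = { .load_avg = 0, .clock = 100 };
	struct toy_cfs_rq new_grp = { .load_avg = 0, .clock = 200 };
	struct toy_se se = { .load_avg = 512, .last_update_time = 0 };

	toy_enqueue(&old_grp, &se);	/* task runs in the old group   */

	/* Group change, mirroring task_move_group_fair() after the patch: */
	toy_detach(&old_grp, &se);	/* detach_task_cfs_rq()          */
	se.last_update_time = 0;	/* reset_task_last_update_time() */
	toy_enqueue(&new_grp, &se);	/* attach happens here, once     */

	printf("old group load_avg = %lu, new group load_avg = %lu\n",
	       old_grp.load_avg, new_grp.load_avg);
	return 0;
}

The sketch ends with the old group's load_avg back at 0 and the new
group's at 512: the load is carried by exactly one cfs_rq and was
attached exactly once, at enqueue time.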