From: Vincent Donnefort <vincent.donnefort@arm.com>
To: peterz@infradead.org, mingo@redhat.com, vincent.guittot@linaro.org
Cc: linux-kernel@vger.kernel.org, dietmar.eggemann@arm.com,
	valentin.schneider@arm.com, morten.rasmussen@arm.com,
	chris.redpath@arm.com, qperret@google.com, lukasz.luba@arm.com,
	Vincent Donnefort <vincent.donnefort@arm.com>
Subject: [PATCH 2/4] sched/fair: Decay task PELT values during migration
Date: Thu,  9 Dec 2021 16:11:57 +0000	[thread overview]
Message-ID: <20211209161159.1596018-3-vincent.donnefort@arm.com> (raw)
In-Reply-To: <20211209161159.1596018-1-vincent.donnefort@arm.com>

Before being migrated to a new CPU, a task sees its PELT values
synchronized with the rq's last_update_time. Once done, that same task
also has its sched_avg last_update_time reset. This means the time between
the last clock update and the migration (B) will not be accounted for in
util_avg, and a discontinuity will appear. This issue is amplified by the
PELT clock scaling: if the clock hasn't been updated while the CPU was
idle, clock_pelt will not be aligned with clock_task and that time (A)
will also be lost.

   ---------|----- A -----|-----------|------- B -----|>
        clock_pelt   clock_task     clock            now
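
For instance (illustrative figures only), if clock_pelt lags clock_task by
2ms of unaccounted idle time (A) and the migration happens 3ms after the
last clock update (B), the migrating task misses roughly 5ms of decay,
leaving its util_avg inflated on the new CPU.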

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock,
which is quite expensive [1]. Instead, estimate the missing time and
update the task's util_avg with that value:

  A + B = clock_task - clock_pelt + sched_clock_cpu() - clock

Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. A new runqueue clock, clock_pelt_lag, is therefore
introduced to encode those three values:

  clock_pelt_lag = clock - clock_task + clock_pelt

We can then write the missing time as follows:

  A + B = sched_clock_cpu() - clock_pelt_lag
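
Expanding clock_pelt_lag confirms this is indeed the missing time:

  sched_clock_cpu() - clock_pelt_lag
    = sched_clock_cpu() - (clock - clock_task + clock_pelt)
    = (clock_task - clock_pelt) + (sched_clock_cpu() - clock)
    = A + B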

The B part of the missing time is, however, an estimation that doesn't
take IRQ and paravirt time into account.

Now that we have an estimation for A + B, we can build an estimator for
the PELT clock value at the time of the migration. For this purpose, we
need to inject last_update_time, which is a combination of both clock_pelt
and lost_idle_time. The latter is a time value that is completely lost
from a PELT point of view and must be ignored. We can finally write:

  rq_clock_pelt_estimator() = last_update_time + A + B
                            = last_update_time +
                                   sched_clock_cpu() - clock_pelt_lag
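
As a rough sketch of how this is wired up (mirroring the fair.c hunks
below; u64_u32_load() is assumed to come from patch 1/4 of this series):

  static u64 rq_clock_pelt_estimator(struct rq *rq)
  {
  	/* A + B: time elapsed since clock_pelt_lag was last stored */
  	u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
  		       u64_u32_load(rq->clock_pelt_lag);

  	/* Project the rq's PELT clock forward to "now" */
  	return cfs_rq_last_update_time(&rq->cfs) + pelt_lag;
  }

and, on the migration path where the runqueue lock isn't held:

  	remove_entity_load_avg(se);
  	__update_load_avg_blocked_se(rq_clock_pelt_estimator(rq), se);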

[1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2611b9cf503..3649716d1de7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -607,6 +607,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	}
 }
 
+static void update_rq_clock_pelt_lag(struct rq *rq)
+{
+	u64_u32_store(rq->clock_pelt_lag,
+		      rq->clock - rq->clock_task + rq->clock_pelt);
+}
+
 /*
  * RQ-clock updating methods:
  */
@@ -663,6 +669,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 		update_irq_load_avg(rq, irq_delta + steal);
 #endif
 	update_rq_clock_pelt(rq, delta);
+	update_rq_clock_pelt_lag(rq);
 }
 
 void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e6182dabc99..9d640d519866 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6899,6 +6899,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 
 static void detach_entity_cfs_rq(struct sched_entity *se);
 
+static u64 rq_clock_pelt_estimator(struct rq *rq)
+{
+	u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
+		       u64_u32_load(rq->clock_pelt_lag);
+
+	return cfs_rq_last_update_time(&rq->cfs) + pelt_lag;
+}
+
 /*
  * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -6906,6 +6914,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
  */
 static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
+	struct sched_entity *se = &p->se;
+	struct rq *rq = task_rq(p);
+
 	/*
 	 * As blocked tasks retain absolute vruntime the migration needs to
 	 * deal with this by subtracting the old and adding the new
@@ -6913,7 +6924,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	 * the task on the new runqueue.
 	 */
 	if (READ_ONCE(p->__state) == TASK_WAKING) {
-		struct sched_entity *se = &p->se;
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
@@ -6924,26 +6934,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_rq_held(task_rq(p));
-		detach_entity_cfs_rq(&p->se);
+		lockdep_assert_rq_held(rq);
+		detach_entity_cfs_rq(se);
 
 	} else {
+		remove_entity_load_avg(se);
+
 		/*
-		 * We are supposed to update the task to "current" time, then
-		 * its up to date and ready to go to new CPU/cfs_rq. But we
-		 * have difficulty in getting what current time is, so simply
-		 * throw away the out-of-date time. This will result in the
-		 * wakee task is less decayed, but giving the wakee more load
-		 * sounds not bad.
+		 * Here, the task's PELT values have been updated according to
+		 * the current rq's clock. But if that clock hasn't been
+		 * updated in a while, a substantial idle time will be missed,
+		 * leading to an inflation after wake-up on the new rq.
+		 *
+		 * Estimate the PELT clock lag, and update sched_avg to ensure
+		 * PELT continuity after migration.
 		 */
-		remove_entity_load_avg(&p->se);
+		__update_load_avg_blocked_se(rq_clock_pelt_estimator(rq), se);
 	}
 
 	/* Tell new CPU we are migrated */
-	p->se.avg.last_update_time = 0;
+	se->avg.last_update_time = 0;
 
 	/* We have migrated, no longer consider this task hot */
-	p->se.exec_start = 0;
+	se->exec_start = 0;
 
 	update_scan_period(p, new_cpu);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bd2058d7bb5..6cd128de8666 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1027,8 +1027,13 @@ struct rq {
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
 	u64			clock_pelt;
+	u64			clock_pelt_lag;
 	unsigned long		lost_idle_time;
 
+#ifndef CONFIG_64BIT
+	u64                     clock_pelt_lag_copy;
+#endif
+
 	atomic_t		nr_iowait;
 
 #ifdef CONFIG_SCHED_DEBUG
-- 
2.25.1

