* [PATCH v3 0/7] feec() energy margin removal
@ 2022-04-12 13:42 Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath,
	qperret, Vincent Donnefort

find_energy_efficient() (feec()) will migrate a task to save energy only
if it saves at least 6% of the total energy consumed by the system. This
conservative approach is a problem on a system where a lot of small tasks
create a huge load overall: very few of them will be allowed to migrate
to a smaller CPU, wasting a lot of energy. Instead of trying to determine yet
another margin, let's try to remove it.
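
For reference, the check that the last patch of this series removes only
accepts a migration away from prev_cpu when the estimated saving exceeds
roughly 6% (a ">> 4", i.e. 1/16) of prev_delta + base_energy:

  if ((prev_delta == ULONG_MAX) ||
      (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
          target = best_energy_cpu;

With the margin gone, the decision simply becomes:

  if (best_delta < prev_delta)
          target = best_energy_cpu;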

The first elements of this patch-set are various fixes and improvements that
stabilize task_util and ensure energy comparison fairness across all CPUs of
the topology. Only once those are fixed can we completely remove the margin and
let feec() aggressively place tasks and save energy.

This has been validated in two different ways:

First, using LISA's eas_behaviour test suite. This is composed of a set of
scenarios which verify whether the task placement is optimal. No failures have
been observed and it also improved some tests such as Ramp-Down (as the
placement is now more energy oriented) and *ThreeSmall (as no bouncing between
clusters happens anymore).

  * Hikey960: 100% PASSED
  * DB-845C:  100% PASSED
  * RB5:      100% PASSED

Second, using an Android benchmark: PCMark2 on a Pixel4, with a lot of
backports to have a scheduler as close as possible to mainline.

  +------------+-----------------+-----------------+
  |    Test    |      Perf       |    Energy [1]   |
  +------------+-----------------+-----------------+
  | Web2       | -0.3% pval 0.03 | -1.8% pval 0.00 |
  | Video2     | -0.3% pval 0.13 | -5.6% pval 0.00 |
  | Photo2 [2] | -3.8% pval 0.00 | -1%   pval 0.00 |
  | Writing2   |  0%   pval 0.13 | -1%   pval 0.00 |
  | Data2      |  0%   pval 0.8  | -0.43% pval 0.00|
  +------------+-----------------+-----------------+

The margin removal lets the kernel make the best use of the Energy Model:
tasks are more likely to be placed where they fit and this saves a
substantial amount of energy, while having a limited impact on performance.

[1] This is an energy estimation based on the CPU activity and the Energy Model
for this device. "All models are wrong but some are useful"; yes, this is an
imperfect estimation that doesn't take into account some idle states and shared
power rails. Nonetheless this is based on the information the kernel has during
runtime and it proves the scheduler can take better decisions based solely on
those data.

[2] This is the only performance impact observed. The debugging of this test
showed no issue with task placement. The previously better score was solely due
to some critical threads being held on better performing CPUs. If a thread needs
a higher capacity CPU, the placement must result from a user input (with e.g.
uclamp min) instead of the task being artificially held on less efficient CPUs
by feec(). Notice also that the experiment didn't use the Android-only
latency_sensitive feature, which would hide this problem on a real-life device.

v3 -> v4:
  - Minor cosmetic changes (Dietmar)

v2 -> v3:
  - feec(): introduce energy_env struct (Dietmar)
  - PELT migration decay: Only apply when src CPU is idle (Vincent G.)
  - PELT migration decay: Do not apply when cfs_rq is throttled
  - PELT migration decay: Snapshot the lag at cfs_rq's level

v1 -> v2:
  - Fix PELT migration last_update_time (previously root cfs_rq's).
  - Add Dietmar's patches to refactor feec()'s CPU loop.
  - feec(): renaming busy time functions get_{pd,tsk}_busy_time()
  - feec(): pd_cap computation in the first for_each_cpu loop.
  - feec(): create get_pd_max_util() function (previously within compute_energy())
  - feec(): rename base_energy_pd to base_energy.

Dietmar Eggemann (3):
  sched, drivers: Remove max param from
    effective_cpu_util()/sched_cpu_util()
  sched/fair: Rename select_idle_mask to select_rq_mask
  sched/fair: Use the same cpumask per-PD throughout
    find_energy_efficient_cpu()

Vincent Donnefort (4):
  sched/fair: Provide u64 read for 32-bits arch helper
  sched/fair: Decay task PELT values during wakeup migration
  sched/fair: Remove task_util from effective utilization in feec()
  sched/fair: Remove the energy margin in feec()

 drivers/powercap/dtpm_cpu.c       |  33 +--
 drivers/thermal/cpufreq_cooling.c |   6 +-
 include/linux/sched.h             |   2 +-
 kernel/sched/core.c               |  15 +-
 kernel/sched/cpufreq_schedutil.c  |   5 +-
 kernel/sched/fair.c               | 385 ++++++++++++++++++------------
 kernel/sched/sched.h              |  49 +++-
 7 files changed, 298 insertions(+), 197 deletions(-)

-- 
2.25.1


* [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
@ 2022-04-12 13:42 ` Vincent Donnefort
  2022-04-19 12:42   ` Tao Zhou
  2022-04-12 13:42 ` [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration Vincent Donnefort
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath,
	qperret, Vincent Donnefort

Introduce the macro helpers u64_u32_{store,load}() to factorize lockless
accesses to u64 variables on 32-bit architectures.

Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
accommodate the latter, where the copy lies outside of the structure
(cfs_rq.last_update_time_copy instead of sched_avg.last_update_time_copy),
use the _copy() version of those helpers.

Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
therefore have a small penalty in set_task_rq_fair() and init_cfs_rq().
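
For illustration, the intended usage, condensed from the call sites changed
below (the copy is implicit for u64_u32_store()/u64_u32_load() and explicit
for the _copy() variants):

  /* writer, e.g. update_min_vruntime(); uses min_vruntime_copy on 32-bit */
  u64_u32_store(cfs_rq->min_vruntime, vruntime);

  /* lockless reader, e.g. migrate_task_rq_fair() */
  se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);

  /* the copy lies in cfs_rq rather than in sched_avg */
  return u64_u32_load_copy(cfs_rq->avg.last_update_time,
                           cfs_rq->last_update_time_copy);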

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3eba0dcc4962..5dd38c9df0cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -600,11 +600,8 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
-#ifndef CONFIG_64BIT
-	smp_wmb();
-	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
-#endif
+	u64_u32_store(cfs_rq->min_vruntime,
+		      max_vruntime(cfs_rq->min_vruntime, vruntime));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -3301,6 +3298,11 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
 }
 
 #ifdef CONFIG_SMP
+static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
+{
+	return u64_u32_load_copy(cfs_rq->avg.last_update_time,
+				 cfs_rq->last_update_time_copy);
+}
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
@@ -3411,27 +3413,9 @@ void set_task_rq_fair(struct sched_entity *se,
 	if (!(se->avg.last_update_time && prev))
 		return;
 
-#ifndef CONFIG_64BIT
-	{
-		u64 p_last_update_time_copy;
-		u64 n_last_update_time_copy;
-
-		do {
-			p_last_update_time_copy = prev->load_last_update_time_copy;
-			n_last_update_time_copy = next->load_last_update_time_copy;
-
-			smp_rmb();
+	p_last_update_time = cfs_rq_last_update_time(prev);
+	n_last_update_time = cfs_rq_last_update_time(next);
 
-			p_last_update_time = prev->avg.last_update_time;
-			n_last_update_time = next->avg.last_update_time;
-
-		} while (p_last_update_time != p_last_update_time_copy ||
-			 n_last_update_time != n_last_update_time_copy);
-	}
-#else
-	p_last_update_time = prev->avg.last_update_time;
-	n_last_update_time = next->avg.last_update_time;
-#endif
 	__update_load_avg_blocked_se(p_last_update_time, se);
 	se->avg.last_update_time = n_last_update_time;
 }
@@ -3786,8 +3770,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
 
 #ifndef CONFIG_64BIT
-	smp_wmb();
-	cfs_rq->load_last_update_time_copy = sa->last_update_time;
+	u64_u32_store_copy(sa->last_update_time,
+			   cfs_rq->last_update_time_copy,
+			   sa->last_update_time);
 #endif
 
 	return decayed;
@@ -3921,27 +3906,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	}
 }
 
-#ifndef CONFIG_64BIT
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
-	u64 last_update_time_copy;
-	u64 last_update_time;
-
-	do {
-		last_update_time_copy = cfs_rq->load_last_update_time_copy;
-		smp_rmb();
-		last_update_time = cfs_rq->avg.last_update_time;
-	} while (last_update_time != last_update_time_copy);
-
-	return last_update_time;
-}
-#else
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->avg.last_update_time;
-}
-#endif
-
 /*
  * Synchronize entity load avg of dequeued entity without locking
  * the previous rq.
@@ -6991,21 +6955,8 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	if (READ_ONCE(p->__state) == TASK_WAKING) {
 		struct sched_entity *se = &p->se;
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		u64 min_vruntime;
 
-#ifndef CONFIG_64BIT
-		u64 min_vruntime_copy;
-
-		do {
-			min_vruntime_copy = cfs_rq->min_vruntime_copy;
-			smp_rmb();
-			min_vruntime = cfs_rq->min_vruntime;
-		} while (min_vruntime != min_vruntime_copy);
-#else
-		min_vruntime = cfs_rq->min_vruntime;
-#endif
-
-		se->vruntime -= min_vruntime;
+		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
 	}
 
 	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
@@ -11453,10 +11404,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
-#ifndef CONFIG_64BIT
-	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
-#endif
+	u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 762be73972bd..e2cf6e48b165 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -513,6 +513,45 @@ struct cfs_bandwidth { };
 
 #endif	/* CONFIG_CGROUP_SCHED */
 
+/*
+ * u64_u32_load/u64_u32_store
+ *
+ * Use a copy of a u64 value to protect against data race. This is only
+ * applicable for 32-bits architectures.
+ */
+#ifdef CONFIG_64BIT
+# define u64_u32_load_copy(var, copy)       var
+# define u64_u32_store_copy(var, copy, val) (var = val)
+#else
+# define u64_u32_load_copy(var, copy)					\
+({									\
+	u64 __val, __val_copy;						\
+	do {								\
+		__val_copy = copy;					\
+		/*							\
+		 * paired with u64_u32_store, ordering access		\
+		 * to var and copy.					\
+		 */							\
+		smp_rmb();						\
+		__val = var;						\
+	} while (__val != __val_copy);					\
+	__val;								\
+})
+# define u64_u32_store_copy(var, copy, val)				\
+do {									\
+	typeof(val) __val = (val);					\
+	var = __val;							\
+	/*								\
+	 * paired with u64_u32_load, ordering access to var and		\
+	 * copy.							\
+	 */								\
+	smp_wmb();							\
+	copy = __val;							\
+} while (0)
+#endif
+# define u64_u32_load(var)      u64_u32_load_copy(var, var##_copy)
+# define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val)
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
@@ -553,7 +592,7 @@ struct cfs_rq {
 	 */
 	struct sched_avg	avg;
 #ifndef CONFIG_64BIT
-	u64			load_last_update_time_copy;
+	u64			last_update_time_copy;
 #endif
 	struct {
 		raw_spinlock_t	lock ____cacheline_aligned;
-- 
2.25.1


* [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
@ 2022-04-12 13:42 ` Vincent Donnefort
  2022-04-19 10:08   ` Vincent Guittot
  2022-04-22 15:14   ` Tao Zhou
  2022-04-12 13:42 ` [PATCH v4 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Vincent Donnefort
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath,
	qperret, Vincent Donnefort

Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update (B) will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. If the clock hasn't been updated while the CPU is
idle, clock_pelt will not be aligned with clock_task and that time (A)
will be also lost.

   ---------|----- A -----|-----------|------- B -----|>
        clock_pelt   clock_task     clock            now

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value:

  A + B = clock_task - clock_pelt + sched_clock_cpu() - clock

Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. The new cfs_rq last_update_lag is therefore created and
contains those three values when the last_update_time value for that very
same cfs_rq is updated.

  last_update_lag = clock - clock_task + clock_pelt

And we can then write the missing time as follow:

  A + B = sched_clock_cpu() - last_update_lag

The B part of the missing time is however an estimation that doesn't take
into account IRQ and Paravirt time.

Now that we have an estimation for A + B, we can create an estimator for the
PELT value at the time of the migration. We need for this purpose to
inject last_update_time which is a combination of both clock_pelt and
lost_idle_time. The latter is a time value which is completely lost from a
PELT point of view and must be ignored. And finally, we can write:

  now = last_update_time + A + B
      = last_update_time + sched_clock_cpu() - last_update_lag
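
As a purely illustrative example (made-up numbers): if, at the last cfs_rq
update, clock_pelt = 1000us, clock_task = 1300us and clock = 1600us, then
last_update_lag = 1600 - 1300 + 1000 = 1300us. If sched_clock_cpu() reads
2000us at migration time, the estimated missing time is A + B = 2000 - 1300
= 700us (A = 300us, B = 400us), which is added on top of last_update_time
before decaying the task's sched_avg.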

This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
usage to the case where the source CPU is idle, as we know this is when the
clock has the biggest risk of being outdated.

[1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5dd38c9df0cc..e234d015657f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3694,6 +3694,57 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_NO_HZ_COMMON
+static inline void update_cfs_rq_lag(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+
+	u64_u32_store(cfs_rq->last_update_lag,
+#ifdef CONFIG_CFS_BANDWIDTH
+		      /* Timer stopped by throttling */
+		      unlikely(cfs_rq->throttle_count) ? U64_MAX :
+#endif
+		      rq->clock - rq->clock_task + rq->clock_pelt);
+}
+
+static inline void migrate_se_pelt_lag(struct sched_entity *se)
+{
+	u64 now, last_update_lag;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+	bool is_idle;
+
+	cfs_rq = cfs_rq_of(se);
+	rq = rq_of(cfs_rq);
+
+	rcu_read_lock();
+	is_idle = is_idle_task(rcu_dereference(rq->curr));
+	rcu_read_unlock();
+
+	/*
+	 * The lag estimation comes with a cost we don't want to pay all the
+	 * time. Hence, limiting to the case where the source CPU is idle and
+	 * we know we are at the greatest risk to have an outdated clock.
+	 */
+	if (!is_idle)
+		return;
+
+	last_update_lag = u64_u32_load(cfs_rq->last_update_lag);
+
+	/* The clock has been stopped for throttling */
+	if (last_update_lag == U64_MAX)
+		return;
+
+	now = se->avg.last_update_time - last_update_lag +
+	      sched_clock_cpu(cpu_of(rq));
+
+	__update_load_avg_blocked_se(now, se);
+}
+#else
+static void update_cfs_rq_lag(struct cfs_rq *cfs_rq) {}
+static void migrate_se_pelt_lag(struct sched_entity *se) {}
+#endif
+
 /**
  * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
  * @now: current time, as per cfs_rq_clock_pelt()
@@ -3774,6 +3825,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 			   cfs_rq->last_update_time_copy,
 			   sa->last_update_time);
 #endif
+	update_cfs_rq_lag(cfs_rq);
 
 	return decayed;
 }
@@ -6946,6 +6998,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
  */
 static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
+	struct sched_entity *se = &p->se;
+
 	/*
 	 * As blocked tasks retain absolute vruntime the migration needs to
 	 * deal with this by subtracting the old and adding the new
@@ -6953,7 +7007,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	 * the task on the new runqueue.
 	 */
 	if (READ_ONCE(p->__state) == TASK_WAKING) {
-		struct sched_entity *se = &p->se;
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
@@ -6965,25 +7018,28 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * rq->lock and can modify state directly.
 		 */
 		lockdep_assert_rq_held(task_rq(p));
-		detach_entity_cfs_rq(&p->se);
+		detach_entity_cfs_rq(se);
 
 	} else {
+		remove_entity_load_avg(se);
+
 		/*
-		 * We are supposed to update the task to "current" time, then
-		 * its up to date and ready to go to new CPU/cfs_rq. But we
-		 * have difficulty in getting what current time is, so simply
-		 * throw away the out-of-date time. This will result in the
-		 * wakee task is less decayed, but giving the wakee more load
-		 * sounds not bad.
+		 * Here, the task's PELT values have been updated according to
+		 * the current rq's clock. But if that clock hasn't been
+		 * updated in a while, a substantial idle time will be missed,
+		 * leading to an inflation after wake-up on the new rq.
+		 *
+		 * Estimate the missing time from the rq clock and update
+		 * sched_avg to improve the PELT continuity after migration.
 		 */
-		remove_entity_load_avg(&p->se);
+		migrate_se_pelt_lag(se);
 	}
 
 	/* Tell new CPU we are migrated */
-	p->se.avg.last_update_time = 0;
+	se->avg.last_update_time = 0;
 
 	/* We have migrated, no longer consider this task hot */
-	p->se.exec_start = 0;
+	se->exec_start = 0;
 
 	update_scan_period(p, new_cpu);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e2cf6e48b165..2f6446295e7d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -593,6 +593,12 @@ struct cfs_rq {
 	struct sched_avg	avg;
 #ifndef CONFIG_64BIT
 	u64			last_update_time_copy;
+#endif
+#ifdef CONFIG_NO_HZ_COMMON
+	u64			last_update_lag;
+#ifndef CONFIG_64BIT
+	u64                     last_update_lag_copy;
+#endif
 #endif
 	struct {
 		raw_spinlock_t	lock ____cacheline_aligned;
-- 
2.25.1


* [PATCH v4 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration Vincent Donnefort
@ 2022-04-12 13:42 ` Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 4/7] sched/fair: Rename select_idle_mask to select_rq_mask Vincent Donnefort
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath, qperret

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

effective_cpu_util() already has an `int cpu' parameter which allows
retrieving the CPU capacity scale factor (or maximum CPU capacity) inside
this function via arch_scale_cpu_capacity(cpu).

A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of the arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read, accessed via
per_cpu(cpu_scale, cpu) on such a system.
On all other architectures it is a compile-time constant
(SCHED_CAPACITY_SCALE).
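
The resulting call-site simplification looks as follows (condensed from the
diffs below):

  /* before */
  unsigned long max = arch_scale_cpu_capacity(cpu);
  unsigned long util = sched_cpu_util(cpu, max);

  /* after: the maximum capacity is fetched inside effective_cpu_util() */
  unsigned long util = sched_cpu_util(cpu);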

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index bca2f912d349..024dba4e6575 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -71,34 +71,19 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 
 static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
 {
-	unsigned long max = 0, sum_util = 0;
+	unsigned long max, sum_util = 0;
 	int cpu;
 
-	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-
-		/*
-		 * The capacity is the same for all CPUs belonging to
-		 * the same perf domain, so a single call to
-		 * arch_scale_cpu_capacity() is enough. However, we
-		 * need the CPU parameter to be initialized by the
-		 * loop, so the call ends up in this block.
-		 *
-		 * We can initialize 'max' with a cpumask_first() call
-		 * before the loop but the bits computation is not
-		 * worth given the arch_scale_cpu_capacity() just
-		 * returns a value where the resulting assembly code
-		 * will be optimized by the compiler.
-		 */
-		max = arch_scale_cpu_capacity(cpu);
-		sum_util += sched_cpu_util(cpu, max);
-	}
-
 	/*
-	 * In the improbable case where all the CPUs of the perf
-	 * domain are offline, 'max' will be zero and will lead to an
-	 * illegal operation with a zero division.
+	 * The capacity is the same for all CPUs belonging to
+	 * the same perf domain.
 	 */
-	return max ? (power * ((sum_util << 10) / max)) >> 10 : 0;
+	max = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+
+	for_each_cpu_and(cpu, pd_mask, cpu_online_mask)
+		sum_util += sched_cpu_util(cpu);
+
+	return (power * ((sum_util << 10) / max)) >> 10;
 }
 
 static u64 get_pd_power_uw(struct dtpm *dtpm)
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index 0bfb8eebd126..3f514ff3d9aa 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -137,11 +137,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
 		    int cpu_idx)
 {
-	unsigned long max = arch_scale_cpu_capacity(cpu);
-	unsigned long util;
+	unsigned long util = sched_cpu_util(cpu);
 
-	util = sched_cpu_util(cpu, max);
-	return (util * 100) / max;
+	return (util * 100) / arch_scale_cpu_capacity(cpu);
 }
 #else /* !CONFIG_SMP */
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 67f06f72c50e..c1705effb3a4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2255,7 +2255,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 }
 
 /* Returns effective CPU energy utilization, as seen by the scheduler */
-unsigned long sched_cpu_util(int cpu, unsigned long max);
+unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 068c088e9584..a62d25ec5b0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7061,12 +7061,14 @@ struct task_struct *idle_task(int cpu)
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum cpu_util_type type,
+				 enum cpu_util_type type,
 				 struct task_struct *p)
 {
-	unsigned long dl_util, util, irq;
+	unsigned long dl_util, util, irq, max;
 	struct rq *rq = cpu_rq(cpu);
 
+	max = arch_scale_cpu_capacity(cpu);
+
 	if (!uclamp_is_used() &&
 	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
 		return max;
@@ -7146,10 +7148,9 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	return min(max, util);
 }
 
-unsigned long sched_cpu_util(int cpu, unsigned long max)
+unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), max,
-				  ENERGY_UTIL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3dbf351d12d5..1207c78f85c1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -157,11 +157,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
 
-	sg_cpu->max = max;
+	sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu);
 	sg_cpu->bw_dl = cpu_bw_dl(rq);
-	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
+	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
 					  FREQUENCY_UTIL, NULL);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e234d015657f..7cd1fb073fcf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6697,12 +6697,11 @@ static long
 compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 {
 	struct cpumask *pd_mask = perf_domain_span(pd);
-	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
-	unsigned long max_util = 0, sum_util = 0;
-	unsigned long _cpu_cap = cpu_cap;
+	unsigned long max_util = 0, sum_util = 0, cpu_cap;
 	int cpu;
 
-	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+	cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
 
 	/*
 	 * The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6739,10 +6738,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
-					      ENERGY_UTIL, NULL);
+		cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
+					      NULL);
 
-		sum_util += min(cpu_util, _cpu_cap);
+		sum_util += min(cpu_util, cpu_cap);
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -6751,12 +6750,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
-					      FREQUENCY_UTIL, tsk);
-		max_util = max(max_util, min(cpu_util, _cpu_cap));
+		cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
+					      tsk);
+		max_util = max(max_util, min(cpu_util, cpu_cap));
 	}
 
-	return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
+	return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f6446295e7d..0b2eebbaa0d3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2874,7 +2874,7 @@ enum cpu_util_type {
 };
 
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum cpu_util_type type,
+				 enum cpu_util_type type,
 				 struct task_struct *p);
 
 static inline unsigned long cpu_bw_dl(struct rq *rq)
-- 
2.25.1


* [PATCH v4 4/7] sched/fair: Rename select_idle_mask to select_rq_mask
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
                   ` (2 preceding siblings ...)
  2022-04-12 13:42 ` [PATCH v4 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Vincent Donnefort
@ 2022-04-12 13:42 ` Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() Vincent Donnefort
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath, qperret

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Decouple the name of the per-cpu cpumask select_idle_mask from its usage
in select_idle_[cpu/capacity]() of the CFS run-queue selection
(select_task_rq_fair()).

This is to support the reuse of this cpumask in the Energy Aware
Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
selection.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a62d25ec5b0d..f3f5540bae9e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9456,7 +9456,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
-DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_rq_mask);
 
 void __init sched_init(void)
 {
@@ -9505,7 +9505,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
-		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+		per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7cd1fb073fcf..f1e78f6adc98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5848,7 +5848,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 
@@ -6338,7 +6338,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
@@ -6424,7 +6424,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
-	cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	task_util = uclamp_task_util(p);
-- 
2.25.1


* [PATCH v4 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
                   ` (3 preceding siblings ...)
  2022-04-12 13:42 ` [PATCH v4 4/7] sched/fair: Rename select_idle_mask to select_rq_mask Vincent Donnefort
@ 2022-04-12 13:42 ` Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath, qperret

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
invariant after Energy Model creation, i.e. it is not updated after
CPU hotplug operations.

That's why the PD mask is used in conjunction with the cpu_online_mask
(or Sched Domain cpumask). As a result, the cpu_online_mask is fetched
multiple times (in compute_energy()) during a run-queue selection
for a task.

cpu_online_mask may change during this time, which can lead to wrong
energy calculations.

To be able to avoid this, use the select_rq_mask per-cpu cpumask to
create a cpumask out of PD cpumask and cpu_online_mask and pass it
through the function calls of the EAS run-queue selection path.

The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
(find_energy_efficient_cpu()) is now ANDed not only with the SD mask
but also with the cpu_online_mask. This is fine since this cpumask
has to be in sync with the one used for energy computation
(compute_energy()).
An exclusive cpuset setup with at least one asymmetric CPU capacity
island (hence the additional AND with the SD cpumask) is the obvious
exception here.
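
The core of the change, condensed from the diff below: the PD span is ANDed
with cpu_online_mask once per PD into the per-CPU select_rq_mask and that
same mask is then used for both the candidate CPU loop and compute_energy():

  cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

  for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
          ...
  }

  base_energy_pd = compute_energy(p, -1, cpus, pd);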

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1e78f6adc98..97eb8afb336c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6694,14 +6694,14 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
  * task.
  */
 static long
-compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
+compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
+	       struct perf_domain *pd)
 {
-	struct cpumask *pd_mask = perf_domain_span(pd);
 	unsigned long max_util = 0, sum_util = 0, cpu_cap;
 	int cpu;
 
-	cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
-	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+	cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
+	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
 
 	/*
 	 * The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6712,7 +6712,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	 * If an entire pd is outside of the current rd, it will not appear in
 	 * its pd list and will not be accounted by compute_energy().
 	 */
-	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+	for_each_cpu(cpu, cpus) {
 		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
 		unsigned long cpu_util, util_running = util_freq;
 		struct task_struct *tsk = NULL;
@@ -6799,6 +6799,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
  */
 static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
 	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
 	int cpu, best_energy_cpu = prev_cpu, target = -1;
@@ -6833,7 +6834,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		unsigned long base_energy_pd;
 		int max_spare_cap_cpu = -1;
 
-		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
+		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
+
+		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
@@ -6870,12 +6873,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			continue;
 
 		/* Compute the 'base' energy of the pd, without @p */
-		base_energy_pd = compute_energy(p, -1, pd);
+		base_energy_pd = compute_energy(p, -1, cpus, pd);
 		base_energy += base_energy_pd;
 
 		/* Evaluate the energy impact of using prev_cpu. */
 		if (compute_prev_delta) {
-			prev_delta = compute_energy(p, prev_cpu, pd);
+			prev_delta = compute_energy(p, prev_cpu, cpus, pd);
 			if (prev_delta < base_energy_pd)
 				goto unlock;
 			prev_delta -= base_energy_pd;
@@ -6884,7 +6887,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 		/* Evaluate the energy impact of using max_spare_cap_cpu. */
 		if (max_spare_cap_cpu >= 0) {
-			cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
+			cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
+						   pd);
 			if (cur_delta < base_energy_pd)
 				goto unlock;
 			cur_delta -= base_energy_pd;
-- 
2.25.1


* [PATCH v4 6/7] sched/fair: Remove task_util from effective utilization in feec()
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
                   ` (4 preceding siblings ...)
  2022-04-12 13:42 ` [PATCH v4 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() Vincent Donnefort
@ 2022-04-12 13:42 ` Vincent Donnefort
  2022-04-12 13:42 ` [PATCH v4 7/7] sched/fair: Remove the energy margin " Vincent Donnefort
  2022-04-14 19:32 ` [PATCH v3 0/7] feec() energy margin removal Dietmar Eggemann
  7 siblings, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath,
	qperret, Vincent Donnefort

The energy estimation in find_energy_efficient_cpu() (feec()) relies on
the computation of the effective utilization for each CPU of a perf domain
(PD). This effective utilization is then used as an estimation of the busy
time for this pd. The function effective_cpu_util(), which gives this value,
scales the utilization relative to the IRQ pressure on the CPU, to take into
account that the IRQ time is hidden from the task clock. The IRQ scaling is
as follows:

   effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.

If now we take as an example a task placement which doesn't raise the OPP
on the candidate CPU, we can write the energy delta as:

  delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
                             effective_cpu_util(cpu_util))
        = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util

We end up with an energy delta depending on the IRQ avg time, which is a
problem: first, the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by a task. Second, we don't want
to favour a CPU with a higher IRQ avg time value.
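
As an illustration (made-up numbers): with cpu_cap = 1024 and task_util = 100,
the (cpu_cap - irq)/cpu_cap factor is 1 for irq = 0 but only 0.5 for
irq = 512. The very same task would then appear half as costly on the CPU
spending half of its time in IRQ, biasing the placement towards it.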

Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time where the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.

No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97eb8afb336c..d17ef80487e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6687,61 +6687,97 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 }
 
 /*
- * compute_energy(): Estimates the energy that @pd would consume if @p was
- * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
- * landscape of @pd's CPUs after the task migration, and uses the Energy Model
- * to compute what would be the energy if we decided to actually migrate that
- * task.
+ * energy_env - Utilization landscape for energy estimation.
+ * @task_busy_time: Utilization contribution by the task for which we test the
+ *                  placement. Given by eenv_task_busy_time().
+ * @pd_busy_time:   Utilization of the whole perf domain without the task
+ *                  contribution. Given by eenv_pd_busy_time().
+ * @cpu_cap:        Maximum CPU capacity for the perf domain.
+ * @pd_cap:         Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
+ */
+struct energy_env {
+	unsigned long task_busy_time;
+	unsigned long pd_busy_time;
+	unsigned long cpu_cap;
+	unsigned long pd_cap;
+};
+
+/*
+ * Compute the task busy time for compute_energy(). This time cannot be
+ * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * The latter only makes sense with the most recent CPUs where the task has
+ * run.
  */
-static long
-compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
-	       struct perf_domain *pd)
+static inline void eenv_task_busy_time(struct energy_env *eenv,
+				       struct task_struct *p, int prev_cpu)
 {
-	unsigned long max_util = 0, sum_util = 0, cpu_cap;
+	unsigned long max_cap = arch_scale_cpu_capacity(prev_cpu);
+	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
+
+	if (unlikely(irq >= max_cap)) {
+		eenv->task_busy_time = max_cap;
+		return;
+	}
+
+	eenv->task_busy_time =
+		scale_irq_capacity(task_util_est(p), irq, max_cap);
+}
+
+/*
+ * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
+ * utilization for each @pd_cpus, it however doesn't take into account
+ * clamping since the ratio (utilization / cpu_capacity) is already enough to
+ * scale the EM reported power consumption at the (eventually clamped)
+ * cpu_capacity.
+ *
+ * The contribution of the task @p for which we want to estimate the
+ * energy cost is removed (by cpu_util_next()) and must be calculated
+ * separately (see eenv_task_busy_time). This ensures:
+ *
+ *   - A stable PD utilization, no matter which CPU of that PD we want to place
+ *     the task on.
+ *
+ *   - A fair comparison between CPUs as the task contribution (task_util())
+ *     will always be the same no matter which CPU utilization we rely on
+ *     (util_avg or util_est).
+ *
+ * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
+ * exceed @eenv->pd_cap.
+ */
+static inline void eenv_pd_busy_time(struct energy_env *eenv,
+				     struct cpumask *pd_cpus,
+				     struct task_struct *p)
+{
+	unsigned long busy_time = 0;
 	int cpu;
 
-	cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
-	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
+	for_each_cpu(cpu, pd_cpus) {
+		unsigned long util = cpu_util_next(cpu, p, -1);
 
-	/*
-	 * The capacity state of CPUs of the current rd can be driven by CPUs
-	 * of another rd if they belong to the same pd. So, account for the
-	 * utilization of these CPUs too by masking pd with cpu_online_mask
-	 * instead of the rd span.
-	 *
-	 * If an entire pd is outside of the current rd, it will not appear in
-	 * its pd list and will not be accounted by compute_energy().
-	 */
-	for_each_cpu(cpu, cpus) {
-		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
-		unsigned long cpu_util, util_running = util_freq;
-		struct task_struct *tsk = NULL;
+		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+	}
 
-		/*
-		 * When @p is placed on @cpu:
-		 *
-		 * util_running = max(cpu_util, cpu_util_est) +
-		 *		  max(task_util, _task_util_est)
-		 *
-		 * while cpu_util_next is: max(cpu_util + task_util,
-		 *			       cpu_util_est + _task_util_est)
-		 */
-		if (cpu == dst_cpu) {
-			tsk = p;
-			util_running =
-				cpu_util_next(cpu, p, -1) + task_util_est(p);
-		}
+	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
+}
 
-		/*
-		 * Busy time computation: utilization clamping is not
-		 * required since the ratio (sum_util / cpu_capacity)
-		 * is already enough to scale the EM reported power
-		 * consumption at the (eventually clamped) cpu_capacity.
-		 */
-		cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
-					      NULL);
+/*
+ * Compute the maximum utilization for compute_energy() when the task @p
+ * is placed on the cpu @dst_cpu.
+ *
+ * Returns the maximum utilization among @eenv->cpus. This utilization can't
+ * exceed @eenv->cpu_cap.
+ */
+static inline unsigned long
+eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
+		 struct task_struct *p, int dst_cpu)
+{
+	unsigned long max_util = 0;
+	int cpu;
 
-		sum_util += min(cpu_util, cpu_cap);
+	for_each_cpu(cpu, pd_cpus) {
+		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
+		unsigned long util = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long cpu_util;
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -6750,12 +6786,30 @@ compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
-					      tsk);
-		max_util = max(max_util, min(cpu_util, cpu_cap));
+		cpu_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+		max_util = max(max_util, cpu_util);
 	}
 
-	return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
+	return min(max_util, eenv->cpu_cap);
+}
+
+/*
+ * compute_energy(): Use the Energy Model to estimate the energy that @pd would
+ * consume for a given utilization landscape @eenv. If @dst_cpu < 0 the task
+ * contribution is removed from the energy estimation.
+ */
+static inline unsigned long
+compute_energy(struct energy_env *eenv, struct perf_domain *pd,
+	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
+{
+	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
+	unsigned long busy_time = eenv->pd_busy_time;
+
+	if (dst_cpu >= 0)
+		busy_time = min(eenv->pd_cap,
+				eenv->pd_busy_time + eenv->task_busy_time);
+
+	return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
 }
 
 /*
@@ -6801,11 +6855,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
-	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
 	int cpu, best_energy_cpu = prev_cpu, target = -1;
-	unsigned long cpu_cap, util, base_energy = 0;
+	struct root_domain *rd = this_rq()->rd;
+	unsigned long base_energy = 0;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
+	struct energy_env eenv;
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
@@ -6828,22 +6883,36 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	if (!task_util_est(p))
 		goto unlock;
 
+	eenv_task_busy_time(&eenv, p, prev_cpu);
+
 	for (; pd; pd = pd->next) {
-		unsigned long cur_delta, spare_cap, max_spare_cap = 0;
+		unsigned long cpu_cap, cpu_thermal_cap, util;
+		unsigned long cur_delta, max_spare_cap = 0;
 		bool compute_prev_delta = false;
 		unsigned long base_energy_pd;
 		int max_spare_cap_cpu = -1;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
-		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
+		/* Account thermal pressure for the energy estimation */
+		cpu = cpumask_first(cpus);
+		cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
+		cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+
+		eenv.cpu_cap = cpu_thermal_cap;
+		eenv.pd_cap = 0;
+
+		for_each_cpu(cpu, cpus) {
+			eenv.pd_cap += cpu_thermal_cap;
+
+			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
+				continue;
+
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
 			util = cpu_util_next(cpu, p, cpu);
 			cpu_cap = capacity_of(cpu);
-			spare_cap = cpu_cap;
-			lsub_positive(&spare_cap, util);
 
 			/*
 			 * Skip CPUs that cannot satisfy the capacity request.
@@ -6856,15 +6925,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (!fits_capacity(util, cpu_cap))
 				continue;
 
+			lsub_positive(&cpu_cap, util);
+
 			if (cpu == prev_cpu) {
 				/* Always use prev_cpu as a candidate. */
 				compute_prev_delta = true;
-			} else if (spare_cap > max_spare_cap) {
+			} else if (cpu_cap > max_spare_cap) {
 				/*
 				 * Find the CPU with the maximum spare capacity
 				 * in the performance domain.
 				 */
-				max_spare_cap = spare_cap;
+				max_spare_cap = cpu_cap;
 				max_spare_cap_cpu = cpu;
 			}
 		}
@@ -6873,12 +6944,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			continue;
 
 		/* Compute the 'base' energy of the pd, without @p */
-		base_energy_pd = compute_energy(p, -1, cpus, pd);
+		eenv_pd_busy_time(&eenv, cpus, p);
+		base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
 		base_energy += base_energy_pd;
 
 		/* Evaluate the energy impact of using prev_cpu. */
 		if (compute_prev_delta) {
-			prev_delta = compute_energy(p, prev_cpu, cpus, pd);
+			prev_delta = compute_energy(&eenv, pd, cpus, p,
+						    prev_cpu);
 			if (prev_delta < base_energy_pd)
 				goto unlock;
 			prev_delta -= base_energy_pd;
@@ -6887,8 +6960,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 		/* Evaluate the energy impact of using max_spare_cap_cpu. */
 		if (max_spare_cap_cpu >= 0) {
-			cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
-						   pd);
+			cur_delta = compute_energy(&eenv, pd, cpus, p,
+						   max_spare_cap_cpu);
 			if (cur_delta < base_energy_pd)
 				goto unlock;
 			cur_delta -= base_energy_pd;
-- 
2.25.1


* [PATCH v4 7/7] sched/fair: Remove the energy margin in feec()
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
                   ` (5 preceding siblings ...)
  2022-04-12 13:42 ` [PATCH v4 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
@ 2022-04-12 13:42 ` Vincent Donnefort
  2022-04-14 19:32 ` [PATCH v3 0/7] feec() energy margin removal Dietmar Eggemann
  7 siblings, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-12 13:42 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, chris.redpath,
	qperret, Vincent Donnefort

find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth from one CPU to another. This margin is set to
6% of the total current energy estimated on the system. This however does
not work for two reasons:

1. The energy estimation is not a good absolute value:

compute_energy() used in feec() is a good estimation for task placement as
it allows to compare the energy with and without a task. The computed
delta will give a good overview of the cost for a certain task placement.
It, however, doesn't work as an absolute estimation for the total energy
of the system. First, it adds the contribution of idle CPUs into the
energy; second, it mixes util_avg with util_est values. util_avg contains
the near history of a CPU's usage; it doesn't tell at all what the current
utilization is. A system that has been quite busy in the near past will
hold a very high energy and hence a high margin, preventing any task
migration to a lower capacity CPU and wasting energy. It even creates a
negative feedback loop: by holding the tasks on a less efficient CPU, the
margin contributes to keeping the energy high.

2. The margin handicaps small tasks:

On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to
create a margin none of those tasks can cross. On a Pixel4, a small
utilization of 5% on all the CPUs creates a global estimated energy of 140
joules, as per the Energy Model declaration of that same device. This
means, after applying the 6% margin, that any migration must save more than
8 joules to happen. No task with a utilization lower than 40 would then be
able to migrate away from the biggest CPU of the system.
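
Note that the removed check (see the diff below) implements this margin as
(prev_delta + base_energy) >> 4, i.e. 1/16 or ~6%; on a ~140 joules estimate
this gives the "more than 8 joules" threshold quoted above.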

The 6% of the overall system energy was brought by the following patch:

 (eb92692b2544 sched/fair: Speed-up energy-aware wake-ups)

It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:

 (8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary)

We could simply revert that margin change to what it was, but the original
version didn't have strong grounds either and, as demonstrated in (1.), the
estimated energy isn't a good absolute value. Instead, remove it
completely. This is indeed made possible by recent changes that improved
energy estimation comparison fairness (sched/fair: Remove task_util from
effective utilization in feec()), (PM: EM: Increase energy calculation
precision) and task utilization stabilization (sched/fair: Decay task
util_avg during migration).

Without a margin, we could have feared bouncing between CPUs. But running
LISA's eas_behaviour test coverage on three different platforms (Hikey960,
RB-5 and DB-845) showed no issue.

Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d17ef80487e4..1779405c377b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6855,9 +6855,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
-	int cpu, best_energy_cpu = prev_cpu, target = -1;
 	struct root_domain *rd = this_rq()->rd;
-	unsigned long base_energy = 0;
+	int cpu, best_energy_cpu, target = -1;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
 	struct energy_env eenv;
@@ -6889,8 +6888,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		unsigned long cpu_cap, cpu_thermal_cap, util;
 		unsigned long cur_delta, max_spare_cap = 0;
 		bool compute_prev_delta = false;
-		unsigned long base_energy_pd;
 		int max_spare_cap_cpu = -1;
+		unsigned long base_energy;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -6945,16 +6944,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 		/* Compute the 'base' energy of the pd, without @p */
 		eenv_pd_busy_time(&eenv, cpus, p);
-		base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
-		base_energy += base_energy_pd;
+		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
 
 		/* Evaluate the energy impact of using prev_cpu. */
 		if (compute_prev_delta) {
 			prev_delta = compute_energy(&eenv, pd, cpus, p,
 						    prev_cpu);
-			if (prev_delta < base_energy_pd)
+			if (prev_delta < base_energy)
 				goto unlock;
-			prev_delta -= base_energy_pd;
+			prev_delta -= base_energy;
 			best_delta = min(best_delta, prev_delta);
 		}
 
@@ -6962,9 +6960,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		if (max_spare_cap_cpu >= 0) {
 			cur_delta = compute_energy(&eenv, pd, cpus, p,
 						   max_spare_cap_cpu);
-			if (cur_delta < base_energy_pd)
+			if (cur_delta < base_energy)
 				goto unlock;
-			cur_delta -= base_energy_pd;
+			cur_delta -= base_energy;
 			if (cur_delta < best_delta) {
 				best_delta = cur_delta;
 				best_energy_cpu = max_spare_cap_cpu;
@@ -6973,12 +6971,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	}
 	rcu_read_unlock();
 
-	/*
-	 * Pick the best CPU if prev_cpu cannot be used, or if it saves at
-	 * least 6% of the energy used by prev_cpu.
-	 */
-	if ((prev_delta == ULONG_MAX) ||
-	    (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
+	if (best_delta < prev_delta)
 		target = best_energy_cpu;
 
 	return target;
-- 
2.25.1


* Re: [PATCH v3 0/7] feec() energy margin removal
  2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
                   ` (6 preceding siblings ...)
  2022-04-12 13:42 ` [PATCH v4 7/7] sched/fair: Remove the energy margin " Vincent Donnefort
@ 2022-04-14 19:32 ` Dietmar Eggemann
  7 siblings, 0 replies; 19+ messages in thread
From: Dietmar Eggemann @ 2022-04-14 19:32 UTC (permalink / raw)
  To: Vincent Donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, morten.rasmussen, chris.redpath, qperret

On 12/04/2022 15:42, Vincent Donnefort wrote:

Nitpick: s/v3/v4 in cover letter

> find_energy_efficient() (feec()) will migrate a task to save energy only

s/find_energy_efficient/find_energy_efficient_cpu

[...]

LGTM. For the whole patch-set:

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-12 13:42 ` [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration Vincent Donnefort
@ 2022-04-19 10:08   ` Vincent Guittot
  2022-04-19 12:23     ` Vincent Donnefort
  2022-04-22 15:14   ` Tao Zhou
  1 sibling, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2022-04-19 10:08 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann, morten.rasmussen,
	chris.redpath, qperret

On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> Before being migrated to a new CPU, a task sees its PELT values
> synchronized with rq last_update_time. Once done, that same task will also
> have its sched_avg last_update_time reset. This means the time between
> the migration and the last clock update (B) will not be accounted for in
> util_avg and a discontinuity will appear. This issue is amplified by the
> PELT clock scaling. If the clock hasn't been updated while the CPU is
> idle, clock_pelt will not be aligned with clock_task and that time (A)
> will be also lost.
>
>    ---------|----- A -----|-----------|------- B -----|>
>         clock_pelt   clock_task     clock            now
>
> This is especially problematic for asymmetric CPU capacity systems which
> need stable util_avg signals for task placement and energy estimation.
>
> Ideally, this problem would be solved by updating the runqueue clocks
> before the migration. But that would require taking the runqueue lock
> which is quite expensive [1]. Instead estimate the missing time and update
> the task util_avg with that value:
>
>   A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>
> Neither clock_task, clock_pelt nor clock can be accessed without the
> runqueue lock. The new cfs_rq last_update_lag is therefore created and
> contains those three values when the last_update_time value for that very
> same cfs_rq is updated.
>
>   last_update_lag = clock - clock_task + clock_pelt
>
> And we can then write the missing time as follow:
>
>   A + B = sched_clock_cpu() - last_update_lag
>
> The B. part of the missing time is however an estimation that doesn't take
> into account IRQ and Paravirt time.
>
> Now we have an estimation for A + B, we can create an estimator for the
> PELT value at the time of the migration. We need for this purpose to
> inject last_update_time which is a combination of both clock_pelt and
> lost_idle_time. The latter is a time value which is completely lost from a
> PELT point of view and must be ignored. And finally, we can write:
>
>   now = last_update_time + A + B
>       = last_update_time + sched_clock_cpu() - last_update_lag
>
> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
> usage to the case where the source CPU is idle as we know this is when the
> clock is having the biggest risk of being outdated.
>
> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
>
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5dd38c9df0cc..e234d015657f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3694,6 +3694,57 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_NO_HZ_COMMON
> +static inline void update_cfs_rq_lag(struct cfs_rq *cfs_rq)
> +{
> +       struct rq *rq = rq_of(cfs_rq);
> +
> +       u64_u32_store(cfs_rq->last_update_lag,
> +#ifdef CONFIG_CFS_BANDWIDTH
> +                     /* Timer stopped by throttling */
> +                     unlikely(cfs_rq->throttle_count) ? U64_MAX :
> +#endif
> +                     rq->clock - rq->clock_task + rq->clock_pelt);

I'm worried that we will call this for each and every
update_cfs_rq_load_avg() whereas the content will be used only when
idle and not throttled. Can't we use these conditions to save values
only when needed and limit the number of useless updates ?

A quick test with hackbench on an 8-CPU system gives around 60k calls
for a 550 msec run at root level, and 180k from a 3rd-level cgroup.

> +}
> +
> +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> +{
> +       u64 now, last_update_lag;
> +       struct cfs_rq *cfs_rq;
> +       struct rq *rq;
> +       bool is_idle;
> +
> +       cfs_rq = cfs_rq_of(se);
> +       rq = rq_of(cfs_rq);
> +
> +       rcu_read_lock();
> +       is_idle = is_idle_task(rcu_dereference(rq->curr));
> +       rcu_read_unlock();
> +
> +       /*
> +        * The lag estimation comes with a cost we don't want to pay all the
> +        * time. Hence, limiting to the case where the source CPU is idle and
> +        * we know we are at the greatest risk to have an outdated clock.
> +        */
> +       if (!is_idle)
> +               return;
> +
> +       last_update_lag = u64_u32_load(cfs_rq->last_update_lag);
> +
> +       /* The clock has been stopped for throttling */
> +       if (last_update_lag == U64_MAX)
> +               return;
> +
> +       now = se->avg.last_update_time - last_update_lag +
> +             sched_clock_cpu(cpu_of(rq));
> +
> +       __update_load_avg_blocked_se(now, se);
> +}
> +#else
> +static void update_cfs_rq_lag(struct cfs_rq *cfs_rq) {}
> +static void migrate_se_pelt_lag(struct sched_entity *se) {}
> +#endif
> +
>  /**
>   * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
>   * @now: current time, as per cfs_rq_clock_pelt()
> @@ -3774,6 +3825,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                            cfs_rq->last_update_time_copy,
>                            sa->last_update_time);
>  #endif
> +       update_cfs_rq_lag(cfs_rq);
>
>         return decayed;
>  }
> @@ -6946,6 +6998,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
>   */
>  static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  {
> +       struct sched_entity *se = &p->se;
> +
>         /*
>          * As blocked tasks retain absolute vruntime the migration needs to
>          * deal with this by subtracting the old and adding the new
> @@ -6953,7 +7007,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>          * the task on the new runqueue.
>          */
>         if (READ_ONCE(p->__state) == TASK_WAKING) {
> -               struct sched_entity *se = &p->se;
>                 struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
>                 se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> @@ -6965,25 +7018,28 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>                  * rq->lock and can modify state directly.
>                  */
>                 lockdep_assert_rq_held(task_rq(p));
> -               detach_entity_cfs_rq(&p->se);
> +               detach_entity_cfs_rq(se);
>
>         } else {
> +               remove_entity_load_avg(se);
> +
>                 /*
> -                * We are supposed to update the task to "current" time, then
> -                * its up to date and ready to go to new CPU/cfs_rq. But we
> -                * have difficulty in getting what current time is, so simply
> -                * throw away the out-of-date time. This will result in the
> -                * wakee task is less decayed, but giving the wakee more load
> -                * sounds not bad.
> +                * Here, the task's PELT values have been updated according to
> +                * the current rq's clock. But if that clock hasn't been
> +                * updated in a while, a substantial idle time will be missed,
> +                * leading to an inflation after wake-up on the new rq.
> +                *
> +                * Estimate the missing time from the rq clock and update
> +                * sched_avg to improve the PELT continuity after migration.
>                  */
> -               remove_entity_load_avg(&p->se);
> +               migrate_se_pelt_lag(se);
>         }
>
>         /* Tell new CPU we are migrated */
> -       p->se.avg.last_update_time = 0;
> +       se->avg.last_update_time = 0;
>
>         /* We have migrated, no longer consider this task hot */
> -       p->se.exec_start = 0;
> +       se->exec_start = 0;
>
>         update_scan_period(p, new_cpu);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e2cf6e48b165..2f6446295e7d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -593,6 +593,12 @@ struct cfs_rq {
>         struct sched_avg        avg;
>  #ifndef CONFIG_64BIT
>         u64                     last_update_time_copy;
> +#endif
> +#ifdef CONFIG_NO_HZ_COMMON
> +       u64                     last_update_lag;
> +#ifndef CONFIG_64BIT
> +       u64                     last_update_lag_copy;
> +#endif
>  #endif
>         struct {
>                 raw_spinlock_t  lock ____cacheline_aligned;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-19 10:08   ` Vincent Guittot
@ 2022-04-19 12:23     ` Vincent Donnefort
  2022-04-19 16:27       ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-19 12:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann, morten.rasmussen,
	chris.redpath, qperret



On 19/04/2022 11:08, Vincent Guittot wrote:
> On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
> <vincent.donnefort@arm.com> wrote:
>>
>> Before being migrated to a new CPU, a task sees its PELT values
>> synchronized with rq last_update_time. Once done, that same task will also
>> have its sched_avg last_update_time reset. This means the time between
>> the migration and the last clock update (B) will not be accounted for in
>> util_avg and a discontinuity will appear. This issue is amplified by the
>> PELT clock scaling. If the clock hasn't been updated while the CPU is
>> idle, clock_pelt will not be aligned with clock_task and that time (A)
>> will be also lost.
>>
>>     ---------|----- A -----|-----------|------- B -----|>
>>          clock_pelt   clock_task     clock            now
>>
>> This is especially problematic for asymmetric CPU capacity systems which
>> need stable util_avg signals for task placement and energy estimation.
>>
>> Ideally, this problem would be solved by updating the runqueue clocks
>> before the migration. But that would require taking the runqueue lock
>> which is quite expensive [1]. Instead estimate the missing time and update
>> the task util_avg with that value:
>>
>>    A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>>
>> Neither clock_task, clock_pelt nor clock can be accessed without the
>> runqueue lock. The new cfs_rq last_update_lag is therefore created and
>> contains those three values when the last_update_time value for that very
>> same cfs_rq is updated.
>>
>>    last_update_lag = clock - clock_task + clock_pelt
>>
>> And we can then write the missing time as follow:
>>
>>    A + B = sched_clock_cpu() - last_update_lag
>>
>> The B. part of the missing time is however an estimation that doesn't take
>> into account IRQ and Paravirt time.
>>
>> Now we have an estimation for A + B, we can create an estimator for the
>> PELT value at the time of the migration. We need for this purpose to
>> inject last_update_time which is a combination of both clock_pelt and
>> lost_idle_time. The latter is a time value which is completely lost from a
>> PELT point of view and must be ignored. And finally, we can write:
>>
>>    now = last_update_time + A + B
>>        = last_update_time + sched_clock_cpu() - last_update_lag
>>
>> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
>> usage to the case where the source CPU is idle as we know this is when the
>> clock is having the biggest risk of being outdated.
>>
>> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
>>
>> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5dd38c9df0cc..e234d015657f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3694,6 +3694,57 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>>
>>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +static inline void update_cfs_rq_lag(struct cfs_rq *cfs_rq)
>> +{
>> +       struct rq *rq = rq_of(cfs_rq);
>> +
>> +       u64_u32_store(cfs_rq->last_update_lag,
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +                     /* Timer stopped by throttling */
>> +                     unlikely(cfs_rq->throttle_count) ? U64_MAX :
>> +#endif
>> +                     rq->clock - rq->clock_task + rq->clock_pelt);
> 
> I'm worried that we will call this for each and every
> update_cfs_rq_load_avg() whereas the content will be used only when
> idle and not throttled. Can't we use these conditions to save values
> only when needed and limit the number of useless updates ?

I don't think we can use idle here as a condition: once the CPU is idle, 
it is too late to get those clock values.
> 
> A quick test with hackbench on a 8 cpus system gives
> around 60k call for a duration 550 msec run a root level
> and 180k from a 3rd level cgroups

If you think this store is too costly on this path, we could instead 
either:

a. restrict the feature to when we know it matters the most: EAS?

b. store the updated value only when "decayed" is non-zero.
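
For b., that would roughly be, in update_cfs_rq_load_avg() (untested
sketch on top of this patch):

-	update_cfs_rq_lag(cfs_rq);
+	/* Only refresh the lag snapshot when something actually decayed */
+	if (decayed)
+		update_cfs_rq_lag(cfs_rq);
 
 	return decayed;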

[...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper
  2022-04-12 13:42 ` [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
@ 2022-04-19 12:42   ` Tao Zhou
  2022-04-22 11:10     ` Vincent Donnefort
  0 siblings, 1 reply; 19+ messages in thread
From: Tao Zhou @ 2022-04-19 12:42 UTC (permalink / raw)
  To: Vincent Donnefort, Tao Zhou
  Cc: peterz, mingo, vincent.guittot, linux-kernel, dietmar.eggemann,
	morten.rasmussen, chris.redpath, qperret

On Tue, Apr 12, 2022 at 02:42:14PM +0100,
Vincent Donnefort wrote:
> Introducing macro helpers u64_u32_{store,load}() to factorize lockless
> accesses to u64 variables for 32-bits architectures.
> 
> Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
> accommodate the later where the copy lies outside of the structure
> (cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy),
> use the _copy() version of those helpers.
> 
> Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
> therefore, have a small penalty in set_task_rq_fair() and init_cfs_rq().
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3eba0dcc4962..5dd38c9df0cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -600,11 +600,8 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  	}
>  
>  	/* ensure we never gain time by being placed backwards. */
> -	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
> -#ifndef CONFIG_64BIT
> -	smp_wmb();
> -	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
> -#endif
> +	u64_u32_store(cfs_rq->min_vruntime,
> +		      max_vruntime(cfs_rq->min_vruntime, vruntime));
>  }
>  
>  static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
> @@ -3301,6 +3298,11 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
>  }
>  
>  #ifdef CONFIG_SMP
> +static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> +{
> +	return u64_u32_load_copy(cfs_rq->avg.last_update_time,
> +				 cfs_rq->last_update_time_copy);
> +}
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
>   * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
> @@ -3411,27 +3413,9 @@ void set_task_rq_fair(struct sched_entity *se,
>  	if (!(se->avg.last_update_time && prev))
>  		return;
>  
> -#ifndef CONFIG_64BIT
> -	{
> -		u64 p_last_update_time_copy;
> -		u64 n_last_update_time_copy;
> -
> -		do {
> -			p_last_update_time_copy = prev->load_last_update_time_copy;
> -			n_last_update_time_copy = next->load_last_update_time_copy;
> -
> -			smp_rmb();
> +	p_last_update_time = cfs_rq_last_update_time(prev);
> +	n_last_update_time = cfs_rq_last_update_time(next);

Loading prev cfs_rq and next cfs_rq separately seems not wrong to me.

> -			p_last_update_time = prev->avg.last_update_time;
> -			n_last_update_time = next->avg.last_update_time;
> -
> -		} while (p_last_update_time != p_last_update_time_copy ||
> -			 n_last_update_time != n_last_update_time_copy);
> -	}
> -#else
> -	p_last_update_time = prev->avg.last_update_time;
> -	n_last_update_time = next->avg.last_update_time;
> -#endif
>  	__update_load_avg_blocked_se(p_last_update_time, se);
>  	se->avg.last_update_time = n_last_update_time;
>  }
> @@ -3786,8 +3770,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
>  
>  #ifndef CONFIG_64BIT
> -	smp_wmb();
> -	cfs_rq->load_last_update_time_copy = sa->last_update_time;
> +	u64_u32_store_copy(sa->last_update_time,
> +			   cfs_rq->last_update_time_copy,
> +			   sa->last_update_time);
>  #endif
>  
>  	return decayed;
> @@ -3921,27 +3906,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	}
>  }
>  
> -#ifndef CONFIG_64BIT
> -static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> -{
> -	u64 last_update_time_copy;
> -	u64 last_update_time;
> -
> -	do {
> -		last_update_time_copy = cfs_rq->load_last_update_time_copy;
> -		smp_rmb();
> -		last_update_time = cfs_rq->avg.last_update_time;
> -	} while (last_update_time != last_update_time_copy);
> -
> -	return last_update_time;
> -}
> -#else
> -static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> -{
> -	return cfs_rq->avg.last_update_time;
> -}
> -#endif
> -
>  /*
>   * Synchronize entity load avg of dequeued entity without locking
>   * the previous rq.
> @@ -6991,21 +6955,8 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  	if (READ_ONCE(p->__state) == TASK_WAKING) {
>  		struct sched_entity *se = &p->se;
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -		u64 min_vruntime;
>  
> -#ifndef CONFIG_64BIT
> -		u64 min_vruntime_copy;
> -
> -		do {
> -			min_vruntime_copy = cfs_rq->min_vruntime_copy;
> -			smp_rmb();
> -			min_vruntime = cfs_rq->min_vruntime;
> -		} while (min_vruntime != min_vruntime_copy);
> -#else
> -		min_vruntime = cfs_rq->min_vruntime;
> -#endif
> -
> -		se->vruntime -= min_vruntime;
> +		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
>  	}
>  
>  	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
> @@ -11453,10 +11404,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>  void init_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> -	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
> -#ifndef CONFIG_64BIT
> -	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
> -#endif
> +	u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
>  #ifdef CONFIG_SMP
>  	raw_spin_lock_init(&cfs_rq->removed.lock);
>  #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 762be73972bd..e2cf6e48b165 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -513,6 +513,45 @@ struct cfs_bandwidth { };
>  
>  #endif	/* CONFIG_CGROUP_SCHED */
>  
> +/*
> + * u64_u32_load/u64_u32_store
> + *
> + * Use a copy of a u64 value to protect against data race. This is only
> + * applicable for 32-bits architectures.
> + */
> +#ifdef CONFIG_64BIT
> +# define u64_u32_load_copy(var, copy)       var
> +# define u64_u32_store_copy(var, copy, val) (var = val)
> +#else
> +# define u64_u32_load_copy(var, copy)					\
> +({									\
> +	u64 __val, __val_copy;						\
> +	do {								\
> +		__val_copy = copy;					\
> +		/*							\
> +		 * paired with u64_u32_store, ordering access		\
> +		 * to var and copy.					\
> +		 */							\
> +		smp_rmb();						\
> +		__val = var;						\
> +	} while (__val != __val_copy);					\
> +	__val;								\
> +})
> +# define u64_u32_store_copy(var, copy, val)				\
> +do {									\
> +	typeof(val) __val = (val);					\
> +	var = __val;							\
> +	/*								\
> +	 * paired with u64_u32_load, ordering access to var and		\
> +	 * copy.							\
> +	 */								\
> +	smp_wmb();							\
> +	copy = __val;							\

`copy = __val;` should be `copy = var`.

If var is equal to val, we do not need to do the store. Check this
condition in the above macro to avoid a redundant store:

 if (var != __val)
   var = __val;

Catching up a little and dropping a little..

Thanks,
Tao
> +} while (0)
> +#endif
> +# define u64_u32_load(var)      u64_u32_load_copy(var, var##_copy)
> +# define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val)
> +
>  /* CFS-related fields in a runqueue */
>  struct cfs_rq {
>  	struct load_weight	load;
> @@ -553,7 +592,7 @@ struct cfs_rq {
>  	 */
>  	struct sched_avg	avg;
>  #ifndef CONFIG_64BIT
> -	u64			load_last_update_time_copy;
> +	u64			last_update_time_copy;
>  #endif
>  	struct {
>  		raw_spinlock_t	lock ____cacheline_aligned;
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-19 12:23     ` Vincent Donnefort
@ 2022-04-19 16:27       ` Vincent Guittot
  2022-04-20  9:34         ` Vincent Donnefort
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2022-04-19 16:27 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann, morten.rasmussen,
	chris.redpath, qperret

On Tuesday 19 April 2022 at 13:23:27 (+0100), Vincent Donnefort wrote:
> 
> 
> On 19/04/2022 11:08, Vincent Guittot wrote:
> > On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
> > <vincent.donnefort@arm.com> wrote:
> > > 
> > > Before being migrated to a new CPU, a task sees its PELT values
> > > synchronized with rq last_update_time. Once done, that same task will also
> > > have its sched_avg last_update_time reset. This means the time between
> > > the migration and the last clock update (B) will not be accounted for in
> > > util_avg and a discontinuity will appear. This issue is amplified by the
> > > PELT clock scaling. If the clock hasn't been updated while the CPU is
> > > idle, clock_pelt will not be aligned with clock_task and that time (A)
> > > will be also lost.
> > > 
> > >     ---------|----- A -----|-----------|------- B -----|>
> > >          clock_pelt   clock_task     clock            now
> > > 
> > > This is especially problematic for asymmetric CPU capacity systems which
> > > need stable util_avg signals for task placement and energy estimation.
> > > 
> > > Ideally, this problem would be solved by updating the runqueue clocks
> > > before the migration. But that would require taking the runqueue lock
> > > which is quite expensive [1]. Instead estimate the missing time and update
> > > the task util_avg with that value:
> > > 
> > >    A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
> > > 
> > > Neither clock_task, clock_pelt nor clock can be accessed without the
> > > runqueue lock. The new cfs_rq last_update_lag is therefore created and
> > > contains those three values when the last_update_time value for that very
> > > same cfs_rq is updated.
> > > 
> > >    last_update_lag = clock - clock_task + clock_pelt
> > > 
> > > And we can then write the missing time as follow:
> > > 
> > >    A + B = sched_clock_cpu() - last_update_lag
> > > 
> > > The B. part of the missing time is however an estimation that doesn't take
> > > into account IRQ and Paravirt time.
> > > 
> > > Now we have an estimation for A + B, we can create an estimator for the
> > > PELT value at the time of the migration. We need for this purpose to
> > > inject last_update_time which is a combination of both clock_pelt and
> > > lost_idle_time. The latter is a time value which is completely lost from a
> > > PELT point of view and must be ignored. And finally, we can write:
> > > 
> > >    now = last_update_time + A + B
> > >        = last_update_time + sched_clock_cpu() - last_update_lag
> > > 
> > > This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
> > > usage to the case where the source CPU is idle as we know this is when the
> > > clock is having the biggest risk of being outdated.
> > > 
> > > [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
> > > 
> > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

[...]

> > 
> > I'm worried that we will call this for each and every
> > update_cfs_rq_load_avg() whereas the content will be used only when
> > idle and not throttled. Can't we use these conditions to save values
> > only when needed and limit the number of useless updates ?
> 
> I don't think we can use idle here as a condition, once it is idle, it is
> too late to get those clock values.

As an example, the patch below should work. It doesn't handle the throttled case yet, and it still has
to make sure that rq->enter_idle and rq->clock_pelt_idle stay coherent with regard to the ILB updating
the blocked load.

---
 kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
 kernel/sched/pelt.h  | 21 ++++++++++++++-------
 kernel/sched/sched.h |  3 ++-
 3 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6ecf530f19f..f00843f9dd01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7005,6 +7005,35 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 
 static void detach_entity_cfs_rq(struct sched_entity *se);
 
+#ifdef CONFIG_NO_HZ_COMMON
+static inline void migrate_se_pelt_lag(struct sched_entity *se)
+{
+       u64 now;
+       struct cfs_rq *cfs_rq;
+       struct rq *rq;
+       bool is_idle;
+
+       cfs_rq = cfs_rq_of(se);
+       rq = rq_of(cfs_rq);
+
+       rcu_read_lock();
+       is_idle = is_idle_task(rcu_dereference(rq->curr));
+       rcu_read_unlock();
+
+       if (!is_idle)
+               return;
+
+	/* TODO handle throttled cfs */
+	/* TODO handle update ilb blocked load update */
+	now = READ_ONCE(rq->clock_pelt_idle);
+	now += sched_clock_cpu(cpu_of(rq)) - READ_ONCE(rq->enter_idle);
+
+       __update_load_avg_blocked_se(now, se);
+}
+#else
+static void migrate_se_pelt_lag(struct sched_entity *se) {}
+#endif
+
 /*
  * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -7056,6 +7085,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * sounds not bad.
 		 */
 		remove_entity_load_avg(&p->se);
+		migrate_se_pelt_lag(&p->se);
 	}
 
 	/* Tell new CPU we are migrated */
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index c336f5f481bc..ece4423026e5 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
 	WRITE_ONCE(avg->util_est.enqueued, enqueued);
 }
 
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+	assert_clock_updated(rq);
+
+	return rq->clock_pelt - rq->lost_idle_time;
+}
+
 /*
  * The clock_pelt scales the time to reflect the effective amount of
  * computation done during the running delta time but then sync back to
@@ -78,6 +86,8 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
 	if (unlikely(is_idle_task(rq->curr))) {
 		/* The rq is idle, we can sync to clock_task */
 		rq->clock_pelt  = rq_clock_task(rq);
+		WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
+		WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
 		return;
 	}
 
@@ -130,14 +140,11 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 	 */
 	if (util_sum >= divider)
 		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
-}
 
-static inline u64 rq_clock_pelt(struct rq *rq)
-{
-	lockdep_assert_rq_held(rq);
-	assert_clock_updated(rq);
-
-	return rq->clock_pelt - rq->lost_idle_time;
+	/* The rq is idle, we can sync with clock_task */
+	rq->clock_pelt  = rq_clock_task(rq);
+	WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
+	WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ab77b171656..108698446762 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1007,7 +1007,8 @@ struct rq {
 	u64			clock_task ____cacheline_aligned;
 	u64			clock_pelt;
 	unsigned long		lost_idle_time;
-
+	u64			clock_pelt_idle;
+	u64			enter_idle;
 	atomic_t		nr_iowait;
 
 #ifdef CONFIG_SCHED_DEBUG
-- 
2.17.1

> > 
> > A quick test with hackbench on a 8 cpus system gives
> > around 60k call for a duration 550 msec run a root level
> > and 180k from a 3rd level cgroups
> 
> If you think this is too much to have this store on this path, we could
> though either:
> 
> a. restrict the feature to when we know it is the most important: EAS?
> 
> b. store the updated value only when "decayed" is non 0.
> 
> [...]

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-19 16:27       ` Vincent Guittot
@ 2022-04-20  9:34         ` Vincent Donnefort
  2022-04-20 10:25           ` Vincent Donnefort
  2022-04-21  7:36           ` Vincent Guittot
  0 siblings, 2 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-20  9:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann, morten.rasmussen,
	chris.redpath, qperret

On 19/04/2022 17:27, Vincent Guittot wrote:
> On Tuesday 19 April 2022 at 13:23:27 (+0100), Vincent Donnefort wrote:
>>
>>
>> On 19/04/2022 11:08, Vincent Guittot wrote:
>>> On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
>>> <vincent.donnefort@arm.com> wrote:
>>>>
>>>> Before being migrated to a new CPU, a task sees its PELT values
>>>> synchronized with rq last_update_time. Once done, that same task will also
>>>> have its sched_avg last_update_time reset. This means the time between
>>>> the migration and the last clock update (B) will not be accounted for in
>>>> util_avg and a discontinuity will appear. This issue is amplified by the
>>>> PELT clock scaling. If the clock hasn't been updated while the CPU is
>>>> idle, clock_pelt will not be aligned with clock_task and that time (A)
>>>> will be also lost.
>>>>
>>>>      ---------|----- A -----|-----------|------- B -----|>
>>>>           clock_pelt   clock_task     clock            now
>>>>
>>>> This is especially problematic for asymmetric CPU capacity systems which
>>>> need stable util_avg signals for task placement and energy estimation.
>>>>
>>>> Ideally, this problem would be solved by updating the runqueue clocks
>>>> before the migration. But that would require taking the runqueue lock
>>>> which is quite expensive [1]. Instead estimate the missing time and update
>>>> the task util_avg with that value:
>>>>
>>>>     A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>>>>
>>>> Neither clock_task, clock_pelt nor clock can be accessed without the
>>>> runqueue lock. The new cfs_rq last_update_lag is therefore created and
>>>> contains those three values when the last_update_time value for that very
>>>> same cfs_rq is updated.
>>>>
>>>>     last_update_lag = clock - clock_task + clock_pelt
>>>>
>>>> And we can then write the missing time as follow:
>>>>
>>>>     A + B = sched_clock_cpu() - last_update_lag
>>>>
>>>> The B. part of the missing time is however an estimation that doesn't take
>>>> into account IRQ and Paravirt time.
>>>>
>>>> Now we have an estimation for A + B, we can create an estimator for the
>>>> PELT value at the time of the migration. We need for this purpose to
>>>> inject last_update_time which is a combination of both clock_pelt and
>>>> lost_idle_time. The latter is a time value which is completely lost from a
>>>> PELT point of view and must be ignored. And finally, we can write:
>>>>
>>>>     now = last_update_time + A + B
>>>>         = last_update_time + sched_clock_cpu() - last_update_lag
>>>>
>>>> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
>>>> usage to the case where the source CPU is idle as we know this is when the
>>>> clock is having the biggest risk of being outdated.
>>>>
>>>> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
>>>>
>>>> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> [...]
> 
>>>
>>> I'm worried that we will call this for each and every
>>> update_cfs_rq_load_avg() whereas the content will be used only when
>>> idle and not throttled. Can't we use these conditions to save values
>>> only when needed and limit the number of useless updates ?
>>
>> I don't think we can use idle here as a condition, once it is idle, it is
>> too late to get those clock values.
> 
> As an example, the patch below should work. It doesn't handle the throttled case yet and still has to
> make sure that rq->enter_idle and rq->clock_pelt_idle are coherent in regards to ILB that
> update blocked load.


I had to abandon the per-rq approach between v1 and v2, because of the 
following example:

1. task A sleeps
2. rq's clock is updated (e.g. another task runs)
3. task A is migrated

With a per-rq lag, we would miss the time delta between 1 and 2. We know 
how old the last clock update is, but what we actually want is how old 
the task's last_update_time is.

> 
> ---
>   kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
>   kernel/sched/pelt.h  | 21 ++++++++++++++-------
>   kernel/sched/sched.h |  3 ++-
>   3 files changed, 46 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e6ecf530f19f..f00843f9dd01 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7005,6 +7005,35 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   
>   static void detach_entity_cfs_rq(struct sched_entity *se);
>   
> +#ifdef CONFIG_NO_HZ_COMMON
> +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> +{
> +       u64 now;
> +       struct cfs_rq *cfs_rq;
> +       struct rq *rq;
> +       bool is_idle;
> +
> +       cfs_rq = cfs_rq_of(se);
> +       rq = rq_of(cfs_rq);
> +
> +       rcu_read_lock();
> +       is_idle = is_idle_task(rcu_dereference(rq->curr));
> +       rcu_read_unlock();
> +
> +       if (!is_idle)
> +               return;
> +
> +	/* TODO handle throttled cfs */
> +	/* TODO handle update ilb blocked load update */
> +	now = READ_ONCE(rq->clock_pelt_idle);
> +	now += sched_clock_cpu(cpu_of(rq)) - READ_ONCE(rq->enter_idle);
> +
> +       __update_load_avg_blocked_se(now, se);
> +}
> +#else
> +static void migrate_se_pelt_lag(struct sched_entity *se) {}
> +#endif
> +
>   /*
>    * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
>    * cfs_rq_of(p) references at time of call are still valid and identify the
> @@ -7056,6 +7085,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>   		 * sounds not bad.
>   		 */
>   		remove_entity_load_avg(&p->se);
> +		migrate_se_pelt_lag(&p->se);
>   	}
>   
>   	/* Tell new CPU we are migrated */
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index c336f5f481bc..ece4423026e5 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
>   	WRITE_ONCE(avg->util_est.enqueued, enqueued);
>   }
>   
> +static inline u64 rq_clock_pelt(struct rq *rq)
> +{
> +	lockdep_assert_rq_held(rq);
> +	assert_clock_updated(rq);
> +
> +	return rq->clock_pelt - rq->lost_idle_time;
> +}
> +
>   /*
>    * The clock_pelt scales the time to reflect the effective amount of
>    * computation done during the running delta time but then sync back to
> @@ -78,6 +86,8 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
>   	if (unlikely(is_idle_task(rq->curr))) {
>   		/* The rq is idle, we can sync to clock_task */
>   		rq->clock_pelt  = rq_clock_task(rq);
> +		WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
> +		WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
>   		return;
>   	}
>   
> @@ -130,14 +140,11 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
>   	 */
>   	if (util_sum >= divider)
>   		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> -}
>   
> -static inline u64 rq_clock_pelt(struct rq *rq)
> -{
> -	lockdep_assert_rq_held(rq);
> -	assert_clock_updated(rq);
> -
> -	return rq->clock_pelt - rq->lost_idle_time;
> +	/* The rq is idle, we can sync with clock_task */
> +	rq->clock_pelt  = rq_clock_task(rq);
> +	WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
> +	WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
>   }
>   
>   #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6ab77b171656..108698446762 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1007,7 +1007,8 @@ struct rq {
>   	u64			clock_task ____cacheline_aligned;
>   	u64			clock_pelt;
>   	unsigned long		lost_idle_time;
> -
> +	u64			clock_pelt_idle;
> +	u64			enter_idle;
>   	atomic_t		nr_iowait;
>   
>   #ifdef CONFIG_SCHED_DEBUG

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-20  9:34         ` Vincent Donnefort
@ 2022-04-20 10:25           ` Vincent Donnefort
  2022-04-21  7:36           ` Vincent Guittot
  1 sibling, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-20 10:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann, morten.rasmussen,
	chris.redpath, qperret



On 20/04/2022 10:34, Vincent Donnefort wrote:
> On 19/04/2022 17:27, Vincent Guittot wrote:
>> On Tuesday 19 April 2022 at 13:23:27 (+0100), Vincent Donnefort wrote:
>>>
>>>
>>> On 19/04/2022 11:08, Vincent Guittot wrote:
>>>> On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
>>>> <vincent.donnefort@arm.com> wrote:
>>>>>
>>>>> Before being migrated to a new CPU, a task sees its PELT values
>>>>> synchronized with rq last_update_time. Once done, that same task 
>>>>> will also
>>>>> have its sched_avg last_update_time reset. This means the time between
>>>>> the migration and the last clock update (B) will not be accounted 
>>>>> for in
>>>>> util_avg and a discontinuity will appear. This issue is amplified 
>>>>> by the
>>>>> PELT clock scaling. If the clock hasn't been updated while the CPU is
>>>>> idle, clock_pelt will not be aligned with clock_task and that time (A)
>>>>> will be also lost.
>>>>>
>>>>>      ---------|----- A -----|-----------|------- B -----|>
>>>>>           clock_pelt   clock_task     clock            now
>>>>>
>>>>> This is especially problematic for asymmetric CPU capacity systems 
>>>>> which
>>>>> need stable util_avg signals for task placement and energy estimation.
>>>>>
>>>>> Ideally, this problem would be solved by updating the runqueue clocks
>>>>> before the migration. But that would require taking the runqueue lock
>>>>> which is quite expensive [1]. Instead estimate the missing time and 
>>>>> update
>>>>> the task util_avg with that value:
>>>>>
>>>>>     A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>>>>>
>>>>> Neither clock_task, clock_pelt nor clock can be accessed without the
>>>>> runqueue lock. The new cfs_rq last_update_lag is therefore created and
>>>>> contains those three values when the last_update_time value for 
>>>>> that very
>>>>> same cfs_rq is updated.
>>>>>
>>>>>     last_update_lag = clock - clock_task + clock_pelt
>>>>>
>>>>> And we can then write the missing time as follow:
>>>>>
>>>>>     A + B = sched_clock_cpu() - last_update_lag
>>>>>
>>>>> The B. part of the missing time is however an estimation that 
>>>>> doesn't take
>>>>> into account IRQ and Paravirt time.
>>>>>
>>>>> Now we have an estimation for A + B, we can create an estimator for 
>>>>> the
>>>>> PELT value at the time of the migration. We need for this purpose to
>>>>> inject last_update_time which is a combination of both clock_pelt and
>>>>> lost_idle_time. The latter is a time value which is completely lost 
>>>>> from a
>>>>> PELT point of view and must be ignored. And finally, we can write:
>>>>>
>>>>>     now = last_update_time + A + B
>>>>>         = last_update_time + sched_clock_cpu() - last_update_lag
>>>>>
>>>>> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
>>>>> usage to the case where the source CPU is idle as we know this is 
>>>>> when the
>>>>> clock is having the biggest risk of being outdated.
>>>>>
>>>>> [1] 
>>>>> https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/ 
>>>>>
>>>>>
>>>>> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>>
>> [...]
>>
>>>>
>>>> I'm worried that we will call this for each and every
>>>> update_cfs_rq_load_avg() whereas the content will be used only when
>>>> idle and not throttled. Can't we use these conditions to save values
>>>> only when needed and limit the number of useless updates ?
>>>
>>> I don't think we can use idle here as a condition, once it is idle, 
>>> it is
>>> too late to get those clock values.
>>
>> As an example, the patch below should work. It doesn't handle the 
>> throttled case yet and still has to
>> make sure that rq->enter_idle and rq->clock_pelt_idle are coherent in 
>> regards to ILB that
>> update blocked load.
> 
> 
> I had to abandon the per-rq approach from v1 to v2. This is because of 
> the following example:
> 
> 1. task A sleep
> 2. rq's clock updated (e.g another task runs)
> 3. task A migrated
> 
> With a per-rq lag, we would miss the time delta between 1 and 2. We know 
> how old is the last clock update. But what we actually want is how old 
> is the task's last_update_time.


However, to go back to your idea of only updating when going idle: we 
could do the update only when the task is dequeued. We know that a 
wake-up migration can only happen after that dequeue.

Something like:

   static void
   dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

...

           update_load_avg(cfs_rq, se, UPDATE_TG);
+         if (entity_is_task(se))
+                  update_cfs_rq_lag(cfs_rq);
           se_update_runnable(se);


That would drastically reduce the number of times update_cfs_rq_lag() 
would be called, skipping the update during enqueue and tick.

Additionally, we could also check for nr_running in cfs_rq, to only 
update when cfs_rq is 'idle'?
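
i.e., roughly, as an untested sketch on top of the snippet above:

           update_load_avg(cfs_rq, se, UPDATE_TG);
+          /* Last entity being dequeued: the cfs_rq is about to be idle */
+          if (entity_is_task(se) && cfs_rq->nr_running == 1)
+                  update_cfs_rq_lag(cfs_rq);
           se_update_runnable(se);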


Thoughts?


> 
>>
>> ---
>>   kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
>>   kernel/sched/pelt.h  | 21 ++++++++++++++-------
>>   kernel/sched/sched.h |  3 ++-
>>   3 files changed, 46 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e6ecf530f19f..f00843f9dd01 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7005,6 +7005,35 @@ select_task_rq_fair(struct task_struct *p, int 
>> prev_cpu, int wake_flags)
>>   static void detach_entity_cfs_rq(struct sched_entity *se);
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +static inline void migrate_se_pelt_lag(struct sched_entity *se)
>> +{
>> +       u64 now;
>> +       struct cfs_rq *cfs_rq;
>> +       struct rq *rq;
>> +       bool is_idle;
>> +
>> +       cfs_rq = cfs_rq_of(se);
>> +       rq = rq_of(cfs_rq);
>> +
>> +       rcu_read_lock();
>> +       is_idle = is_idle_task(rcu_dereference(rq->curr));
>> +       rcu_read_unlock();
>> +
>> +       if (!is_idle)
>> +               return;
>> +
>> +    /* TODO handle throttled cfs */
>> +    /* TODO handle update ilb blocked load update */
>> +    now = READ_ONCE(rq->clock_pelt_idle);
>> +    now += sched_clock_cpu(cpu_of(rq)) - READ_ONCE(rq->enter_idle);
>> +
>> +       __update_load_avg_blocked_se(now, se);
>> +}
>> +#else
>> +static void migrate_se_pelt_lag(struct sched_entity *se) {}
>> +#endif
>> +
>>   /*
>>    * Called immediately before a task is migrated to a new CPU; 
>> task_cpu(p) and
>>    * cfs_rq_of(p) references at time of call are still valid and 
>> identify the
>> @@ -7056,6 +7085,7 @@ static void migrate_task_rq_fair(struct 
>> task_struct *p, int new_cpu)
>>            * sounds not bad.
>>            */
>>           remove_entity_load_avg(&p->se);
>> +        migrate_se_pelt_lag(&p->se);
>>       }
>>       /* Tell new CPU we are migrated */
>> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
>> index c336f5f481bc..ece4423026e5 100644
>> --- a/kernel/sched/pelt.h
>> +++ b/kernel/sched/pelt.h
>> @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct 
>> sched_avg *avg)
>>       WRITE_ONCE(avg->util_est.enqueued, enqueued);
>>   }
>> +static inline u64 rq_clock_pelt(struct rq *rq)
>> +{
>> +    lockdep_assert_rq_held(rq);
>> +    assert_clock_updated(rq);
>> +
>> +    return rq->clock_pelt - rq->lost_idle_time;
>> +}
>> +
>>   /*
>>    * The clock_pelt scales the time to reflect the effective amount of
>>    * computation done during the running delta time but then sync back to
>> @@ -78,6 +86,8 @@ static inline void update_rq_clock_pelt(struct rq 
>> *rq, s64 delta)
>>       if (unlikely(is_idle_task(rq->curr))) {
>>           /* The rq is idle, we can sync to clock_task */
>>           rq->clock_pelt  = rq_clock_task(rq);
>> +        WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be 
>> factorized with idle_stamp */
>> +        WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last 
>> pelt clock update when idle */
>>           return;
>>       }
>> @@ -130,14 +140,11 @@ static inline void 
>> update_idle_rq_clock_pelt(struct rq *rq)
>>        */
>>       if (util_sum >= divider)
>>           rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
>> -}
>> -static inline u64 rq_clock_pelt(struct rq *rq)
>> -{
>> -    lockdep_assert_rq_held(rq);
>> -    assert_clock_updated(rq);
>> -
>> -    return rq->clock_pelt - rq->lost_idle_time;
>> +    /* The rq is idle, we can sync with clock_task */
>> +    rq->clock_pelt  = rq_clock_task(rq);
>> +    WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be 
>> factorized with idle_stamp */
>> +    WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt 
>> clock update when idle */
>>   }
>>   #ifdef CONFIG_CFS_BANDWIDTH
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6ab77b171656..108698446762 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1007,7 +1007,8 @@ struct rq {
>>       u64            clock_task ____cacheline_aligned;
>>       u64            clock_pelt;
>>       unsigned long        lost_idle_time;
>> -
>> +    u64            clock_pelt_idle;
>> +    u64            enter_idle;
>>       atomic_t        nr_iowait;
>>   #ifdef CONFIG_SCHED_DEBUG

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-20  9:34         ` Vincent Donnefort
  2022-04-20 10:25           ` Vincent Donnefort
@ 2022-04-21  7:36           ` Vincent Guittot
  1 sibling, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2022-04-21  7:36 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann, Morten.Rasmussen,
	Chris.Redpath, qperret

On Wed, 20 Apr 2022 at 11:34, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> On 19/04/2022 17:27, Vincent Guittot wrote:
> > On Tuesday 19 April 2022 at 13:23:27 (+0100), Vincent Donnefort wrote:
> >>
> >>
> >> On 19/04/2022 11:08, Vincent Guittot wrote:
> >>> On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
> >>> <vincent.donnefort@arm.com> wrote:
> >>>>
> >>>> Before being migrated to a new CPU, a task sees its PELT values
> >>>> synchronized with rq last_update_time. Once done, that same task will also
> >>>> have its sched_avg last_update_time reset. This means the time between
> >>>> the migration and the last clock update (B) will not be accounted for in
> >>>> util_avg and a discontinuity will appear. This issue is amplified by the
> >>>> PELT clock scaling. If the clock hasn't been updated while the CPU is
> >>>> idle, clock_pelt will not be aligned with clock_task and that time (A)
> >>>> will be also lost.
> >>>>
> >>>>      ---------|----- A -----|-----------|------- B -----|>
> >>>>           clock_pelt   clock_task     clock            now
> >>>>
> >>>> This is especially problematic for asymmetric CPU capacity systems which
> >>>> need stable util_avg signals for task placement and energy estimation.
> >>>>
> >>>> Ideally, this problem would be solved by updating the runqueue clocks
> >>>> before the migration. But that would require taking the runqueue lock
> >>>> which is quite expensive [1]. Instead estimate the missing time and update
> >>>> the task util_avg with that value:
> >>>>
> >>>>     A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
> >>>>
> >>>> Neither clock_task, clock_pelt nor clock can be accessed without the
> >>>> runqueue lock. The new cfs_rq last_update_lag is therefore created and
> >>>> contains those three values when the last_update_time value for that very
> >>>> same cfs_rq is updated.
> >>>>
> >>>>     last_update_lag = clock - clock_task + clock_pelt
> >>>>
> >>>> And we can then write the missing time as follow:
> >>>>
> >>>>     A + B = sched_clock_cpu() - last_update_lag
> >>>>
> >>>> The B. part of the missing time is however an estimation that doesn't take
> >>>> into account IRQ and Paravirt time.
> >>>>
> >>>> Now we have an estimation for A + B, we can create an estimator for the
> >>>> PELT value at the time of the migration. We need for this purpose to
> >>>> inject last_update_time which is a combination of both clock_pelt and
> >>>> lost_idle_time. The latter is a time value which is completely lost from a
> >>>> PELT point of view and must be ignored. And finally, we can write:
> >>>>
> >>>>     now = last_update_time + A + B
> >>>>         = last_update_time + sched_clock_cpu() - last_update_lag
> >>>>
> >>>> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
> >>>> usage to the case where the source CPU is idle as we know this is when the
> >>>> clock is having the biggest risk of being outdated.
> >>>>
> >>>> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
> >>>>
> >>>> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> >
> > [...]
> >
> >>>
> >>> I'm worried that we will call this for each and every
> >>> update_cfs_rq_load_avg() whereas the content will be used only when
> >>> idle and not throttled. Can't we use these conditions to save values
> >>> only when needed and limit the number of useless updates ?
> >>
> >> I don't think we can use idle here as a condition, once it is idle, it is
> >> too late to get those clock values.
> >
> > As an example, the patch below should work. It doesn't handle the throttled case yet and still has to
> > make sure that rq->enter_idle and rq->clock_pelt_idle are coherent in regards to ILB that
> > update blocked load.
>
>
> I had to abandon the per-rq approach from v1 to v2. This is because of
> the following example:
>
> 1. task A sleep
> 2. rq's clock updated (e.g another task runs)
> 3. task A migrated
>
> With a per-rq lag, we would miss the time delta between 1 and 2. We know
> how old is the last clock update. But what we actually want is how old
> is the task's last_update_time.

I don't get your point here. Even after 2, won't task A be correctly
updated with the patch below? As I mentioned, this doesn't take into
account the case where cfs can be throttled (i.e. CFS_BANDWIDTH).

But applying a similar scheme should work.

>
> >
> > ---
> >   kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
> >   kernel/sched/pelt.h  | 21 ++++++++++++++-------
> >   kernel/sched/sched.h |  3 ++-
> >   3 files changed, 46 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e6ecf530f19f..f00843f9dd01 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7005,6 +7005,35 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >
> >   static void detach_entity_cfs_rq(struct sched_entity *se);
> >
> > +#ifdef CONFIG_NO_HZ_COMMON
> > +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> > +{
> > +       u64 now;
> > +       struct cfs_rq *cfs_rq;
> > +       struct rq *rq;
> > +       bool is_idle;
> > +
> > +       cfs_rq = cfs_rq_of(se);
> > +       rq = rq_of(cfs_rq);
> > +
> > +       rcu_read_lock();
> > +       is_idle = is_idle_task(rcu_dereference(rq->curr));
> > +       rcu_read_unlock();
> > +
> > +       if (!is_idle)
> > +               return;
> > +
> > +     /* TODO handle throttled cfs */
> > +     /* TODO handle update ilb blocked load update */
> > +     now = READ_ONCE(rq->clock_pelt_idle);
> > +     now += sched_clock_cpu(cpu_of(rq)) - READ_ONCE(rq->enter_idle);
> > +
> > +       __update_load_avg_blocked_se(now, se);
> > +}
> > +#else
> > +static void migrate_se_pelt_lag(struct sched_entity *se) {}
> > +#endif
> > +
> >   /*
> >    * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
> >    * cfs_rq_of(p) references at time of call are still valid and identify the
> > @@ -7056,6 +7085,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> >                * sounds not bad.
> >                */
> >               remove_entity_load_avg(&p->se);
> > +             migrate_se_pelt_lag(&p->se);
> >       }
> >
> >       /* Tell new CPU we are migrated */
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index c336f5f481bc..ece4423026e5 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
> >       WRITE_ONCE(avg->util_est.enqueued, enqueued);
> >   }
> >
> > +static inline u64 rq_clock_pelt(struct rq *rq)
> > +{
> > +     lockdep_assert_rq_held(rq);
> > +     assert_clock_updated(rq);
> > +
> > +     return rq->clock_pelt - rq->lost_idle_time;
> > +}
> > +
> >   /*
> >    * The clock_pelt scales the time to reflect the effective amount of
> >    * computation done during the running delta time but then sync back to
> > @@ -78,6 +86,8 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> >       if (unlikely(is_idle_task(rq->curr))) {
> >               /* The rq is idle, we can sync to clock_task */
> >               rq->clock_pelt  = rq_clock_task(rq);
> > +             WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
> > +             WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
> >               return;
> >       }
> >
> > @@ -130,14 +140,11 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> >        */
> >       if (util_sum >= divider)
> >               rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> > -}
> >
> > -static inline u64 rq_clock_pelt(struct rq *rq)
> > -{
> > -     lockdep_assert_rq_held(rq);
> > -     assert_clock_updated(rq);
> > -
> > -     return rq->clock_pelt - rq->lost_idle_time;
> > +     /* The rq is idle, we can sync with clock_task */
> > +     rq->clock_pelt  = rq_clock_task(rq);
> > +     WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
> > +     WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
> >   }
> >
> >   #ifdef CONFIG_CFS_BANDWIDTH
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 6ab77b171656..108698446762 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1007,7 +1007,8 @@ struct rq {
> >       u64                     clock_task ____cacheline_aligned;
> >       u64                     clock_pelt;
> >       unsigned long           lost_idle_time;
> > -
> > +     u64                     clock_pelt_idle;
> > +     u64                     enter_idle;
> >       atomic_t                nr_iowait;
> >
> >   #ifdef CONFIG_SCHED_DEBUG
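
As a quick sanity check of the arithmetic above, the clock_pelt_idle /
enter_idle estimate can be modelled in a few lines of throwaway userspace C.
All clock values below are invented and only serve to illustrate how the two
snapshots combine at migration time:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical snapshots taken when the rq went idle */
	uint64_t clock_pelt_idle = 950;		/* rq->clock_pelt_idle */
	uint64_t enter_idle      = 1000;	/* rq->enter_idle      */

	/* hypothetical current time on that CPU at migration */
	uint64_t sched_clock = 1100;

	/* now = idle PELT clock + wall time spent idle since the snapshot */
	uint64_t now = clock_pelt_idle + (sched_clock - enter_idle);

	printf("estimated PELT now = %llu\n", (unsigned long long)now);
	return 0;
}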

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper
  2022-04-19 12:42   ` Tao Zhou
@ 2022-04-22 11:10     ` Vincent Donnefort
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2022-04-22 11:10 UTC (permalink / raw)
  To: Tao Zhou
  Cc: peterz, mingo, vincent.guittot, linux-kernel, dietmar.eggemann,
	morten.rasmussen, chris.redpath, qperret

[...]

>> +#ifdef CONFIG_64BIT
>> +# define u64_u32_load_copy(var, copy)       var
>> +# define u64_u32_store_copy(var, copy, val) (var = val)
>> +#else
>> +# define u64_u32_load_copy(var, copy)					\
>> +({									\
>> +	u64 __val, __val_copy;						\
>> +	do {								\
>> +		__val_copy = copy;					\
>> +		/*							\
>> +		 * paired with u64_u32_store, ordering access		\
>> +		 * to var and copy.					\
>> +		 */							\
>> +		smp_rmb();						\
>> +		__val = var;						\
>> +	} while (__val != __val_copy);					\
>> +	__val;								\
>> +})
>> +# define u64_u32_store_copy(var, copy, val)				\
>> +do {									\
>> +	typeof(val) __val = (val);					\
>> +	var = __val;							\
>> +	/*								\
>> +	 * paired with u64_u32_load, ordering access to var and		\
>> +	 * copy.							\
>> +	 */								\
>> +	smp_wmb();							\
>> +	copy = __val;							\
> 
> `copy = __val;` should be `copy = var`.
> 
> If var is equal to val, we do not need to do the store. Check this
> condition in the above macro to avoid a redundant store.
> 
>   if (var != __val)
>     var = __val;

Judging by the users of this macro, var == val is very unlikely to
happen. Also, I don't think we want to waste an if here.
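
For what it's worth, the intent of the pair is easier to see in a small
userspace sketch of the same pattern. Relaxed C11 atomics plus fences stand
in for the kernel's plain accesses and smp_wmb()/smp_rmb(), and the struct
and function names below are made up purely for the illustration:

#include <stdatomic.h>
#include <stdint.h>

struct sample {
	_Atomic uint64_t val;		/* "var":  the 64-bit value itself    */
	_Atomic uint64_t val_copy;	/* "copy": written after, read before */
};

/* Writer: update val first, then publish the copy. */
static void sample_store(struct sample *s, uint64_t v)
{
	atomic_store_explicit(&s->val, v, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&s->val_copy, v, memory_order_relaxed);
}

/* Reader: retry until both reads come from the same update. */
static uint64_t sample_load(struct sample *s)
{
	uint64_t v, c;

	do {
		c = atomic_load_explicit(&s->val_copy, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
		v = atomic_load_explicit(&s->val, memory_order_relaxed);
	} while (v != c);

	return v;
}

int main(void)
{
	struct sample s = { 0, 0 };

	sample_store(&s, 42);
	return sample_load(&s) == 42 ? 0 : 1;
}

The retry loop simply rejects any read where var and copy disagree, which is
how the 32-bit side detects that it raced with a writer.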

[...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
  2022-04-12 13:42 ` [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration Vincent Donnefort
  2022-04-19 10:08   ` Vincent Guittot
@ 2022-04-22 15:14   ` Tao Zhou
  1 sibling, 0 replies; 19+ messages in thread
From: Tao Zhou @ 2022-04-22 15:14 UTC (permalink / raw)
  To: Vincent Donnefort, Tao Zhou
  Cc: peterz, mingo, vincent.guittot, linux-kernel, dietmar.eggemann,
	morten.rasmussen, chris.redpath, qperret

On Tue, Apr 12, 2022 at 02:42:15PM +0100, Vincent Donnefort wrote:
> Before being migrated to a new CPU, a task sees its PELT values
> synchronized with rq last_update_time. Once done, that same task will also
> have its sched_avg last_update_time reset. This means the time between
> the migration and the last clock update (B) will not be accounted for in
> util_avg and a discontinuity will appear. This issue is amplified by the
> PELT clock scaling. If the clock hasn't been updated while the CPU is
> idle, clock_pelt will not be aligned with clock_task and that time (A)
> will also be lost.
> 
>    ---------|----- A -----|-----------|------- B -----|>
>         clock_pelt   clock_task     clock            now
> 
> This is especially problematic for asymmetric CPU capacity systems which
> need stable util_avg signals for task placement and energy estimation.
> 
> Ideally, this problem would be solved by updating the runqueue clocks
> before the migration. But that would require taking the runqueue lock
> which is quite expensive [1]. Instead estimate the missing time and update
> the task util_avg with that value:
> 
>   A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
> 
> Neither clock_task, clock_pelt nor clock can be accessed without the
> runqueue lock. The new cfs_rq last_update_lag is therefore created and
> contains those three values when the last_update_time value for that very
> same cfs_rq is updated.
> 
>   last_update_lag = clock - clock_task + clock_pelt
> 
> And we can then write the missing time as follows:
> 
>   A + B = sched_clock_cpu() - last_update_lag
> 
> The B part of the missing time is, however, an estimation that doesn't take
> into account IRQ and Paravirt time.
> 
> Now that we have an estimation for A + B, we can create an estimator for the
> PELT value at the time of the migration. For this purpose, we need to
> inject last_update_time, which is a combination of both clock_pelt and
> lost_idle_time. The latter is a time value which is completely lost from a
> PELT point of view and must be ignored. And finally, we can write:
> 
>   now = last_update_time + A + B
>       = last_update_time + sched_clock_cpu() - last_update_lag
> 
> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
> usage to the case where the source CPU is idle, as we know this is when the
> clock has the biggest risk of being outdated.
> 
> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> 
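
To make the arithmetic of the changelog concrete, here is a throwaway
userspace model. All clock values are invented and bear no relation to a
real rq; it only checks that the formulas above compose as described:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* snapshot recorded in last_update_lag at the last PELT update */
	uint64_t clock      = 1000;	/* rq->clock      */
	uint64_t clock_task =  980;	/* rq->clock_task */
	uint64_t clock_pelt =  950;	/* rq->clock_pelt */
	uint64_t last_update_time = 950;

	uint64_t last_update_lag = clock - clock_task + clock_pelt;	/* 970 */

	/* later, when the task is migrated away from the (idle) CPU */
	uint64_t sched_clock_cpu = 1100;

	/* A = clock_task - clock_pelt = 30, B = sched_clock_cpu - clock = 100 */
	uint64_t now = last_update_time - last_update_lag + sched_clock_cpu;

	printf("now = %llu, expected %llu\n",
	       (unsigned long long)now,
	       (unsigned long long)(last_update_time + 30 + 100));
	return 0;
}

Both printed values are 1080, i.e. now = last_update_time + A + B with
A = 30 and B = 100.
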
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5dd38c9df0cc..e234d015657f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3694,6 +3694,57 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>  
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> +static inline void update_cfs_rq_lag(struct cfs_rq *cfs_rq)
> +{
> +	struct rq *rq = rq_of(cfs_rq);
> +
> +	u64_u32_store(cfs_rq->last_update_lag,
> +#ifdef CONFIG_CFS_BANDWIDTH
> +		      /* Timer stopped by throttling */
> +		      unlikely(cfs_rq->throttle_count) ? U64_MAX :
> +#endif
> +		      rq->clock - rq->clock_task + rq->clock_pelt);
> +}
> +
> +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> +{
> +	u64 now, last_update_lag;
> +	struct cfs_rq *cfs_rq;
> +	struct rq *rq;
> +	bool is_idle;
> +
> +	cfs_rq = cfs_rq_of(se);
> +	rq = rq_of(cfs_rq);
> +
> +	rcu_read_lock();
> +	is_idle = is_idle_task(rcu_dereference(rq->curr));
> +	rcu_read_unlock();
> +
> +	/*
> +	 * The lag estimation comes with a cost we don't want to pay all the
> +	 * time. Hence, limiting to the case where the source CPU is idle and
> +	 * we know we are at the greatest risk of having an outdated clock.
> +	 */
> +	if (!is_idle)
> +		return;
> +
> +	last_update_lag = u64_u32_load(cfs_rq->last_update_lag);

If the CPU is idle, clock_pelt is equal to clock_task, so the A part
disappears.

> +	/* The clock has been stopped for throttling */
> +	if (last_update_lag == U64_MAX)
> +		return;
> +
> +	now = se->avg.last_update_time - last_update_lag +
> +	      sched_clock_cpu(cpu_of(rq));
> +
> +	__update_load_avg_blocked_se(now, se);
> +}
> +#else
> +static void update_cfs_rq_lag(struct cfs_rq *cfs_rq) {}
> +static void migrate_se_pelt_lag(struct sched_entity *se) {}
> +#endif
> +
>  /**
>   * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
>   * @now: current time, as per cfs_rq_clock_pelt()
> @@ -3774,6 +3825,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  			   cfs_rq->last_update_time_copy,
>  			   sa->last_update_time);
>  #endif
> +	update_cfs_rq_lag(cfs_rq);
>  
>  	return decayed;
>  }
> @@ -6946,6 +6998,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
>   */
>  static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  {
> +	struct sched_entity *se = &p->se;
> +
>  	/*
>  	 * As blocked tasks retain absolute vruntime the migration needs to
>  	 * deal with this by subtracting the old and adding the new
> @@ -6953,7 +7007,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  	 * the task on the new runqueue.
>  	 */
>  	if (READ_ONCE(p->__state) == TASK_WAKING) {
> -		struct sched_entity *se = &p->se;
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
>  		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> @@ -6965,25 +7018,28 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  		 * rq->lock and can modify state directly.
>  		 */
>  		lockdep_assert_rq_held(task_rq(p));
> -		detach_entity_cfs_rq(&p->se);
> +		detach_entity_cfs_rq(se);
>  
>  	} else {
> +		remove_entity_load_avg(se);
> +
>  		/*
> -		 * We are supposed to update the task to "current" time, then
> -		 * its up to date and ready to go to new CPU/cfs_rq. But we
> -		 * have difficulty in getting what current time is, so simply
> -		 * throw away the out-of-date time. This will result in the
> -		 * wakee task is less decayed, but giving the wakee more load
> -		 * sounds not bad.
> +		 * Here, the task's PELT values have been updated according to
> +		 * the current rq's clock. But if that clock hasn't been
> +		 * updated in a while, a substantial idle time will be missed,
> +		 * leading to an inflation after wake-up on the new rq.
> +		 *
> +		 * Estimate the missing time from the rq clock and update
> +		 * sched_avg to improve the PELT continuity after migration.
>  		 */
> -		remove_entity_load_avg(&p->se);
> +		migrate_se_pelt_lag(se);
>  	}
>  
>  	/* Tell new CPU we are migrated */
> -	p->se.avg.last_update_time = 0;
> +	se->avg.last_update_time = 0;
>  
>  	/* We have migrated, no longer consider this task hot */
> -	p->se.exec_start = 0;
> +	se->exec_start = 0;
>  
>  	update_scan_period(p, new_cpu);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e2cf6e48b165..2f6446295e7d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -593,6 +593,12 @@ struct cfs_rq {
>  	struct sched_avg	avg;
>  #ifndef CONFIG_64BIT
>  	u64			last_update_time_copy;
> +#endif
> +#ifdef CONFIG_NO_HZ_COMMON
> +	u64			last_update_lag;
> +#ifndef CONFIG_64BIT
> +	u64                     last_update_lag_copy;
> +#endif
>  #endif
>  	struct {
>  		raw_spinlock_t	lock ____cacheline_aligned;
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: Remove the energy margin in feec()
@ 2022-04-15 11:58 kernel test robot
  0 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2022-04-15 11:58 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 17678 bytes --]

CC: kbuild-all(a)lists.01.org
BCC: lkp(a)intel.com
In-Reply-To: <20220412134220.1588482-8-vincent.donnefort@arm.com>
References: <20220412134220.1588482-8-vincent.donnefort@arm.com>
TO: Vincent Donnefort <vincent.donnefort@arm.com>

Hi Vincent,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on rafael-pm/linux-next rafael-pm/thermal v5.18-rc2 next-20220414]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/feec-energy-margin-removal/20220412-214441
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 089c02ae2771a14af2928c59c56abfb9b885a8d7
:::::: branch date: 3 days ago
:::::: commit date: 3 days ago
config: i386-randconfig-m021 (https://download.01.org/0day-ci/archive/20220415/202204151914.NKAe2ef1-lkp(a)intel.com/config)
compiler: gcc-11 (Debian 11.2.0-19) 11.2.0

If you fix the issue, kindly add the following tags as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
kernel/sched/fair.c:6975 find_energy_efficient_cpu() error: uninitialized symbol 'best_energy_cpu'.

vim +/best_energy_cpu +6975 kernel/sched/fair.c

390031e4c309c9 Quentin Perret            2018-12-03  6814  
732cd75b8c920d Quentin Perret            2018-12-03  6815  /*
732cd75b8c920d Quentin Perret            2018-12-03  6816   * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
732cd75b8c920d Quentin Perret            2018-12-03  6817   * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
732cd75b8c920d Quentin Perret            2018-12-03  6818   * spare capacity in each performance domain and uses it as a potential
732cd75b8c920d Quentin Perret            2018-12-03  6819   * candidate to execute the task. Then, it uses the Energy Model to figure
732cd75b8c920d Quentin Perret            2018-12-03  6820   * out which of the CPU candidates is the most energy-efficient.
732cd75b8c920d Quentin Perret            2018-12-03  6821   *
732cd75b8c920d Quentin Perret            2018-12-03  6822   * The rationale for this heuristic is as follows. In a performance domain,
732cd75b8c920d Quentin Perret            2018-12-03  6823   * all the most energy efficient CPU candidates (according to the Energy
732cd75b8c920d Quentin Perret            2018-12-03  6824   * Model) are those for which we'll request a low frequency. When there are
732cd75b8c920d Quentin Perret            2018-12-03  6825   * several CPUs for which the frequency request will be the same, we don't
732cd75b8c920d Quentin Perret            2018-12-03  6826   * have enough data to break the tie between them, because the Energy Model
732cd75b8c920d Quentin Perret            2018-12-03  6827   * only includes active power costs. With this model, if we assume that
732cd75b8c920d Quentin Perret            2018-12-03  6828   * frequency requests follow utilization (e.g. using schedutil), the CPU with
732cd75b8c920d Quentin Perret            2018-12-03  6829   * the maximum spare capacity in a performance domain is guaranteed to be among
732cd75b8c920d Quentin Perret            2018-12-03  6830   * the best candidates of the performance domain.
732cd75b8c920d Quentin Perret            2018-12-03  6831   *
732cd75b8c920d Quentin Perret            2018-12-03  6832   * In practice, it could be preferable from an energy standpoint to pack
732cd75b8c920d Quentin Perret            2018-12-03  6833   * small tasks on a CPU in order to let other CPUs go in deeper idle states,
732cd75b8c920d Quentin Perret            2018-12-03  6834   * but that could also hurt our chances to go cluster idle, and we have no
732cd75b8c920d Quentin Perret            2018-12-03  6835   * ways to tell with the current Energy Model if this is actually a good
732cd75b8c920d Quentin Perret            2018-12-03  6836   * idea or not. So, find_energy_efficient_cpu() basically favors
732cd75b8c920d Quentin Perret            2018-12-03  6837   * cluster-packing, and spreading inside a cluster. That should at least be
732cd75b8c920d Quentin Perret            2018-12-03  6838   * a good thing for latency, and this is consistent with the idea that most
732cd75b8c920d Quentin Perret            2018-12-03  6839   * of the energy savings of EAS come from the asymmetry of the system, and
732cd75b8c920d Quentin Perret            2018-12-03  6840   * not so much from breaking the tie between identical CPUs. That's also the
732cd75b8c920d Quentin Perret            2018-12-03  6841   * reason why EAS is enabled in the topology code only for systems where
732cd75b8c920d Quentin Perret            2018-12-03  6842   * SD_ASYM_CPUCAPACITY is set.
732cd75b8c920d Quentin Perret            2018-12-03  6843   *
732cd75b8c920d Quentin Perret            2018-12-03  6844   * NOTE: Forkees are not accepted in the energy-aware wake-up path because
732cd75b8c920d Quentin Perret            2018-12-03  6845   * they don't have any useful utilization data yet and it's not possible to
732cd75b8c920d Quentin Perret            2018-12-03  6846   * forecast their impact on energy consumption. Consequently, they will be
732cd75b8c920d Quentin Perret            2018-12-03  6847   * placed by find_idlest_cpu() on the least loaded CPU, which might turn out
732cd75b8c920d Quentin Perret            2018-12-03  6848   * to be energy-inefficient in some use-cases. The alternative would be to
732cd75b8c920d Quentin Perret            2018-12-03  6849   * bias new tasks towards specific types of CPUs first, or to try to infer
732cd75b8c920d Quentin Perret            2018-12-03  6850   * their util_avg from the parent task, but those heuristics could hurt
732cd75b8c920d Quentin Perret            2018-12-03  6851   * other use-cases too. So, until someone finds a better way to solve this,
732cd75b8c920d Quentin Perret            2018-12-03  6852   * let's keep things simple by re-using the existing slow path.
732cd75b8c920d Quentin Perret            2018-12-03  6853   */
732cd75b8c920d Quentin Perret            2018-12-03  6854  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
732cd75b8c920d Quentin Perret            2018-12-03  6855  {
04409022d37d2c Dietmar Eggemann          2022-04-12  6856  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
eb92692b2544d3 Quentin Perret            2019-09-12  6857  	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
ce557de247bd64 Vincent Donnefort         2022-04-12  6858  	struct root_domain *rd = this_rq()->rd;
5311f1261af84b Vincent Donnefort         2022-04-12  6859  	int cpu, best_energy_cpu, target = -1;
732cd75b8c920d Quentin Perret            2018-12-03  6860  	struct sched_domain *sd;
eb92692b2544d3 Quentin Perret            2019-09-12  6861  	struct perf_domain *pd;
ce557de247bd64 Vincent Donnefort         2022-04-12  6862  	struct energy_env eenv;
732cd75b8c920d Quentin Perret            2018-12-03  6863  
732cd75b8c920d Quentin Perret            2018-12-03  6864  	rcu_read_lock();
732cd75b8c920d Quentin Perret            2018-12-03  6865  	pd = rcu_dereference(rd->pd);
732cd75b8c920d Quentin Perret            2018-12-03  6866  	if (!pd || READ_ONCE(rd->overutilized))
619e090c8e409e Pierre Gondois            2021-05-04  6867  		goto unlock;
732cd75b8c920d Quentin Perret            2018-12-03  6868  
732cd75b8c920d Quentin Perret            2018-12-03  6869  	/*
732cd75b8c920d Quentin Perret            2018-12-03  6870  	 * Energy-aware wake-up happens on the lowest sched_domain starting
732cd75b8c920d Quentin Perret            2018-12-03  6871  	 * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu.
732cd75b8c920d Quentin Perret            2018-12-03  6872  	 */
732cd75b8c920d Quentin Perret            2018-12-03  6873  	sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity));
732cd75b8c920d Quentin Perret            2018-12-03  6874  	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
732cd75b8c920d Quentin Perret            2018-12-03  6875  		sd = sd->parent;
732cd75b8c920d Quentin Perret            2018-12-03  6876  	if (!sd)
619e090c8e409e Pierre Gondois            2021-05-04  6877  		goto unlock;
619e090c8e409e Pierre Gondois            2021-05-04  6878  
619e090c8e409e Pierre Gondois            2021-05-04  6879  	target = prev_cpu;
732cd75b8c920d Quentin Perret            2018-12-03  6880  
732cd75b8c920d Quentin Perret            2018-12-03  6881  	sync_entity_load_avg(&p->se);
732cd75b8c920d Quentin Perret            2018-12-03  6882  	if (!task_util_est(p))
732cd75b8c920d Quentin Perret            2018-12-03  6883  		goto unlock;
732cd75b8c920d Quentin Perret            2018-12-03  6884  
ce557de247bd64 Vincent Donnefort         2022-04-12  6885  	eenv_task_busy_time(&eenv, p, prev_cpu);
ce557de247bd64 Vincent Donnefort         2022-04-12  6886  
732cd75b8c920d Quentin Perret            2018-12-03  6887  	for (; pd; pd = pd->next) {
ce557de247bd64 Vincent Donnefort         2022-04-12  6888  		unsigned long cpu_cap, cpu_thermal_cap, util;
ce557de247bd64 Vincent Donnefort         2022-04-12  6889  		unsigned long cur_delta, max_spare_cap = 0;
8d4c97c105ca07 Pierre Gondois            2021-05-04  6890  		bool compute_prev_delta = false;
732cd75b8c920d Quentin Perret            2018-12-03  6891  		int max_spare_cap_cpu = -1;
5311f1261af84b Vincent Donnefort         2022-04-12  6892  		unsigned long base_energy;
732cd75b8c920d Quentin Perret            2018-12-03  6893  
04409022d37d2c Dietmar Eggemann          2022-04-12  6894  		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
04409022d37d2c Dietmar Eggemann          2022-04-12  6895  
ce557de247bd64 Vincent Donnefort         2022-04-12  6896  		/* Account thermal pressure for the energy estimation */
ce557de247bd64 Vincent Donnefort         2022-04-12  6897  		cpu = cpumask_first(cpus);
ce557de247bd64 Vincent Donnefort         2022-04-12  6898  		cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
ce557de247bd64 Vincent Donnefort         2022-04-12  6899  		cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
ce557de247bd64 Vincent Donnefort         2022-04-12  6900  
ce557de247bd64 Vincent Donnefort         2022-04-12  6901  		eenv.cpu_cap = cpu_thermal_cap;
ce557de247bd64 Vincent Donnefort         2022-04-12  6902  		eenv.pd_cap = 0;
ce557de247bd64 Vincent Donnefort         2022-04-12  6903  
ce557de247bd64 Vincent Donnefort         2022-04-12  6904  		for_each_cpu(cpu, cpus) {
ce557de247bd64 Vincent Donnefort         2022-04-12  6905  			eenv.pd_cap += cpu_thermal_cap;
ce557de247bd64 Vincent Donnefort         2022-04-12  6906  
ce557de247bd64 Vincent Donnefort         2022-04-12  6907  			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
ce557de247bd64 Vincent Donnefort         2022-04-12  6908  				continue;
ce557de247bd64 Vincent Donnefort         2022-04-12  6909  
3bd3706251ee8a Sebastian Andrzej Siewior 2019-04-23  6910  			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
732cd75b8c920d Quentin Perret            2018-12-03  6911  				continue;
732cd75b8c920d Quentin Perret            2018-12-03  6912  
732cd75b8c920d Quentin Perret            2018-12-03  6913  			util = cpu_util_next(cpu, p, cpu);
732cd75b8c920d Quentin Perret            2018-12-03  6914  			cpu_cap = capacity_of(cpu);
1d42509e475cdc Valentin Schneider        2019-12-11  6915  
1d42509e475cdc Valentin Schneider        2019-12-11  6916  			/*
1d42509e475cdc Valentin Schneider        2019-12-11  6917  			 * Skip CPUs that cannot satisfy the capacity request.
1d42509e475cdc Valentin Schneider        2019-12-11  6918  			 * IOW, placing the task there would make the CPU
1d42509e475cdc Valentin Schneider        2019-12-11  6919  			 * overutilized. Take uclamp into account to see how
1d42509e475cdc Valentin Schneider        2019-12-11  6920  			 * much capacity we can get out of the CPU; this is
a5418be9dffe70 Viresh Kumar              2020-12-08  6921  			 * aligned with sched_cpu_util().
1d42509e475cdc Valentin Schneider        2019-12-11  6922  			 */
1d42509e475cdc Valentin Schneider        2019-12-11  6923  			util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
60e17f5cef838e Viresh Kumar              2019-06-04  6924  			if (!fits_capacity(util, cpu_cap))
732cd75b8c920d Quentin Perret            2018-12-03  6925  				continue;
732cd75b8c920d Quentin Perret            2018-12-03  6926  
ce557de247bd64 Vincent Donnefort         2022-04-12  6927  			lsub_positive(&cpu_cap, util);
ce557de247bd64 Vincent Donnefort         2022-04-12  6928  
732cd75b8c920d Quentin Perret            2018-12-03  6929  			if (cpu == prev_cpu) {
8d4c97c105ca07 Pierre Gondois            2021-05-04  6930  				/* Always use prev_cpu as a candidate. */
8d4c97c105ca07 Pierre Gondois            2021-05-04  6931  				compute_prev_delta = true;
ce557de247bd64 Vincent Donnefort         2022-04-12  6932  			} else if (cpu_cap > max_spare_cap) {
732cd75b8c920d Quentin Perret            2018-12-03  6933  				/*
8d4c97c105ca07 Pierre Gondois            2021-05-04  6934  				 * Find the CPU with the maximum spare capacity
8d4c97c105ca07 Pierre Gondois            2021-05-04  6935  				 * in the performance domain.
732cd75b8c920d Quentin Perret            2018-12-03  6936  				 */
ce557de247bd64 Vincent Donnefort         2022-04-12  6937  				max_spare_cap = cpu_cap;
732cd75b8c920d Quentin Perret            2018-12-03  6938  				max_spare_cap_cpu = cpu;
732cd75b8c920d Quentin Perret            2018-12-03  6939  			}
732cd75b8c920d Quentin Perret            2018-12-03  6940  		}
732cd75b8c920d Quentin Perret            2018-12-03  6941  
8d4c97c105ca07 Pierre Gondois            2021-05-04  6942  		if (max_spare_cap_cpu < 0 && !compute_prev_delta)
8d4c97c105ca07 Pierre Gondois            2021-05-04  6943  			continue;
8d4c97c105ca07 Pierre Gondois            2021-05-04  6944  
8d4c97c105ca07 Pierre Gondois            2021-05-04  6945  		/* Compute the 'base' energy of the pd, without @p */
ce557de247bd64 Vincent Donnefort         2022-04-12  6946  		eenv_pd_busy_time(&eenv, cpus, p);
5311f1261af84b Vincent Donnefort         2022-04-12  6947  		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
8d4c97c105ca07 Pierre Gondois            2021-05-04  6948  
8d4c97c105ca07 Pierre Gondois            2021-05-04  6949  		/* Evaluate the energy impact of using prev_cpu. */
8d4c97c105ca07 Pierre Gondois            2021-05-04  6950  		if (compute_prev_delta) {
ce557de247bd64 Vincent Donnefort         2022-04-12  6951  			prev_delta = compute_energy(&eenv, pd, cpus, p,
ce557de247bd64 Vincent Donnefort         2022-04-12  6952  						    prev_cpu);
5311f1261af84b Vincent Donnefort         2022-04-12  6953  			if (prev_delta < base_energy)
619e090c8e409e Pierre Gondois            2021-05-04  6954  				goto unlock;
5311f1261af84b Vincent Donnefort         2022-04-12  6955  			prev_delta -= base_energy;
8d4c97c105ca07 Pierre Gondois            2021-05-04  6956  			best_delta = min(best_delta, prev_delta);
8d4c97c105ca07 Pierre Gondois            2021-05-04  6957  		}
8d4c97c105ca07 Pierre Gondois            2021-05-04  6958  
8d4c97c105ca07 Pierre Gondois            2021-05-04  6959  		/* Evaluate the energy impact of using max_spare_cap_cpu. */
8d4c97c105ca07 Pierre Gondois            2021-05-04  6960  		if (max_spare_cap_cpu >= 0) {
ce557de247bd64 Vincent Donnefort         2022-04-12  6961  			cur_delta = compute_energy(&eenv, pd, cpus, p,
ce557de247bd64 Vincent Donnefort         2022-04-12  6962  						   max_spare_cap_cpu);
5311f1261af84b Vincent Donnefort         2022-04-12  6963  			if (cur_delta < base_energy)
619e090c8e409e Pierre Gondois            2021-05-04  6964  				goto unlock;
5311f1261af84b Vincent Donnefort         2022-04-12  6965  			cur_delta -= base_energy;
eb92692b2544d3 Quentin Perret            2019-09-12  6966  			if (cur_delta < best_delta) {
eb92692b2544d3 Quentin Perret            2019-09-12  6967  				best_delta = cur_delta;
732cd75b8c920d Quentin Perret            2018-12-03  6968  				best_energy_cpu = max_spare_cap_cpu;
732cd75b8c920d Quentin Perret            2018-12-03  6969  			}
732cd75b8c920d Quentin Perret            2018-12-03  6970  		}
732cd75b8c920d Quentin Perret            2018-12-03  6971  	}
732cd75b8c920d Quentin Perret            2018-12-03  6972  	rcu_read_unlock();
732cd75b8c920d Quentin Perret            2018-12-03  6973  
5311f1261af84b Vincent Donnefort         2022-04-12  6974  	if (best_delta < prev_delta)
619e090c8e409e Pierre Gondois            2021-05-04 @6975  		target = best_energy_cpu;
732cd75b8c920d Quentin Perret            2018-12-03  6976  
619e090c8e409e Pierre Gondois            2021-05-04  6977  	return target;
732cd75b8c920d Quentin Perret            2018-12-03  6978  
619e090c8e409e Pierre Gondois            2021-05-04  6979  unlock:
732cd75b8c920d Quentin Perret            2018-12-03  6980  	rcu_read_unlock();
732cd75b8c920d Quentin Perret            2018-12-03  6981  
619e090c8e409e Pierre Gondois            2021-05-04  6982  	return target;
732cd75b8c920d Quentin Perret            2018-12-03  6983  }
732cd75b8c920d Quentin Perret            2018-12-03  6984  
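
Reading the excerpt above, the warning looks like a false positive:
best_energy_cpu (declared at 6859) is only written at 6968, and best_delta
can only drop strictly below prev_delta through that same branch, so the
read at 6975 is reachable only after the assignment. That correlation
between best_delta and best_energy_cpu is a cross-variable invariant that
static checkers typically cannot prove. A stripped-down sketch of the same
shape (plain C, not kernel code, names invented) shows the pattern the tool
trips over:

#include <limits.h>
#include <stdio.h>

static int pick(unsigned long cur_delta, unsigned long prev_delta, int candidate)
{
	unsigned long best_delta = ULONG_MAX;
	int best_cpu;				/* no initializer, like best_energy_cpu */
	int target = -1;

	if (cur_delta < best_delta) {		/* the only assignment of best_cpu */
		best_delta = cur_delta;
		best_cpu = candidate;
	}

	if (best_delta < prev_delta)		/* can only be true if the branch  */
		target = best_cpu;		/* above ran, but proving that     */
						/* needs the link between the two  */
	return target;				/* variables                       */
}

int main(void)
{
	printf("%d\n", pick(10, 20, 3));	/* prints 3 */
	return 0;
}

One common way to silence this class of warning is to give the variable a
safe initial value (for instance prev_cpu at declaration), though whether
that is the right resolution here is for the patch author to decide.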

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-04-22 15:14 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-12 13:42 [PATCH v3 0/7] feec() energy margin removal Vincent Donnefort
2022-04-12 13:42 ` [PATCH v4 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
2022-04-19 12:42   ` Tao Zhou
2022-04-22 11:10     ` Vincent Donnefort
2022-04-12 13:42 ` [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration Vincent Donnefort
2022-04-19 10:08   ` Vincent Guittot
2022-04-19 12:23     ` Vincent Donnefort
2022-04-19 16:27       ` Vincent Guittot
2022-04-20  9:34         ` Vincent Donnefort
2022-04-20 10:25           ` Vincent Donnefort
2022-04-21  7:36           ` Vincent Guittot
2022-04-22 15:14   ` Tao Zhou
2022-04-12 13:42 ` [PATCH v4 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Vincent Donnefort
2022-04-12 13:42 ` [PATCH v4 4/7] sched/fair: Rename select_idle_mask to select_rq_mask Vincent Donnefort
2022-04-12 13:42 ` [PATCH v4 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() Vincent Donnefort
2022-04-12 13:42 ` [PATCH v4 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
2022-04-12 13:42 ` [PATCH v4 7/7] sched/fair: Remove the energy margin " Vincent Donnefort
2022-04-14 19:32 ` [PATCH v3 0/7] feec() energy margin removal Dietmar Eggemann
2022-04-15 11:58 [PATCH v4 7/7] sched/fair: Remove the energy margin in feec() kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.