* [PATCH v2 0/7] feec() energy margin removal
@ 2022-01-12 16:12 Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba,
	Vincent Donnefort

find_energy_efficient_cpu() (feec()) will migrate a task to save energy only
if the move saves at least 6% of the total energy consumed by the system. This
conservative approach is a problem on a system where a lot of small tasks
create a high overall load: very few of them will be allowed to migrate to a
smaller CPU, wasting a lot of energy. Instead of trying to determine yet
another margin, let's try to remove it.

The first elements of this patch-set are various fixes and improvements that
stabilize task_util and ensure energy comparison fairness across all CPUs of
the topology. Only once those are in place can we completely remove the margin
and let feec() aggressively place tasks and save energy.

This has been validated in two different ways:

First, using LISA's eas_behaviour test suite. This is composed of a set of
scenarios and verifies whether the task placement is optimal. No failures have
been observed, and the change also improved some tests such as Ramp-Down (as
the placement is now more energy oriented) and *ThreeSmall (as no bouncing
between clusters happens anymore).

  * Hikey960: 100% PASSED
  * DB-845C:  100% PASSED
  * RB5:      100% PASSED

Second, using an Android benchmark: PCMark2 on a Pixel4, with a lot of
backports to have a scheduler as close as possible to mainline.

  +------------+-----------------+-----------------+
  |    Test    |      Perf       |    Energy [1]   |
  +------------+-----------------+-----------------+
  | Web2       | -0.3% pval 0.03 | -1.8% pval 0.00 |
  | Video2     | -0.3% pval 0.13 | -5.6% pval 0.00 |
  | Photo2 [2] | -3.8% pval 0.00 | -1%   pval 0.00 |
  | Writing2   |  0%   pval 0.13 | -1%   pval 0.00 |
  | Data2      |  0%   pval 0.8  | -0.43% pval 0.00|
  +------------+-----------------+-----------------+ 

The margin removal lets the kernel make the best use of the Energy Model:
tasks are more likely to be placed where they fit, which saves a substantial
amount of energy while having a limited impact on performance.

[1] This is an energy estimation based on the CPU activity and the Energy Model
for this device. "All models are wrong but some are useful"; yes, this is an
imperfect estimation that doesn't take into account some idle states and shared
power rails. Nonetheless, it is based on the information the kernel has at
runtime and it shows that the scheduler can make better decisions based solely
on that data.

[2] This is the only performance impact observed. The debugging of this test
showed no issue with task placement. The better score was solely due to some

v1 -> v2:
  - Fix PELT migration last_update_time (previously root cfs_rq's).
  - Add Dietmar's patches to refactor feec()'s CPU loop.
  - feec(): renaming busy time functions get_{pd,tsk}_busy_time()
  - feec(): pd_cap computation in the first for_each_cpu loop.
  - feec(): create get_pd_max_util() function (previously within compute_energy())
  - feec(): rename base_energy_pd to base_energy.

Dietmar Eggemann (3):
  sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
  sched/fair: Rename select_idle_mask to select_rq_mask
  sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()

Vincent Donnefort (4):
  sched/fair: Provide u64 read for 32-bits arch helper
  sched/fair: Decay task PELT values during migration
  sched/fair: Remove task_util from effective utilization in feec()
  sched/fair: Remove the energy margin in feec()

 drivers/powercap/dtpm_cpu.c       |  33 +---
 drivers/thermal/cpufreq_cooling.c |   6 +-
 include/linux/sched.h             |   2 +-
 kernel/sched/core.c               |  22 ++-
 kernel/sched/cpufreq_schedutil.c  |   5 +-
 kernel/sched/fair.c               | 313 ++++++++++++++++--------------
 kernel/sched/sched.h              |  48 ++++-
 7 files changed, 238 insertions(+), 191 deletions(-)

-- 
2.25.1



* [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper
  2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
@ 2022-01-12 16:12 ` Vincent Donnefort
  2022-01-17 16:11   ` Tao Zhou
  2022-01-12 16:12 ` [PATCH v2 2/7] sched/fair: Decay task PELT values during migration Vincent Donnefort
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba,
	Vincent Donnefort

Introduce the macro helpers u64_u32_{store,load}() to factorize lockless
accesses to u64 variables on 32-bit architectures.

The users are, for now, cfs_rq.min_vruntime and sched_avg.last_update_time.
To accommodate the latter, where the copy lies outside of the structure
(cfs_rq.last_update_time_copy instead of sched_avg.last_update_time_copy),
use the _copy() version of those helpers.

Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
therefore have a small penalty in set_task_rq_fair() and init_cfs_rq().
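
For illustration only (not part of the patch), the expected usage pattern
on both sides is:

  /* writer (rq lock held): store the value, then its copy, ordered by smp_wmb() */
  u64_u32_store(cfs_rq->min_vruntime, vruntime);

  /* lockless reader: re-read until the value and its copy match */
  min_vruntime = u64_u32_load(cfs_rq->min_vruntime);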

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..99ea9540ece4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -568,11 +568,8 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
-#ifndef CONFIG_64BIT
-	smp_wmb();
-	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
-#endif
+	u64_u32_store(cfs_rq->min_vruntime,
+		      max_vruntime(cfs_rq->min_vruntime, vruntime));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -3246,6 +3243,11 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
 }
 
 #ifdef CONFIG_SMP
+static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
+{
+	return u64_u32_load_copy(cfs_rq->avg.last_update_time,
+				 cfs_rq->last_update_time_copy);
+}
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
@@ -3356,27 +3358,9 @@ void set_task_rq_fair(struct sched_entity *se,
 	if (!(se->avg.last_update_time && prev))
 		return;
 
-#ifndef CONFIG_64BIT
-	{
-		u64 p_last_update_time_copy;
-		u64 n_last_update_time_copy;
-
-		do {
-			p_last_update_time_copy = prev->load_last_update_time_copy;
-			n_last_update_time_copy = next->load_last_update_time_copy;
-
-			smp_rmb();
+	p_last_update_time = cfs_rq_last_update_time(prev);
+	n_last_update_time = cfs_rq_last_update_time(next);
 
-			p_last_update_time = prev->avg.last_update_time;
-			n_last_update_time = next->avg.last_update_time;
-
-		} while (p_last_update_time != p_last_update_time_copy ||
-			 n_last_update_time != n_last_update_time_copy);
-	}
-#else
-	p_last_update_time = prev->avg.last_update_time;
-	n_last_update_time = next->avg.last_update_time;
-#endif
 	__update_load_avg_blocked_se(p_last_update_time, se);
 	se->avg.last_update_time = n_last_update_time;
 }
@@ -3700,8 +3684,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
 
 #ifndef CONFIG_64BIT
-	smp_wmb();
-	cfs_rq->load_last_update_time_copy = sa->last_update_time;
+	u64_u32_store_copy(sa->last_update_time,
+			   cfs_rq->last_update_time_copy,
+			   sa->last_update_time);
 #endif
 
 	return decayed;
@@ -3834,27 +3819,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	}
 }
 
-#ifndef CONFIG_64BIT
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
-	u64 last_update_time_copy;
-	u64 last_update_time;
-
-	do {
-		last_update_time_copy = cfs_rq->load_last_update_time_copy;
-		smp_rmb();
-		last_update_time = cfs_rq->avg.last_update_time;
-	} while (last_update_time != last_update_time_copy);
-
-	return last_update_time;
-}
-#else
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->avg.last_update_time;
-}
-#endif
-
 /*
  * Synchronize entity load avg of dequeued entity without locking
  * the previous rq.
@@ -6904,21 +6868,8 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	if (READ_ONCE(p->__state) == TASK_WAKING) {
 		struct sched_entity *se = &p->se;
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		u64 min_vruntime;
 
-#ifndef CONFIG_64BIT
-		u64 min_vruntime_copy;
-
-		do {
-			min_vruntime_copy = cfs_rq->min_vruntime_copy;
-			smp_rmb();
-			min_vruntime = cfs_rq->min_vruntime;
-		} while (min_vruntime != min_vruntime_copy);
-#else
-		min_vruntime = cfs_rq->min_vruntime;
-#endif
-
-		se->vruntime -= min_vruntime;
+		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
 	}
 
 	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
@@ -11362,10 +11313,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
-#ifndef CONFIG_64BIT
-	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
-#endif
+	u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be905739..f1a445efdc63 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -528,6 +528,45 @@ struct cfs_bandwidth { };
 
 #endif	/* CONFIG_CGROUP_SCHED */
 
+/*
+ * u64_u32_load/u64_u32_store
+ *
+ * Use a copy of a u64 value to protect against data race. This is only
+ * applicable for 32-bits architectures.
+ */
+#ifdef CONFIG_64BIT
+# define u64_u32_load_copy(var, copy)       var
+# define u64_u32_store_copy(var, copy, val) (var = val)
+#else
+# define u64_u32_load_copy(var, copy)					\
+({									\
+	u64 __val, __val_copy;						\
+	do {								\
+		__val_copy = copy;					\
+		/*							\
+		 * paired with u64_u32_store, ordering access		\
+		 * to var and copy.					\
+		 */							\
+		smp_rmb();						\
+		__val = var;						\
+	} while (__val != __val_copy);					\
+	__val;								\
+})
+# define u64_u32_store_copy(var, copy, val)				\
+do {									\
+	typeof(val) __val = (val);					\
+	var = __val;							\
+	/*								\
+	 * paired with u64_u32_load, ordering access to var and		\
+	 * copy.							\
+	 */								\
+	smp_wmb();							\
+	copy = __val;							\
+} while (0)
+#endif
+# define u64_u32_load(var)      u64_u32_load_copy(var, var##_copy)
+# define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val)
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
@@ -568,7 +607,7 @@ struct cfs_rq {
 	 */
 	struct sched_avg	avg;
 #ifndef CONFIG_64BIT
-	u64			load_last_update_time_copy;
+	u64			last_update_time_copy;
 #endif
 	struct {
 		raw_spinlock_t	lock ____cacheline_aligned;
-- 
2.25.1



* [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
@ 2022-01-12 16:12 ` Vincent Donnefort
  2022-01-17 17:31   ` Vincent Guittot
  2022-01-12 16:12 ` [PATCH v2 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Vincent Donnefort
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba,
	Vincent Donnefort

Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update (B) will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. If the clock hasn't been updated while the CPU is
idle, clock_pelt will not be aligned with clock_task and that time (A)
will also be lost.

   ---------|----- A -----|-----------|------- B -----|>
        clock_pelt   clock_task     clock            now

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead, estimate the missing time and update
the task's util_avg with that value:

  A + B = clock_task - clock_pelt + sched_clock_cpu() - clock

Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. The new runqueue clock_pelt_lag is therefore created and
encodes those three values.

  clock_pelt_lag = clock - clock_task + clock_pelt

And we can then write the missing time as follows:

  A + B = sched_clock_cpu() - clock_pelt_lag

The B part of the missing time is, however, an estimation that doesn't
take IRQ and paravirt time into account.

Now that we have an estimation for A + B, we can create an estimator for
the PELT clock at the time of the migration. For this purpose, we need to
inject last_update_time, which is a combination of both clock_pelt and
lost_idle_time. The latter is a time value which is completely lost from a
PELT point of view and must be ignored. And finally, we can write:

  rq_clock_pelt_estimator() = last_update_time + A + B
                            = last_update_time +
                                   sched_clock_cpu() - clock_pelt_lag

[1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
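
As a purely illustrative example with made-up numbers: if, at migration
time, clock_pelt = 100, clock_task = 130 and clock = 150 (hence
clock_pelt_lag = 150 - 130 + 100 = 120) and sched_clock_cpu() returns 170,
the estimated missing time is A + B = 170 - 120 = 50, i.e. the 30 lost to
PELT clock scaling (A) plus the 20 elapsed since the last clock update (B).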

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06cf7620839a..11c6aeef4583 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -618,6 +618,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	}
 }
 
+static void update_rq_clock_pelt_lag(struct rq *rq)
+{
+	u64_u32_store(rq->clock_pelt_lag,
+		      rq->clock - rq->clock_task + rq->clock_pelt);
+}
+
 /*
  * RQ-clock updating methods:
  */
@@ -674,6 +680,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 		update_irq_load_avg(rq, irq_delta + steal);
 #endif
 	update_rq_clock_pelt(rq, delta);
+	update_rq_clock_pelt_lag(rq);
 }
 
 void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99ea9540ece4..046d5397eb8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6852,6 +6852,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 
 static void detach_entity_cfs_rq(struct sched_entity *se);
 
+static u64 rq_clock_pelt_estimator(struct rq *rq, u64 last_update_time)
+{
+	u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
+		       u64_u32_load(rq->clock_pelt_lag);
+
+	return last_update_time + pelt_lag;
+}
+
 /*
  * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -6859,6 +6867,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
  */
 static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
+	struct sched_entity *se = &p->se;
+	struct rq *rq = task_rq(p);
+
 	/*
 	 * As blocked tasks retain absolute vruntime the migration needs to
 	 * deal with this by subtracting the old and adding the new
@@ -6866,7 +6877,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	 * the task on the new runqueue.
 	 */
 	if (READ_ONCE(p->__state) == TASK_WAKING) {
-		struct sched_entity *se = &p->se;
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
@@ -6877,26 +6887,32 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_rq_held(task_rq(p));
-		detach_entity_cfs_rq(&p->se);
+		lockdep_assert_rq_held(rq);
+		detach_entity_cfs_rq(se);
 
 	} else {
+		u64 now;
+
+		remove_entity_load_avg(se);
+
 		/*
-		 * We are supposed to update the task to "current" time, then
-		 * its up to date and ready to go to new CPU/cfs_rq. But we
-		 * have difficulty in getting what current time is, so simply
-		 * throw away the out-of-date time. This will result in the
-		 * wakee task is less decayed, but giving the wakee more load
-		 * sounds not bad.
+		 * Here, the task's PELT values have been updated according to
+		 * the current rq's clock. But if that clock hasn't been
+		 * updated in a while, a substantial idle time will be missed,
+		 * leading to an inflation after wake-up on the new rq.
+		 *
+		 * Estimate the PELT clock lag, and update sched_avg to ensure
+		 * PELT continuity after migration.
 		 */
-		remove_entity_load_avg(&p->se);
+		now = rq_clock_pelt_estimator(rq, se->avg.last_update_time);
+		__update_load_avg_blocked_se(now, se);
 	}
 
 	/* Tell new CPU we are migrated */
-	p->se.avg.last_update_time = 0;
+	se->avg.last_update_time = 0;
 
 	/* We have migrated, no longer consider this task hot */
-	p->se.exec_start = 0;
+	se->exec_start = 0;
 
 	update_scan_period(p, new_cpu);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f1a445efdc63..fdf2a9e54c0e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1027,8 +1027,13 @@ struct rq {
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
 	u64			clock_pelt;
+	u64			clock_pelt_lag;
 	unsigned long		lost_idle_time;
 
+#ifndef CONFIG_64BIT
+	u64                     clock_pelt_lag_copy;
+#endif
+
 	atomic_t		nr_iowait;
 
 #ifdef CONFIG_SCHED_DEBUG
-- 
2.25.1



* [PATCH v2 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
  2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 2/7] sched/fair: Decay task PELT values during migration Vincent Donnefort
@ 2022-01-12 16:12 ` Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 4/7] sched/fair: Rename select_idle_mask to select_rq_mask Vincent Donnefort
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

effective_cpu_util() already has an `int cpu' parameter which allows
retrieving the CPU capacity scale factor (or maximum CPU capacity) inside
this function via arch_scale_cpu_capacity(cpu).

A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of the arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read (via
per_cpu(cpu_scale, cpu)) on such a system.
On all other architectures it is a compile-time constant
(SCHED_CAPACITY_SCALE).
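
As a sketch of the resulting API change (mirroring the cpufreq_cooling.c
hunk below), a caller that previously did:

  max = arch_scale_cpu_capacity(cpu);
  util = sched_cpu_util(cpu, max);

now simply does:

  util = sched_cpu_util(cpu);

and only fetches the capacity itself if it still needs it.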

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index b740866b228d..0d57bcf83ae5 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -70,34 +70,19 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 
 static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
 {
-	unsigned long max = 0, sum_util = 0;
+	unsigned long max, sum_util = 0;
 	int cpu;
 
-	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-
-		/*
-		 * The capacity is the same for all CPUs belonging to
-		 * the same perf domain, so a single call to
-		 * arch_scale_cpu_capacity() is enough. However, we
-		 * need the CPU parameter to be initialized by the
-		 * loop, so the call ends up in this block.
-		 *
-		 * We can initialize 'max' with a cpumask_first() call
-		 * before the loop but the bits computation is not
-		 * worth given the arch_scale_cpu_capacity() just
-		 * returns a value where the resulting assembly code
-		 * will be optimized by the compiler.
-		 */
-		max = arch_scale_cpu_capacity(cpu);
-		sum_util += sched_cpu_util(cpu, max);
-	}
-
 	/*
-	 * In the improbable case where all the CPUs of the perf
-	 * domain are offline, 'max' will be zero and will lead to an
-	 * illegal operation with a zero division.
+	 * The capacity is the same for all CPUs belonging to
+	 * the same perf domain.
 	 */
-	return max ? (power * ((sum_util << 10) / max)) >> 10 : 0;
+	max = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+
+	for_each_cpu_and(cpu, pd_mask, cpu_online_mask)
+		sum_util += sched_cpu_util(cpu);
+
+	return (power * ((sum_util << 10) / max)) >> 10;
 }
 
 static u64 get_pd_power_uw(struct dtpm *dtpm)
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index 43b1ae8a7789..f499f3c0e633 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -137,11 +137,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
 		    int cpu_idx)
 {
-	unsigned long max = arch_scale_cpu_capacity(cpu);
-	unsigned long util;
+	unsigned long util = sched_cpu_util(cpu);
 
-	util = sched_cpu_util(cpu, max);
-	return (util * 100) / max;
+	return (util * 100) / arch_scale_cpu_capacity(cpu);
 }
 #else /* !CONFIG_SMP */
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2e261adb8ea..d9e672ab71f8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2176,7 +2176,7 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
 #ifdef CONFIG_SMP
 /* Returns effective CPU energy utilization, as seen by the scheduler */
-unsigned long sched_cpu_util(int cpu, unsigned long max);
+unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 11c6aeef4583..668ffae1888e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7086,12 +7086,14 @@ struct task_struct *idle_task(int cpu)
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum cpu_util_type type,
+				 enum cpu_util_type type,
 				 struct task_struct *p)
 {
-	unsigned long dl_util, util, irq;
+	unsigned long dl_util, util, irq, max;
 	struct rq *rq = cpu_rq(cpu);
 
+	max = arch_scale_cpu_capacity(cpu);
+
 	if (!uclamp_is_used() &&
 	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
 		return max;
@@ -7171,10 +7173,9 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	return min(max, util);
 }
 
-unsigned long sched_cpu_util(int cpu, unsigned long max)
+unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), max,
-				  ENERGY_UTIL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 26778884d9ab..9b88fc8c6ea8 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -164,11 +164,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
 
-	sg_cpu->max = max;
+	sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu);
 	sg_cpu->bw_dl = cpu_bw_dl(rq);
-	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
+	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
 					  FREQUENCY_UTIL, NULL);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 046d5397eb8a..4fc63deda620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6558,12 +6558,11 @@ static long
 compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 {
 	struct cpumask *pd_mask = perf_domain_span(pd);
-	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
-	unsigned long max_util = 0, sum_util = 0;
-	unsigned long _cpu_cap = cpu_cap;
+	unsigned long max_util = 0, sum_util = 0, cpu_cap;
 	int cpu;
 
-	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+	cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
 
 	/*
 	 * The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6600,10 +6599,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
-					      ENERGY_UTIL, NULL);
+		cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
+					      NULL);
 
-		sum_util += min(cpu_util, _cpu_cap);
+		sum_util += min(cpu_util, cpu_cap);
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -6612,12 +6611,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
-					      FREQUENCY_UTIL, tsk);
-		max_util = max(max_util, min(cpu_util, _cpu_cap));
+		cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
+					      tsk);
+		max_util = max(max_util, min(cpu_util, cpu_cap));
 	}
 
-	return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
+	return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fdf2a9e54c0e..135c37358dc0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2997,7 +2997,7 @@ enum cpu_util_type {
 };
 
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum cpu_util_type type,
+				 enum cpu_util_type type,
 				 struct task_struct *p);
 
 static inline unsigned long cpu_bw_dl(struct rq *rq)
-- 
2.25.1



* [PATCH v2 4/7] sched/fair: Rename select_idle_mask to select_rq_mask
  2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
                   ` (2 preceding siblings ...)
  2022-01-12 16:12 ` [PATCH v2 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Vincent Donnefort
@ 2022-01-12 16:12 ` Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() Vincent Donnefort
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Decouple the name of the per-cpu cpumask select_idle_mask from its usage
in select_idle_[cpu/capacity]() of the CFS run-queue selection
(select_task_rq_fair()).

This is to support the reuse of this cpumask in the Energy Aware
Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
selection.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 668ffae1888e..9808f3f33417 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9305,7 +9305,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
-DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_rq_mask);
 
 void __init sched_init(void)
 {
@@ -9354,7 +9354,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
-		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+		per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4fc63deda620..291be5c00044 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5709,7 +5709,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 
@@ -6199,7 +6199,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
@@ -6285,7 +6285,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
-	cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	task_util = uclamp_task_util(p);
-- 
2.25.1



* [PATCH v2 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
  2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
                   ` (3 preceding siblings ...)
  2022-01-12 16:12 ` [PATCH v2 4/7] sched/fair: Rename select_idle_mask to select_rq_mask Vincent Donnefort
@ 2022-01-12 16:12 ` Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
  2022-01-12 16:12 ` [PATCH v2 7/7] sched/fair: Remove the energy margin " Vincent Donnefort
  6 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
invariant after Energy Model creation, i.e. it is not updated after
CPU hotplug operations.

That's why the PD mask is used in conjunction with the cpu_online_mask
(or the Sched Domain cpumask). As a result, the cpu_online_mask is fetched
multiple times (in compute_energy()) during a single run-queue selection
for a task.

cpu_online_mask may change during this time, which can lead to wrong
energy calculations.

To be able to avoid this, use the select_rq_mask per-cpu cpumask to
create a cpumask out of PD cpumask and cpu_online_mask and pass it
through the function calls of the EAS run-queue selection path.

The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
(find_energy_efficient_cpu()) is now ANDed not only with the SD mask
but also with the cpu_online_mask. This is fine since this cpumask
has to be in sync with the one used for the energy computation
(compute_energy()).
An exclusive cpuset setup with at least one asymmetric CPU capacity
island (hence the additional AND with the SD cpumask) is the obvious
exception here.
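
Schematically (this only summarizes the change below):

  cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

is done once per PD in find_energy_efficient_cpu(), and that same 'cpus'
mask is then used both for the candidate-CPU iteration and for the
compute_energy() calls.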

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 291be5c00044..cfc0d9b3eb19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6555,14 +6555,14 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
  * task.
  */
 static long
-compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
+compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
+	       struct perf_domain *pd)
 {
-	struct cpumask *pd_mask = perf_domain_span(pd);
 	unsigned long max_util = 0, sum_util = 0, cpu_cap;
 	int cpu;
 
-	cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
-	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+	cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
+	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
 
 	/*
 	 * The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6573,7 +6573,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	 * If an entire pd is outside of the current rd, it will not appear in
 	 * its pd list and will not be accounted by compute_energy().
 	 */
-	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+	for_each_cpu(cpu, cpus) {
 		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
 		unsigned long cpu_util, util_running = util_freq;
 		struct task_struct *tsk = NULL;
@@ -6660,6 +6660,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
  */
 static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
 	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
 	int cpu, best_energy_cpu = prev_cpu, target = -1;
@@ -6694,7 +6695,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		unsigned long base_energy_pd;
 		int max_spare_cap_cpu = -1;
 
-		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
+		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
+
+		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
@@ -6731,12 +6734,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			continue;
 
 		/* Compute the 'base' energy of the pd, without @p */
-		base_energy_pd = compute_energy(p, -1, pd);
+		base_energy_pd = compute_energy(p, -1, cpus, pd);
 		base_energy += base_energy_pd;
 
 		/* Evaluate the energy impact of using prev_cpu. */
 		if (compute_prev_delta) {
-			prev_delta = compute_energy(p, prev_cpu, pd);
+			prev_delta = compute_energy(p, prev_cpu, cpus, pd);
 			if (prev_delta < base_energy_pd)
 				goto unlock;
 			prev_delta -= base_energy_pd;
@@ -6745,7 +6748,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 		/* Evaluate the energy impact of using max_spare_cap_cpu. */
 		if (max_spare_cap_cpu >= 0) {
-			cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
+			cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
+						   pd);
 			if (cur_delta < base_energy_pd)
 				goto unlock;
 			cur_delta -= base_energy_pd;
-- 
2.25.1



* [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec()
  2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
                   ` (4 preceding siblings ...)
  2022-01-12 16:12 ` [PATCH v2 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() Vincent Donnefort
@ 2022-01-12 16:12 ` Vincent Donnefort
  2022-01-12 19:44     ` kernel test robot
  2022-01-17 13:17   ` Dietmar Eggemann
  2022-01-12 16:12 ` [PATCH v2 7/7] sched/fair: Remove the energy margin " Vincent Donnefort
  6 siblings, 2 replies; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba,
	Vincent Donnefort

The energy estimation in find_energy_efficient_cpu() (feec()) relies on
the computation of the effective utilization for each CPU of a perf domain
(PD). This effective utilization is then used as an estimation of the busy
time for this PD. The function effective_cpu_util(), which gives this
value, scales the utilization relative to the IRQ pressure on the CPU to
take into account that the IRQ time is hidden from the task clock. The IRQ
scaling is as follows:

   effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.

If we now take as an example a task placement which doesn't raise the OPP
on the candidate CPU, we can write the energy delta as:

  delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
                             effective_cpu_util(cpu_util))
        = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util

We end up with an energy delta that depends on the IRQ avg time, which is
a problem: first, the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by a task. Second, we don't want
to favour a CPU with a higher IRQ avg time value.
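
As a purely illustrative example: with cpu_cap = 1024 and irq = 512, the
delta above is scaled by (1024 - 512)/1024 = 0.5, so a CPU spending half
of its time in IRQs would look twice as cheap for the same task, even
though placing the task there doesn't consume any less energy.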

Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.
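
Schematically, the busy time fed into the energy estimation then becomes
(see get_task_busy_time()/get_pd_busy_time() below, pd_cap being the summed
capacity of the PD):

  pd_busy_time   = min(pd_cap, sum of effective_cpu_util(cpu, util without p)
                               over the PD's CPUs)
  task_busy_time = scale_irq_capacity(task_util_est(p), irq, cpu_cap)
                   (irq and cpu_cap taken from prev_cpu)
  busy_time      = min(pd_cap, pd_busy_time + task_busy_time)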

No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfc0d9b3eb19..efb2e0739015 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6548,61 +6548,77 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 }
 
 /*
- * compute_energy(): Estimates the energy that @pd would consume if @p was
- * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
- * landscape of @pd's CPUs after the task migration, and uses the Energy Model
- * to compute what would be the energy if we decided to actually migrate that
- * task.
+ * Compute the task busy time for compute_energy(). This time cannot be
+ * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * The latter only makes sense with the most recent CPUs where the task has
+ * run.
  */
-static long
-compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
-	       struct perf_domain *pd)
+static inline unsigned long
+get_task_busy_time(struct task_struct *p, int prev_cpu)
 {
-	unsigned long max_util = 0, sum_util = 0, cpu_cap;
-	int cpu;
+	unsigned long max_cap = arch_scale_cpu_capacity(prev_cpu);
+	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
 
-	cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
-	cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
+	if (unlikely(irq >= max_cap))
+		return max_cap;
+
+	return scale_irq_capacity(task_util_est(p), irq, max_cap);
+}
+
+/*
+ * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
+ * utilization for each @cpus, it however doesn't take into account clamping
+ * since the ratio (utilization /  cpu_capacity) is already enough to scale
+ * the EM reported power consumption at the (eventually clamped) cpu_capacity.
+ *
+ * The contribution of the task @p for which we want to estimate the energy
+ * cost is removed (by cpu_util_next()) and must be calculated separately (see
+ * get_task_busy_time). This ensures:
+ *
+ *   - A stable PD utilization, no matter which CPU of that PD we want to place
+ *     the task on.
+ *
+ *   - A fair comparison between CPUs as the task contribution (task_util())
+ *     will always be the same no matter which CPU utilization we rely on
+ *     (util_avg or util_est).
+ *
+ * Returns the busy time for the PD that spans @cpus. This busy time can't
+ * exceed @pd_cap.
+ */
+static inline unsigned long
+get_pd_busy_time(struct task_struct *p, struct cpumask *cpus,
+		 unsigned long pd_cap)
+{
+	unsigned long busy_time = 0;
+	int cpu;
 
-	/*
-	 * The capacity state of CPUs of the current rd can be driven by CPUs
-	 * of another rd if they belong to the same pd. So, account for the
-	 * utilization of these CPUs too by masking pd with cpu_online_mask
-	 * instead of the rd span.
-	 *
-	 * If an entire pd is outside of the current rd, it will not appear in
-	 * its pd list and will not be accounted by compute_energy().
-	 */
 	for_each_cpu(cpu, cpus) {
-		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
-		unsigned long cpu_util, util_running = util_freq;
-		struct task_struct *tsk = NULL;
+		unsigned long util = cpu_util_next(cpu, p, -1);
 
-		/*
-		 * When @p is placed on @cpu:
-		 *
-		 * util_running = max(cpu_util, cpu_util_est) +
-		 *		  max(task_util, _task_util_est)
-		 *
-		 * while cpu_util_next is: max(cpu_util + task_util,
-		 *			       cpu_util_est + _task_util_est)
-		 */
-		if (cpu == dst_cpu) {
-			tsk = p;
-			util_running =
-				cpu_util_next(cpu, p, -1) + task_util_est(p);
-		}
+		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+	}
 
-		/*
-		 * Busy time computation: utilization clamping is not
-		 * required since the ratio (sum_util / cpu_capacity)
-		 * is already enough to scale the EM reported power
-		 * consumption at the (eventually clamped) cpu_capacity.
-		 */
-		cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
-					      NULL);
+	return min(pd_cap, busy_time);
+}
+
+/*
+ * Compute the maximum utilization for compute_energy() when the task @p is
+ * placed on the cpu @dst_cpu.
+ *
+ * Returns the maximum utilization among @cpus. This utilization can't exceed
+ * @cpu_cap.
+ */
+static inline unsigned long
+get_pd_max_util(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
+		unsigned long cpu_cap)
+{
+	unsigned long max_util = 0;
+	int cpu;
 
-		sum_util += min(cpu_util, cpu_cap);
+	for_each_cpu(cpu, cpus) {
+		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
+		unsigned long util = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long cpu_util;
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -6611,12 +6627,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
-					      tsk);
-		max_util = max(max_util, min(cpu_util, cpu_cap));
+		cpu_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+		max_util = max(max_util, cpu_util);
 	}
 
-	return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
+	return min(max_util, cpu_cap);
+}
+
+/*
+ * compute_energy(): Use the Energy Model to estimate the energy that @pd would
+ * consume for a given utilization landscape. @max_util can be predicted with
+ * get_pd_max_util(), @busy_time can be predicted with get_task_busy_time() and
+ * get_pd_busy_time().
+ */
+static inline unsigned long
+compute_energy(struct perf_domain *pd, unsigned long max_util,
+	       unsigned long busy_time, unsigned long cpu_cap)
+{
+	return em_cpu_energy(pd->em_pd, max_util, busy_time, cpu_cap);
 }
 
 /*
@@ -6662,9 +6690,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
+	unsigned long busy_time, tsk_busy_time, max_util, pd_cap;
 	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
 	int cpu, best_energy_cpu = prev_cpu, target = -1;
-	unsigned long cpu_cap, util, base_energy = 0;
+	unsigned long cpu_cap, cpu_thermal_cap, util;
+	unsigned long base_energy = 0;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
 
@@ -6689,6 +6719,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	if (!task_util_est(p))
 		goto unlock;
 
+	tsk_busy_time = get_task_busy_time(p, prev_cpu);
+
 	for (; pd; pd = pd->next) {
 		unsigned long cur_delta, spare_cap, max_spare_cap = 0;
 		bool compute_prev_delta = false;
@@ -6697,7 +6729,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
-		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
+		/* Account thermal pressure for the energy estimation */
+		cpu = cpumask_first(cpus);
+		cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
+		cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+
+		for_each_cpu(cpu, cpus) {
+			pd_cap += cpu_thermal_cap;
+
+			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
+				continue;
+
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
@@ -6734,12 +6776,21 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			continue;
 
 		/* Compute the 'base' energy of the pd, without @p */
-		base_energy_pd = compute_energy(p, -1, cpus, pd);
+		busy_time = get_pd_busy_time(p, cpus, pd_cap);
+		max_util = get_pd_max_util(p, -1, cpus, cpu_thermal_cap);
+		base_energy_pd = compute_energy(pd, max_util, busy_time,
+						cpu_thermal_cap);
 		base_energy += base_energy_pd;
 
+		/* Take task into account for the next energy computations */
+		busy_time = min(pd_cap, busy_time + tsk_busy_time);
+
 		/* Evaluate the energy impact of using prev_cpu. */
 		if (compute_prev_delta) {
-			prev_delta = compute_energy(p, prev_cpu, cpus, pd);
+			max_util = get_pd_max_util(p, prev_cpu, cpus,
+						   cpu_thermal_cap);
+			prev_delta = compute_energy(pd, max_util, busy_time,
+						    cpu_thermal_cap);
 			if (prev_delta < base_energy_pd)
 				goto unlock;
 			prev_delta -= base_energy_pd;
@@ -6748,8 +6799,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 		/* Evaluate the energy impact of using max_spare_cap_cpu. */
 		if (max_spare_cap_cpu >= 0) {
-			cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
-						   pd);
+			max_util = get_pd_max_util(p, max_spare_cap_cpu, cpus,
+						   cpu_thermal_cap);
+			cur_delta = compute_energy(pd, max_util, busy_time,
+						   cpu_thermal_cap);
 			if (cur_delta < base_energy_pd)
 				goto unlock;
 			cur_delta -= base_energy_pd;
-- 
2.25.1



* [PATCH v2 7/7] sched/fair: Remove the energy margin in feec()
  2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
                   ` (5 preceding siblings ...)
  2022-01-12 16:12 ` [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
@ 2022-01-12 16:12 ` Vincent Donnefort
  6 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-12 16:12 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba,
	Vincent Donnefort

find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth between CPUs. This margin is set to 6% of the
total current energy estimated on the system. This however does not
work, for two reasons:

1. The energy estimation is not a good absolute value:

The function compute_energy(), used in feec(), is a good estimation for
task placement as it allows comparing the energy with and without a task.
The computed delta gives a good overview of the cost of a certain task
placement. It, however, doesn't work as an absolute estimation of the
total energy of the system. First, it adds the contribution of idle CPUs
into the energy; second, it mixes util_avg with util_est values. util_avg
integrates the recent history of a CPU's usage, it doesn't tell at all
what the current utilization is. A system that has been quite busy in the
near past will hold a very high energy and therefore a high margin,
preventing any task migration to a lower capacity CPU and wasting energy.
It even creates a negative feedback loop: by holding the tasks on a less
efficient CPU, the margin contributes to keeping the energy high.

2. The margin handicaps small tasks:

On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to
create a margin that none of those tasks can cross. E.g. on a Pixel4, a
small utilization of 5% on all the CPUs creates a global estimated energy
of 140 joules, as per the Energy Model declaration of that same device.
This means that, after applying the 6% margin, any migration must save
more than 8 joules to happen. No task with a utilization lower than 40
would then be able to migrate away from the biggest CPU of the system.

The 6% margin on the overall system energy was introduced by the following
patch:
 (eb92692b2544 sched/fair: Speed-up energy-aware wake-ups)

It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:

 (8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary)
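
Concretely, the condition being removed (visible in the diff below) is:

  (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4)

i.e. unless prev_cpu cannot be used, the best candidate is only picked
when it saves more than 1/16th (~6%) of prev_delta plus the accumulated
base energy.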

We could simply revert the margin change to what it was before, but the
original version didn't have strong grounds either and, as demonstrated in
(1.), the estimated energy isn't a good absolute value. Instead, remove it
completely. This is made possible by recent changes that improved energy
estimation comparison fairness (sched/fair: Remove task_util from
effective utilization in feec(), PM: EM: Increase energy calculation
precision) and task utilization stabilization (sched/fair: Decay task PELT
values during migration).

Without a margin, we could have feared tasks bouncing between CPUs. But
running LISA's eas_behaviour test coverage on three different platforms
(Hikey960, RB-5 and DB-845) showed no such issue, and the removal even
fixed previously known failures.

Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efb2e0739015..c8d100decbae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6694,7 +6694,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
 	int cpu, best_energy_cpu = prev_cpu, target = -1;
 	unsigned long cpu_cap, cpu_thermal_cap, util;
-	unsigned long base_energy = 0;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
 
@@ -6724,8 +6723,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	for (; pd; pd = pd->next) {
 		unsigned long cur_delta, spare_cap, max_spare_cap = 0;
 		bool compute_prev_delta = false;
-		unsigned long base_energy_pd;
 		int max_spare_cap_cpu = -1;
+		unsigned long base_energy;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -6778,9 +6777,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		/* Compute the 'base' energy of the pd, without @p */
 		busy_time = get_pd_busy_time(p, cpus, pd_cap);
 		max_util = get_pd_max_util(p, -1, cpus, cpu_thermal_cap);
-		base_energy_pd = compute_energy(pd, max_util, busy_time,
-						cpu_thermal_cap);
-		base_energy += base_energy_pd;
+		base_energy = compute_energy(pd, max_util, busy_time,
+					     cpu_thermal_cap);
 
 		/* Take task into account for the next energy computations */
 		busy_time = min(pd_cap, busy_time + tsk_busy_time);
@@ -6791,9 +6789,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 						   cpu_thermal_cap);
 			prev_delta = compute_energy(pd, max_util, busy_time,
 						    cpu_thermal_cap);
-			if (prev_delta < base_energy_pd)
+			if (prev_delta < base_energy)
 				goto unlock;
-			prev_delta -= base_energy_pd;
+			prev_delta -= base_energy;
 			best_delta = min(best_delta, prev_delta);
 		}
 
@@ -6803,9 +6801,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 						   cpu_thermal_cap);
 			cur_delta = compute_energy(pd, max_util, busy_time,
 						   cpu_thermal_cap);
-			if (cur_delta < base_energy_pd)
+			if (cur_delta < base_energy)
 				goto unlock;
-			cur_delta -= base_energy_pd;
+			cur_delta -= base_energy;
 			if (cur_delta < best_delta) {
 				best_delta = cur_delta;
 				best_energy_cpu = max_spare_cap_cpu;
@@ -6814,12 +6812,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	}
 	rcu_read_unlock();
 
-	/*
-	 * Pick the best CPU if prev_cpu cannot be used, or if it saves at
-	 * least 6% of the energy used by prev_cpu.
-	 */
-	if ((prev_delta == ULONG_MAX) ||
-	    (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
+	if (best_delta < prev_delta)
 		target = best_energy_cpu;
 
 	return target;
-- 
2.25.1



* Re: [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec()
  2022-01-12 16:12 ` [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
@ 2022-01-12 19:44     ` kernel test robot
  2022-01-17 13:17   ` Dietmar Eggemann
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2022-01-12 19:44 UTC (permalink / raw)
  To: Vincent Donnefort; +Cc: llvm, kbuild-all

Hi Vincent,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20220112]
[cannot apply to rafael-pm/linux-next rafael-pm/thermal v5.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Vincent-Donnefort/feec-energy-margin-removal/20220113-002104
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 82762d2af31a60081162890983a83499c9c7dd74
config: hexagon-randconfig-r045-20220112 (https://download.01.org/0day-ci/archive/20220113/202201130354.S8Z1unuB-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 244dd2913a43a200f5a6544d424cdc37b771028b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/ce70047d014b32af0102fca5681c1e8aebc4b7ae
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Vincent-Donnefort/feec-energy-margin-removal/20220113-002104
        git checkout ce70047d014b32af0102fca5681c1e8aebc4b7ae
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/sched/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/sched/fair.c:6738:4: warning: variable 'pd_cap' is uninitialized when used here [-Wuninitialized]
                           pd_cap += cpu_thermal_cap;
                           ^~~~~~
   kernel/sched/fair.c:6693:58: note: initialize the variable 'pd_cap' to silence this warning
           unsigned long busy_time, tsk_busy_time, max_util, pd_cap;
                                                                   ^
                                                                    = 0
   1 warning generated.


vim +/pd_cap +6738 kernel/sched/fair.c

  6649	
  6650	/*
  6651	 * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
  6652	 * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
  6653	 * spare capacity in each performance domain and uses it as a potential
  6654	 * candidate to execute the task. Then, it uses the Energy Model to figure
  6655	 * out which of the CPU candidates is the most energy-efficient.
  6656	 *
  6657	 * The rationale for this heuristic is as follows. In a performance domain,
  6658	 * all the most energy efficient CPU candidates (according to the Energy
  6659	 * Model) are those for which we'll request a low frequency. When there are
  6660	 * several CPUs for which the frequency request will be the same, we don't
  6661	 * have enough data to break the tie between them, because the Energy Model
  6662	 * only includes active power costs. With this model, if we assume that
  6663	 * frequency requests follow utilization (e.g. using schedutil), the CPU with
  6664	 * the maximum spare capacity in a performance domain is guaranteed to be among
  6665	 * the best candidates of the performance domain.
  6666	 *
  6667	 * In practice, it could be preferable from an energy standpoint to pack
  6668	 * small tasks on a CPU in order to let other CPUs go in deeper idle states,
  6669	 * but that could also hurt our chances to go cluster idle, and we have no
  6670	 * ways to tell with the current Energy Model if this is actually a good
  6671	 * idea or not. So, find_energy_efficient_cpu() basically favors
  6672	 * cluster-packing, and spreading inside a cluster. That should at least be
  6673	 * a good thing for latency, and this is consistent with the idea that most
  6674	 * of the energy savings of EAS come from the asymmetry of the system, and
  6675	 * not so much from breaking the tie between identical CPUs. That's also the
  6676	 * reason why EAS is enabled in the topology code only for systems where
  6677	 * SD_ASYM_CPUCAPACITY is set.
  6678	 *
  6679	 * NOTE: Forkees are not accepted in the energy-aware wake-up path because
  6680	 * they don't have any useful utilization data yet and it's not possible to
  6681	 * forecast their impact on energy consumption. Consequently, they will be
  6682	 * placed by find_idlest_cpu() on the least loaded CPU, which might turn out
  6683	 * to be energy-inefficient in some use-cases. The alternative would be to
  6684	 * bias new tasks towards specific types of CPUs first, or to try to infer
  6685	 * their util_avg from the parent task, but those heuristics could hurt
  6686	 * other use-cases too. So, until someone finds a better way to solve this,
  6687	 * let's keep things simple by re-using the existing slow path.
  6688	 */
  6689	static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  6690	{
  6691		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
  6692		unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
  6693		unsigned long busy_time, tsk_busy_time, max_util, pd_cap;
  6694		struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
  6695		int cpu, best_energy_cpu = prev_cpu, target = -1;
  6696		unsigned long cpu_cap, cpu_thermal_cap, util;
  6697		unsigned long base_energy = 0;
  6698		struct sched_domain *sd;
  6699		struct perf_domain *pd;
  6700	
  6701		rcu_read_lock();
  6702		pd = rcu_dereference(rd->pd);
  6703		if (!pd || READ_ONCE(rd->overutilized))
  6704			goto unlock;
  6705	
  6706		/*
  6707		 * Energy-aware wake-up happens on the lowest sched_domain starting
  6708		 * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu.
  6709		 */
  6710		sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity));
  6711		while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
  6712			sd = sd->parent;
  6713		if (!sd)
  6714			goto unlock;
  6715	
  6716		target = prev_cpu;
  6717	
  6718		sync_entity_load_avg(&p->se);
  6719		if (!task_util_est(p))
  6720			goto unlock;
  6721	
  6722		tsk_busy_time = get_task_busy_time(p, prev_cpu);
  6723	
  6724		for (; pd; pd = pd->next) {
  6725			unsigned long cur_delta, spare_cap, max_spare_cap = 0;
  6726			bool compute_prev_delta = false;
  6727			unsigned long base_energy_pd;
  6728			int max_spare_cap_cpu = -1;
  6729	
  6730			cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
  6731	
  6732			/* Account thermal pressure for the energy estimation */
  6733			cpu = cpumask_first(cpus);
  6734			cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
  6735			cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
  6736	
  6737			for_each_cpu(cpu, cpus) {
> 6738				pd_cap += cpu_thermal_cap;
  6739	
  6740				if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
  6741					continue;
  6742	
  6743				if (!cpumask_test_cpu(cpu, p->cpus_ptr))
  6744					continue;
  6745	
  6746				util = cpu_util_next(cpu, p, cpu);
  6747				cpu_cap = capacity_of(cpu);
  6748				spare_cap = cpu_cap;
  6749				lsub_positive(&spare_cap, util);
  6750	
  6751				/*
  6752				 * Skip CPUs that cannot satisfy the capacity request.
  6753				 * IOW, placing the task there would make the CPU
  6754				 * overutilized. Take uclamp into account to see how
  6755				 * much capacity we can get out of the CPU; this is
  6756				 * aligned with sched_cpu_util().
  6757				 */
  6758				util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
  6759				if (!fits_capacity(util, cpu_cap))
  6760					continue;
  6761	
  6762				if (cpu == prev_cpu) {
  6763					/* Always use prev_cpu as a candidate. */
  6764					compute_prev_delta = true;
  6765				} else if (spare_cap > max_spare_cap) {
  6766					/*
  6767					 * Find the CPU with the maximum spare capacity
  6768					 * in the performance domain.
  6769					 */
  6770					max_spare_cap = spare_cap;
  6771					max_spare_cap_cpu = cpu;
  6772				}
  6773			}
  6774	
  6775			if (max_spare_cap_cpu < 0 && !compute_prev_delta)
  6776				continue;
  6777	
  6778			/* Compute the 'base' energy of the pd, without @p */
  6779			busy_time = get_pd_busy_time(p, cpus, pd_cap);
  6780			max_util = get_pd_max_util(p, -1, cpus, cpu_thermal_cap);
  6781			base_energy_pd = compute_energy(pd, max_util, busy_time,
  6782							cpu_thermal_cap);
  6783			base_energy += base_energy_pd;
  6784	
  6785			/* Take task into account for the next energy computations */
  6786			busy_time = min(pd_cap, busy_time + tsk_busy_time);
  6787	
  6788			/* Evaluate the energy impact of using prev_cpu. */
  6789			if (compute_prev_delta) {
  6790				max_util = get_pd_max_util(p, prev_cpu, cpus,
  6791							   cpu_thermal_cap);
  6792				prev_delta = compute_energy(pd, max_util, busy_time,
  6793							    cpu_thermal_cap);
  6794				if (prev_delta < base_energy_pd)
  6795					goto unlock;
  6796				prev_delta -= base_energy_pd;
  6797				best_delta = min(best_delta, prev_delta);
  6798			}
  6799	
  6800			/* Evaluate the energy impact of using max_spare_cap_cpu. */
  6801			if (max_spare_cap_cpu >= 0) {
  6802				max_util = get_pd_max_util(p, max_spare_cap_cpu, cpus,
  6803							   cpu_thermal_cap);
  6804				cur_delta = compute_energy(pd, max_util, busy_time,
  6805							   cpu_thermal_cap);
  6806				if (cur_delta < base_energy_pd)
  6807					goto unlock;
  6808				cur_delta -= base_energy_pd;
  6809				if (cur_delta < best_delta) {
  6810					best_delta = cur_delta;
  6811					best_energy_cpu = max_spare_cap_cpu;
  6812				}
  6813			}
  6814		}
  6815		rcu_read_unlock();
  6816	
  6817		/*
  6818		 * Pick the best CPU if prev_cpu cannot be used, or if it saves at
  6819		 * least 6% of the energy used by prev_cpu.
  6820		 */
  6821		if ((prev_delta == ULONG_MAX) ||
  6822		    (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
  6823			target = best_energy_cpu;
  6824	
  6825		return target;
  6826	
  6827	unlock:
  6828		rcu_read_unlock();
  6829	
  6830		return target;
  6831	}
  6832	
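
For what it's worth, the warning looks legitimate: pd_cap is accumulated with
"+=" in the inner loop but never initialized. Since the accumulator is also
consumed per performance domain, a plausible fix (a sketch, not taken from the
posted series) would be to reset it at the top of each pd iteration rather than
only at its declaration:

        for (; pd; pd = pd->next) {
                unsigned long cur_delta, spare_cap, max_spare_cap = 0;
                bool compute_prev_delta = false;
                unsigned long base_energy_pd;
                int max_spare_cap_cpu = -1;

                /* Reset the capacity accumulator for this performance domain */
                pd_cap = 0;

                cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
                /* ... rest of the loop body unchanged ... */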

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec()
  2022-01-12 16:12 ` [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
  2022-01-12 19:44     ` kernel test robot
@ 2022-01-17 13:17   ` Dietmar Eggemann
  2022-01-18  9:46     ` Vincent Donnefort
  1 sibling, 1 reply; 22+ messages in thread
From: Dietmar Eggemann @ 2022-01-17 13:17 UTC (permalink / raw)
  To: Vincent Donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, valentin.schneider, morten.rasmussen,
	chris.redpath, qperret, lukasz.luba

On 12/01/2022 17:12, Vincent Donnefort wrote:

[...]

> +static inline unsigned long
> +get_pd_busy_time(struct task_struct *p, struct cpumask *cpus,
> +		 unsigned long pd_cap)
> +{
> +	unsigned long busy_time = 0;
> +	int cpu;
>  
> -	/*
> -	 * The capacity state of CPUs of the current rd can be driven by CPUs
> -	 * of another rd if they belong to the same pd. So, account for the
> -	 * utilization of these CPUs too by masking pd with cpu_online_mask
> -	 * instead of the rd span.
> -	 *
> -	 * If an entire pd is outside of the current rd, it will not appear in
> -	 * its pd list and will not be accounted by compute_energy().
> -	 */
>  	for_each_cpu(cpu, cpus) {
> -		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> -		unsigned long cpu_util, util_running = util_freq;
> -		struct task_struct *tsk = NULL;
> +		unsigned long util = cpu_util_next(cpu, p, -1);
>  
> -		/*
> -		 * When @p is placed on @cpu:
> -		 *
> -		 * util_running = max(cpu_util, cpu_util_est) +
> -		 *		  max(task_util, _task_util_est)
> -		 *
> -		 * while cpu_util_next is: max(cpu_util + task_util,
> -		 *			       cpu_util_est + _task_util_est)
> -		 */
> -		if (cpu == dst_cpu) {
> -			tsk = p;
> -			util_running =
> -				cpu_util_next(cpu, p, -1) + task_util_est(p);
> -		}
> +		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
> +	}
>  
> -		/*
> -		 * Busy time computation: utilization clamping is not
> -		 * required since the ratio (sum_util / cpu_capacity)
> -		 * is already enough to scale the EM reported power
> -		 * consumption at the (eventually clamped) cpu_capacity.
> -		 */
> -		cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
> -					      NULL);
> +	return min(pd_cap, busy_time);

You're capping the busy_time (sum of effective_cpu_util() of CPUs in
cpus) by pd capacity (cpumask_weight(cpus) * cpu_thermal_cap).

Before, each effective_cpu_util() was capped by cpu_thermal_cap
individually: sum_util += min(effective_cpu_util(), cpu_thermal_cap)

Why did you change that? Because of the way you calculate busy time with
the task: busy_time = min(pd_cap, busy_time + tsk_busy_time) ?

[...]

> @@ -6662,9 +6690,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>  	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> +	unsigned long busy_time, tsk_busy_time, max_util, pd_cap;
>  	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
>  	int cpu, best_energy_cpu = prev_cpu, target = -1;
> -	unsigned long cpu_cap, util, base_energy = 0;
> +	unsigned long cpu_cap, cpu_thermal_cap, util;
> +	unsigned long base_energy = 0;
>  	struct sched_domain *sd;
>  	struct perf_domain *pd;
>  
> @@ -6689,6 +6719,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	if (!task_util_est(p))
>  		goto unlock;
>  
> +	tsk_busy_time = get_task_busy_time(p, prev_cpu);
> +
>  	for (; pd; pd = pd->next) {
>  		unsigned long cur_delta, spare_cap, max_spare_cap = 0;
>  		bool compute_prev_delta = false;
> @@ -6697,7 +6729,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  
>  		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>  
> -		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> +		/* Account thermal pressure for the energy estimation */
> +		cpu = cpumask_first(cpus);
> +		cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
> +		cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
> +
> +		for_each_cpu(cpu, cpus) {
> +			pd_cap += cpu_thermal_cap;
> +
> +			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
> +				continue;
> +
>  			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>  				continue;
>  
> @@ -6734,12 +6776,21 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			continue;
>  
>  		/* Compute the 'base' energy of the pd, without @p */
> -		base_energy_pd = compute_energy(p, -1, cpus, pd);
> +		busy_time = get_pd_busy_time(p, cpus, pd_cap);
> +		max_util = get_pd_max_util(p, -1, cpus, cpu_thermal_cap);

There is the issue now that we would iterate twice over `cpus`
here. To avoid this, the only solution I can see is to introduce a

    struct eas_env {
           unsigned long max_util;      (1)
           unsigned long busy_time;     (2)
           unsigned long busy_tsk_time; (3)
           ...
    }

replace get_pd_busy_time() and get_pd_max_util() with

    get_energy_params(struct eas_env *env, ...)

and make sure that (1)-(3) are calculated and returned here, whereas only
(1) is needed later for `if (compute_prev_delta)` and `if (max_spare_cap_cpu >=
0)`, e.g. by passing this switch with the env.
This would allow keeping pd_cap within get_energy_params(). W/o struct
eas_env, IMHO this function ends up with too many parameters.
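
Purely to illustrate the shape (every name below is hypothetical, only
following the sketch above), the two walks could then collapse into something
like:

    struct eas_env env = { .cpu_thermal_cap = cpu_thermal_cap };

    /* One walk over 'cpus' fills (1) max_util, (2) busy_time,
     * (3) busy_tsk_time plus pd_cap for the 'base' (dst_cpu == -1) case. */
    get_energy_params(&env, p, cpus, -1, true);
    base_energy_pd = compute_energy(pd, env.max_util, env.busy_time,
                                    cpu_thermal_cap);

    /* Later evaluations would only recompute (1) for a given dst_cpu. */
    get_energy_params(&env, p, cpus, prev_cpu, false);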

That said, I haven't seen asymmetric CPU capacity processors with more
than 6 CPUs in one PD (i.e. Frequency Domain).

[...]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper
  2022-01-12 16:12 ` [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
@ 2022-01-17 16:11   ` Tao Zhou
  2022-01-17 19:42     ` Vincent Donnefort
  0 siblings, 1 reply; 22+ messages in thread
From: Tao Zhou @ 2022-01-17 16:11 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, linux-kernel, dietmar.eggemann,
	valentin.schneider, morten.rasmussen, chris.redpath, qperret,
	lukasz.luba, Tao Zhou

Hi,

On Wed, Jan 12, 2022 at 04:12:24PM +0000, Vincent Donnefort wrote:

> Introducing macro helpers u64_u32_{store,load}() to factorize lockless
> accesses to u64 variables for 32-bits architectures.
> 
> Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
> accommodate the latter, where the copy lies outside of the structure
> (cfs_rq.last_update_time_copy instead of sched_avg.last_update_time_copy),
> use the _copy() version of those helpers.
> 
> Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
> therefore have a small penalty in set_task_rq_fair() and init_cfs_rq().
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095b0aa378df..99ea9540ece4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -568,11 +568,8 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  	}
>  
>  	/* ensure we never gain time by being placed backwards. */
> -	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
> -#ifndef CONFIG_64BIT
> -	smp_wmb();
> -	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
> -#endif
> +	u64_u32_store(cfs_rq->min_vruntime,
> +		      max_vruntime(cfs_rq->min_vruntime, vruntime));
>  }
>  
>  static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
> @@ -3246,6 +3243,11 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
>  }
>  
>  #ifdef CONFIG_SMP
> +static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> +{
> +	return u64_u32_load_copy(cfs_rq->avg.last_update_time,
> +				 cfs_rq->last_update_time_copy);
> +}
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
>   * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
> @@ -3356,27 +3358,9 @@ void set_task_rq_fair(struct sched_entity *se,
>  	if (!(se->avg.last_update_time && prev))
>  		return;
>  
> -#ifndef CONFIG_64BIT
> -	{
> -		u64 p_last_update_time_copy;
> -		u64 n_last_update_time_copy;
> -
> -		do {
> -			p_last_update_time_copy = prev->load_last_update_time_copy;
> -			n_last_update_time_copy = next->load_last_update_time_copy;
> -
> -			smp_rmb();
> +	p_last_update_time = cfs_rq_last_update_time(prev);
> +	n_last_update_time = cfs_rq_last_update_time(next);
>  
> -			p_last_update_time = prev->avg.last_update_time;
> -			n_last_update_time = next->avg.last_update_time;
> -
> -		} while (p_last_update_time != p_last_update_time_copy ||
> -			 n_last_update_time != n_last_update_time_copy);
> -	}
> -#else
> -	p_last_update_time = prev->avg.last_update_time;
> -	n_last_update_time = next->avg.last_update_time;
> -#endif
>  	__update_load_avg_blocked_se(p_last_update_time, se);
>  	se->avg.last_update_time = n_last_update_time;
>  }
> @@ -3700,8 +3684,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
>  
>  #ifndef CONFIG_64BIT
> -	smp_wmb();
> -	cfs_rq->load_last_update_time_copy = sa->last_update_time;
> +	u64_u32_store_copy(sa->last_update_time,
> +			   cfs_rq->last_update_time_copy,
> +			   sa->last_update_time);
>  #endif
>  
>  	return decayed;
> @@ -3834,27 +3819,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	}
>  }
>  
> -#ifndef CONFIG_64BIT
> -static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> -{
> -	u64 last_update_time_copy;
> -	u64 last_update_time;
> -
> -	do {
> -		last_update_time_copy = cfs_rq->load_last_update_time_copy;
> -		smp_rmb();
> -		last_update_time = cfs_rq->avg.last_update_time;
> -	} while (last_update_time != last_update_time_copy);
> -
> -	return last_update_time;
> -}
> -#else
> -static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> -{
> -	return cfs_rq->avg.last_update_time;
> -}
> -#endif
> -
>  /*
>   * Synchronize entity load avg of dequeued entity without locking
>   * the previous rq.
> @@ -6904,21 +6868,8 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  	if (READ_ONCE(p->__state) == TASK_WAKING) {
>  		struct sched_entity *se = &p->se;
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -		u64 min_vruntime;
>  
> -#ifndef CONFIG_64BIT
> -		u64 min_vruntime_copy;
> -
> -		do {
> -			min_vruntime_copy = cfs_rq->min_vruntime_copy;
> -			smp_rmb();
> -			min_vruntime = cfs_rq->min_vruntime;
> -		} while (min_vruntime != min_vruntime_copy);
> -#else
> -		min_vruntime = cfs_rq->min_vruntime;
> -#endif
> -
> -		se->vruntime -= min_vruntime;
> +		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
>  	}
>  
>  	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
> @@ -11362,10 +11313,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>  void init_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> -	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
> -#ifndef CONFIG_64BIT
> -	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
> -#endif
> +	u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
>  #ifdef CONFIG_SMP
>  	raw_spin_lock_init(&cfs_rq->removed.lock);
>  #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de53be905739..f1a445efdc63 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -528,6 +528,45 @@ struct cfs_bandwidth { };
>  
>  #endif	/* CONFIG_CGROUP_SCHED */
>  
> +/*
> + * u64_u32_load/u64_u32_store
> + *
> + * Use a copy of a u64 value to protect against data race. This is only
> + * applicable for 32-bits architectures.
> + */
> +#ifdef CONFIG_64BIT
> +# define u64_u32_load_copy(var, copy)       var
> +# define u64_u32_store_copy(var, copy, val) (var = val)
> +#else
> +# define u64_u32_load_copy(var, copy)					\
> +({									\
> +	u64 __val, __val_copy;						\
> +	do {								\
> +		__val_copy = copy;					\
> +		/*							\
> +		 * paired with u64_u32_store, ordering access		\
> +		 * to var and copy.					\
> +		 */							\
> +		smp_rmb();						\
> +		__val = var;						\
> +	} while (__val != __val_copy);					\
> +	__val;								\
> +})
> +# define u64_u32_store_copy(var, copy, val)				\
> +do {									\
> +	typeof(val) __val = (val);					\
> +	var = __val;							\
> +	/*								\
> +	 * paired with u64_u32_load, ordering access to var and		\
> +	 * copy.							\
> +	 */								\
> +	smp_wmb();							\
> +	copy = __val;							\
> +} while (0)

This code has been sitting with me for some time; just from my crude review:
the above macro needs a variable to load @var into temporarily for the
later store; that means the @copy value comes from @var, not @val.

  # define u64_u32_store_copy(var, copy, val)				\
  do {									\
    typeof(val) __val = (val), __var = (var);					\
    var = __val;							\
    /*								\
     * paired with u64_u32_load, ordering access to var and		\
     * copy.							\
     */								\
    smp_wmb();							\
    copy = __var;							\
  } while (0)



Thanks,
Tao
> +#endif
> +# define u64_u32_load(var)      u64_u32_load_copy(var, var##_copy)
> +# define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val)
> +
>  /* CFS-related fields in a runqueue */
>  struct cfs_rq {
>  	struct load_weight	load;
> @@ -568,7 +607,7 @@ struct cfs_rq {
>  	 */
>  	struct sched_avg	avg;
>  #ifndef CONFIG_64BIT
> -	u64			load_last_update_time_copy;
> +	u64			last_update_time_copy;
>  #endif
>  	struct {
>  		raw_spinlock_t	lock ____cacheline_aligned;
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-12 16:12 ` [PATCH v2 2/7] sched/fair: Decay task PELT values during migration Vincent Donnefort
@ 2022-01-17 17:31   ` Vincent Guittot
  2022-01-18 10:56     ` Vincent Donnefort
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2022-01-17 17:31 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, Chris.Redpath, qperret,
	Lukasz.Luba

On Wed, 12 Jan 2022 at 17:14, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> Before being migrated to a new CPU, a task sees its PELT values
> synchronized with rq last_update_time. Once done, that same task will also
> have its sched_avg last_update_time reset. This means the time between
> the migration and the last clock update (B) will not be accounted for in
> util_avg and a discontinuity will appear. This issue is amplified by the
> PELT clock scaling. If the clock hasn't been updated while the CPU is
> idle, clock_pelt will not be aligned with clock_task and that time (A)
> will be also lost.
>
>    ---------|----- A -----|-----------|------- B -----|>
>         clock_pelt   clock_task     clock            now
>
> This is especially problematic for asymmetric CPU capacity systems which
> need stable util_avg signals for task placement and energy estimation.
>
> Ideally, this problem would be solved by updating the runqueue clocks
> before the migration. But that would require taking the runqueue lock
> which is quite expensive [1]. Instead estimate the missing time and update
> the task util_avg with that value:
>
>   A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>
> Neither clock_task, clock_pelt nor clock can be accessed without the
> runqueue lock. The new runqueue clock_pelt_lag is therefore created and
> encodes those three values.
>
>   clock_pelt_lag = clock - clock_task + clock_pelt
>
> And we can then write the missing time as follows:
>
>   A + B = sched_clock_cpu() - clock_pelt_lag
>
> The B. part of the missing time is however an estimation that doesn't take
> into account IRQ and Paravirt time.
>
> Now we have an estimation for A + B, we can create an estimator for the
> PELT value at the time of the migration. We need for this purpose to
> inject last_update_time which is a combination of both clock_pelt and
> lost_idle_time. The latter is a time value which is completely lost from a
> PELT point of view and must be ignored. And finally, we can write:
>
>   rq_clock_pelt_estimator() = last_update_time + A + B
>                             = last_update_time +
>                                    sched_clock_cpu() - clock_pelt_lag
>
> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
>
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 06cf7620839a..11c6aeef4583 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -618,6 +618,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
>         }
>  }
>
> +static void update_rq_clock_pelt_lag(struct rq *rq)
> +{
> +       u64_u32_store(rq->clock_pelt_lag,
> +                     rq->clock - rq->clock_task + rq->clock_pelt);

This has several shortfalls:
- have a look at cfs_rq_clock_pelt() and rq_clock_pelt(). What you
name clock_pelt in your commit message and is used to update PELT and
saved in se->avg.last_update_time is : rq->clock_pelt -
rq->lost_idle_time - cfs_rq->throttled_clock_task_time
- you are doing this whatever the state of the cpu : idle or not. But
the clock cycles are not accounted for in the same way in both cases.
- (B) doesn't seem to be accurate as you skip irq and steal time
accounting and you don't apply any scale invariance if the cpu is not
idle
- IIUC your explanation in the commit message above, the (A) period
seems to be a problem only when idle but you apply it unconditionally.
If cpu is idle you can assume that clock_pelt should be equal to
clock_task, but you can't if cpu is not idle, otherwise your sync will
be inaccurate and defeat the primary goal of this patch. If your
problem with clock_pelt is that the pending idle time is not accounted
for when entering idle but only at the next update (update of blocked
load or wakeup of a thread), the patch below should fix this and
remove your A.


diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index e06071bf3472..855877be4dd8 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -114,6 +114,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 {
        u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT)
- LOAD_AVG_MAX;
        u32 util_sum = rq->cfs.avg.util_sum;
+       u64 now = rq_clock_task(rq);
        util_sum += rq->avg_rt.util_sum;
        util_sum += rq->avg_dl.util_sum;

@@ -127,7 +128,10 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
         * rq's clock_task.
         */
        if (util_sum >= divider)
-               rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+               rq->lost_idle_time += now - rq->clock_pelt;
+
+       /* The rq is idle, we can sync to clock_task */
+       rq->clock_pelt  = now;
 }

 static inline u64 rq_clock_pelt(struct rq *rq)

---


> +}
> +
>  /*
>   * RQ-clock updating methods:
>   */
> @@ -674,6 +680,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>                 update_irq_load_avg(rq, irq_delta + steal);
>  #endif
>         update_rq_clock_pelt(rq, delta);
> +       update_rq_clock_pelt_lag(rq);
>  }
>
>  void update_rq_clock(struct rq *rq)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99ea9540ece4..046d5397eb8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6852,6 +6852,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>
>  static void detach_entity_cfs_rq(struct sched_entity *se);
>
> +static u64 rq_clock_pelt_estimator(struct rq *rq, u64 last_update_time)
> +{
> +       u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
> +                      u64_u32_load(rq->clock_pelt_lag);

Have you evaluated the impact of calling sched_clock_cpu(cpu_of(rq))
for a remote cpu? Especially with a huge number of migrations and
concurrent accesses from several cpus.

> +
> +       return last_update_time + pelt_lag;
> +}
> +
>  /*
>   * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
>   * cfs_rq_of(p) references at time of call are still valid and identify the
> @@ -6859,6 +6867,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
>   */
>  static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  {
> +       struct sched_entity *se = &p->se;
> +       struct rq *rq = task_rq(p);
> +
>         /*
>          * As blocked tasks retain absolute vruntime the migration needs to
>          * deal with this by subtracting the old and adding the new
> @@ -6866,7 +6877,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>          * the task on the new runqueue.
>          */
>         if (READ_ONCE(p->__state) == TASK_WAKING) {
> -               struct sched_entity *se = &p->se;
>                 struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
>                 se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> @@ -6877,26 +6887,32 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>                  * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
>                  * rq->lock and can modify state directly.
>                  */
> -               lockdep_assert_rq_held(task_rq(p));
> -               detach_entity_cfs_rq(&p->se);
> +               lockdep_assert_rq_held(rq);
> +               detach_entity_cfs_rq(se);
>
>         } else {
> +               u64 now;
> +
> +               remove_entity_load_avg(se);
> +
>                 /*
> -                * We are supposed to update the task to "current" time, then
> -                * its up to date and ready to go to new CPU/cfs_rq. But we
> -                * have difficulty in getting what current time is, so simply
> -                * throw away the out-of-date time. This will result in the
> -                * wakee task is less decayed, but giving the wakee more load
> -                * sounds not bad.
> +                * Here, the task's PELT values have been updated according to
> +                * the current rq's clock. But if that clock hasn't been
> +                * updated in a while, a substantial idle time will be missed,
> +                * leading to an inflation after wake-up on the new rq.
> +                *
> +                * Estimate the PELT clock lag, and update sched_avg to ensure
> +                * PELT continuity after migration.
>                  */
> -               remove_entity_load_avg(&p->se);
> +               now = rq_clock_pelt_estimator(rq, se->avg.last_update_time);
> +               __update_load_avg_blocked_se(now, se);
>         }
>
>         /* Tell new CPU we are migrated */
> -       p->se.avg.last_update_time = 0;
> +       se->avg.last_update_time = 0;
>
>         /* We have migrated, no longer consider this task hot */
> -       p->se.exec_start = 0;
> +       se->exec_start = 0;
>
>         update_scan_period(p, new_cpu);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f1a445efdc63..fdf2a9e54c0e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1027,8 +1027,13 @@ struct rq {
>         /* Ensure that all clocks are in the same cache line */
>         u64                     clock_task ____cacheline_aligned;
>         u64                     clock_pelt;
> +       u64                     clock_pelt_lag;
>         unsigned long           lost_idle_time;
>
> +#ifndef CONFIG_64BIT
> +       u64                     clock_pelt_lag_copy;
> +#endif
> +
>         atomic_t                nr_iowait;
>
>  #ifdef CONFIG_SCHED_DEBUG
> --
> 2.25.1
>

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper
  2022-01-17 16:11   ` Tao Zhou
@ 2022-01-17 19:42     ` Vincent Donnefort
  2022-01-17 23:44       ` Tao Zhou
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-17 19:42 UTC (permalink / raw)
  To: Tao Zhou
  Cc: peterz, mingo, vincent.guittot, linux-kernel, dietmar.eggemann,
	valentin.schneider, morten.rasmussen, chris.redpath, qperret,
	lukasz.luba

[...]

> > +/*
> > + * u64_u32_load/u64_u32_store
> > + *
> > + * Use a copy of a u64 value to protect against data race. This is only
> > + * applicable for 32-bits architectures.
> > + */
> > +#ifdef CONFIG_64BIT
> > +# define u64_u32_load_copy(var, copy)       var
> > +# define u64_u32_store_copy(var, copy, val) (var = val)
> > +#else
> > +# define u64_u32_load_copy(var, copy)					\
> > +({									\
> > +	u64 __val, __val_copy;						\
> > +	do {								\
> > +		__val_copy = copy;					\
> > +		/*							\
> > +		 * paired with u64_u32_store, ordering access		\
> > +		 * to var and copy.					\
> > +		 */							\
> > +		smp_rmb();						\
> > +		__val = var;						\
> > +	} while (__val != __val_copy);					\
> > +	__val;								\
> > +})
> > +# define u64_u32_store_copy(var, copy, val)				\
> > +do {									\
> > +	typeof(val) __val = (val);					\
> > +	var = __val;							\
> > +	/*								\
> > +	 * paired with u64_u32_load, ordering access to var and		\
> > +	 * copy.							\
> > +	 */								\
> > +	smp_wmb();							\
> > +	copy = __val;							\
> > +} while (0)
> 
> This code has been sitting with me for some time; just from my crude review:
> the above macro needs a variable to load @var into temporarily for the
> later store; that means the @copy value comes from @var, not @val.
> 
>   # define u64_u32_store_copy(var, copy, val)				\
>   do {									\
>     typeof(val) __val = (val), __var = (var);					\
>     var = __val;							\
>     /*								\
>      * paired with u64_u32_load, ordering access to var and		\
>      * copy.							\
>      */								\
>     smp_wmb();							\
>     copy = __var;							\
>   } while (0)


Hi Tao,

__var would then contain the previous value of @var, wouldn't it? We need
both @var and @copy to be equal to @val.
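
For what it's worth, a standalone userspace analogue of the protocol
(illustration only; C11 fences stand in for smp_wmb()/smp_rmb() and the types
are simplified) behaves the same way: the writer publishes the copy after the
value, and the reader retries until both match.

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  static _Atomic uint64_t var, var_copy;

  static void store64(uint64_t val)
  {
          atomic_store_explicit(&var, val, memory_order_relaxed);
          atomic_thread_fence(memory_order_release);  /* plays the role of smp_wmb() */
          atomic_store_explicit(&var_copy, val, memory_order_relaxed);
  }

  static uint64_t load64(void)
  {
          uint64_t v, c;

          do {
                  c = atomic_load_explicit(&var_copy, memory_order_relaxed);
                  atomic_thread_fence(memory_order_acquire);  /* plays the role of smp_rmb() */
                  v = atomic_load_explicit(&var, memory_order_relaxed);
          } while (v != c);

          return v;
  }

  int main(void)
  {
          store64(42);
          printf("%llu\n", (unsigned long long)load64());
          return 0;
  }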

> 
> 
> 
> Thanks,
> Tao


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper
  2022-01-17 19:42     ` Vincent Donnefort
@ 2022-01-17 23:44       ` Tao Zhou
  0 siblings, 0 replies; 22+ messages in thread
From: Tao Zhou @ 2022-01-17 23:44 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, linux-kernel, dietmar.eggemann,
	valentin.schneider, morten.rasmussen, chris.redpath, qperret,
	lukasz.luba, Tao Zhou

On Mon, Jan 17, 2022 at 07:42:04PM +0000, Vincent Donnefort wrote:
> [...]
> 
> > > +/*
> > > + * u64_u32_load/u64_u32_store
> > > + *
> > > + * Use a copy of a u64 value to protect against data race. This is only
> > > + * applicable for 32-bits architectures.
> > > + */
> > > +#ifdef CONFIG_64BIT
> > > +# define u64_u32_load_copy(var, copy)       var
> > > +# define u64_u32_store_copy(var, copy, val) (var = val)
> > > +#else
> > > +# define u64_u32_load_copy(var, copy)					\
> > > +({									\
> > > +	u64 __val, __val_copy;						\
> > > +	do {								\
> > > +		__val_copy = copy;					\
> > > +		/*							\
> > > +		 * paired with u64_u32_store, ordering access		\
> > > +		 * to var and copy.					\
> > > +		 */							\
> > > +		smp_rmb();						\
> > > +		__val = var;						\
> > > +	} while (__val != __val_copy);					\
> > > +	__val;								\
> > > +})
> > > +# define u64_u32_store_copy(var, copy, val)				\
> > > +do {									\
> > > +	typeof(val) __val = (val);					\
> > > +	var = __val;							\
> > > +	/*								\
> > > +	 * paired with u64_u32_load, ordering access to var and		\
> > > +	 * copy.							\
> > > +	 */								\
> > > +	smp_wmb();							\
> > > +	copy = __val;							\
> > > +} while (0)
> > 
> > This code has been sitting with me for some time; just from my crude review:
> > the above macro needs a variable to load @var into temporarily for the
> > later store; that means the @copy value comes from @var, not @val.
> > 
> >   # define u64_u32_store_copy(var, copy, val)				\
> >   do {									\
> >     typeof(val) __val = (val), __var = (var);					\
> >     var = __val;							\
> >     /*								\
> >      * paired with u64_u32_load, ordering access to var and		\
> >      * copy.							\
> >      */								\
> >     smp_wmb();							\
> >     copy = __var;							\
> >   } while (0)
> 
> 
> Hi Tao,
> 
> __var would then contain the previous value of @var, wouldn't it? We need
> both @var and @copy to be equal to @val.

Sorry for the noise, I'm wrong here.

> 
> > 
> > 
> > 
> > Thanks,
> > Tao

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec()
  2022-01-17 13:17   ` Dietmar Eggemann
@ 2022-01-18  9:46     ` Vincent Donnefort
  0 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-18  9:46 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, vincent.guittot, linux-kernel, valentin.schneider,
	morten.rasmussen, chris.redpath, qperret, lukasz.luba

On Mon, Jan 17, 2022 at 02:17:55PM +0100, Dietmar Eggemann wrote:
> On 12/01/2022 17:12, Vincent Donnefort wrote:
> 
> [...]
> 
> > +static inline unsigned long
> > +get_pd_busy_time(struct task_struct *p, struct cpumask *cpus,
> > +		 unsigned long pd_cap)
> > +{
> > +	unsigned long busy_time = 0;
> > +	int cpu;
> >  
> > -	/*
> > -	 * The capacity state of CPUs of the current rd can be driven by CPUs
> > -	 * of another rd if they belong to the same pd. So, account for the
> > -	 * utilization of these CPUs too by masking pd with cpu_online_mask
> > -	 * instead of the rd span.
> > -	 *
> > -	 * If an entire pd is outside of the current rd, it will not appear in
> > -	 * its pd list and will not be accounted by compute_energy().
> > -	 */
> >  	for_each_cpu(cpu, cpus) {
> > -		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> > -		unsigned long cpu_util, util_running = util_freq;
> > -		struct task_struct *tsk = NULL;
> > +		unsigned long util = cpu_util_next(cpu, p, -1);
> >  
> > -		/*
> > -		 * When @p is placed on @cpu:
> > -		 *
> > -		 * util_running = max(cpu_util, cpu_util_est) +
> > -		 *		  max(task_util, _task_util_est)
> > -		 *
> > -		 * while cpu_util_next is: max(cpu_util + task_util,
> > -		 *			       cpu_util_est + _task_util_est)
> > -		 */
> > -		if (cpu == dst_cpu) {
> > -			tsk = p;
> > -			util_running =
> > -				cpu_util_next(cpu, p, -1) + task_util_est(p);
> > -		}
> > +		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
> > +	}
> >  
> > -		/*
> > -		 * Busy time computation: utilization clamping is not
> > -		 * required since the ratio (sum_util / cpu_capacity)
> > -		 * is already enough to scale the EM reported power
> > -		 * consumption at the (eventually clamped) cpu_capacity.
> > -		 */
> > -		cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
> > -					      NULL);
> > +	return min(pd_cap, busy_time);
> 
> You're capping the busy_time (sum of effective_cpu_util() of CPUs in
> cpus) by pd capacity (cpumask_weight(cpus) * cpu_thermal_cap).
> 
> Before, each effective_cpu_util() was capped by cpu_thermal_cap
> individually: sum_util += min(effective_cpu_util(), cpu_thermal_cap)
> 
> Why did you change that? Because of the way you calculate busy time with
> the task: busy_time = min(pd_cap, busy_time + tsk_busy_time) ?

It avoids having to cap each CPU separately and also aligns with task_busy_time.
I guess we could argue this isn't the most accurate solution but without taking
this shortcut, we'd have to walk through all the CPUs again for the busy time
computation when testing the task placement. :/

But reading it again now makes me feel I might not have taken the right decision:
we'd prefer not to fall into the case where the utilization of a single CPU is
too high but the global PD's is not.
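
To put made-up numbers on that concern (a toy example, not a measurement): with
one CPU above its thermally-capped capacity and its sibling nearly idle,
per-CPU capping and PD-level capping give different busy times:

  #include <stdio.h>

  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  int main(void)
  {
          unsigned long cpu_thermal_cap = 1024;
          unsigned long util[2] = { 1300, 100 };  /* made-up per-CPU utilization */
          unsigned long pd_cap = 2 * cpu_thermal_cap;
          unsigned long per_cpu_capped = 0, pd_capped = 0;
          int i;

          for (i = 0; i < 2; i++) {
                  per_cpu_capped += MIN(util[i], cpu_thermal_cap);  /* old scheme */
                  pd_capped += util[i];
          }
          pd_capped = MIN(pd_capped, pd_cap);                       /* new scheme */

          printf("per-CPU capping: %lu, PD-level capping: %lu\n",
                 per_cpu_capped, pd_capped);  /* prints 1124 vs 1400 */
          return 0;
  }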

> 
> [...]
> 
> > @@ -6662,9 +6690,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >  {
> >  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> >  	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> > +	unsigned long busy_time, tsk_busy_time, max_util, pd_cap;
> >  	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
> >  	int cpu, best_energy_cpu = prev_cpu, target = -1;
> > -	unsigned long cpu_cap, util, base_energy = 0;
> > +	unsigned long cpu_cap, cpu_thermal_cap, util;
> > +	unsigned long base_energy = 0;
> >  	struct sched_domain *sd;
> >  	struct perf_domain *pd;
> >  
> > @@ -6689,6 +6719,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >  	if (!task_util_est(p))
> >  		goto unlock;
> >  
> > +	tsk_busy_time = get_task_busy_time(p, prev_cpu);
> > +
> >  	for (; pd; pd = pd->next) {
> >  		unsigned long cur_delta, spare_cap, max_spare_cap = 0;
> >  		bool compute_prev_delta = false;
> > @@ -6697,7 +6729,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >  
> >  		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> >  
> > -		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > +		/* Account thermal pressure for the energy estimation */
> > +		cpu = cpumask_first(cpus);
> > +		cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
> > +		cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
> > +
> > +		for_each_cpu(cpu, cpus) {
> > +			pd_cap += cpu_thermal_cap;
> > +
> > +			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
> > +				continue;
> > +
> >  			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> >  				continue;
> >  
> > @@ -6734,12 +6776,21 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >  			continue;
> >  
> >  		/* Compute the 'base' energy of the pd, without @p */
> > -		base_energy_pd = compute_energy(p, -1, cpus, pd);
> > +		busy_time = get_pd_busy_time(p, cpus, pd_cap);
> > +		max_util = get_pd_max_util(p, -1, cpus, cpu_thermal_cap);
> 
> There is the issue now that we would iterate twice over `cpus`
> here. To avoid this, the only solution I can see is to introduce a
> 
>     struct eas_env {
>            unsigned long max_util;      (1)
>            unsigned long busy_time;     (2)
>            unsigned long busy_tsk_time; (3)
>            ...
>     }
> 
> replace get_pd_busy_time() and get_pd_max_util() with
> 
>     get_energy_params(struct eas_env *env, ...)

That'd be cleaner, yeah. I'll give it a try for the next version.

Thanks.

> 
> and make sure that (1)-(3) are calculated and returned here, whereas only
> (1) is needed later for `if (compute_prev_delta)` and `if (max_spare_cap_cpu >=
> 0)`, e.g. by passing this switch with the env.
> This would allow keeping pd_cap within get_energy_params(). W/o struct
> eas_env, IMHO this function ends up with too many parameters.
> 
> That said, I haven't seen asymmetric CPU capacity processors with more
> than 6 CPUs in one PD (i.e. Frequency Domain).
> 
> [...]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-17 17:31   ` Vincent Guittot
@ 2022-01-18 10:56     ` Vincent Donnefort
  2022-01-19  9:54       ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-18 10:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, Chris.Redpath, qperret,
	Lukasz.Luba

On Mon, Jan 17, 2022 at 06:31:25PM +0100, Vincent Guittot wrote:
> On Wed, 12 Jan 2022 at 17:14, Vincent Donnefort
> <vincent.donnefort@arm.com> wrote:
> >
> > Before being migrated to a new CPU, a task sees its PELT values
> > synchronized with rq last_update_time. Once done, that same task will also
> > have its sched_avg last_update_time reset. This means the time between
> > the migration and the last clock update (B) will not be accounted for in
> > util_avg and a discontinuity will appear. This issue is amplified by the
> > PELT clock scaling. If the clock hasn't been updated while the CPU is
> > idle, clock_pelt will not be aligned with clock_task and that time (A)
> > will be also lost.
> >
> >    ---------|----- A -----|-----------|------- B -----|>
> >         clock_pelt   clock_task     clock            now
> >
> > This is especially problematic for asymmetric CPU capacity systems which
> > need stable util_avg signals for task placement and energy estimation.
> >
> > Ideally, this problem would be solved by updating the runqueue clocks
> > before the migration. But that would require taking the runqueue lock
> > which is quite expensive [1]. Instead estimate the missing time and update
> > the task util_avg with that value:
> >
> >   A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
> >
> > Neither clock_task, clock_pelt nor clock can be accessed without the
> > runqueue lock. The new runqueue clock_pelt_lag is therefore created and
> > encodes those three values.
> >
> >   clock_pelt_lag = clock - clock_task + clock_pelt
> >
> > And we can then write the missing time as follows:
> >
> >   A + B = sched_clock_cpu() - clock_pelt_lag
> >
> > The B. part of the missing time is however an estimation that doesn't take
> > into account IRQ and Paravirt time.
> >
> > Now we have an estimation for A + B, we can create an estimator for the
> > PELT value at the time of the migration. We need for this purpose to
> > inject last_update_time which is a combination of both clock_pelt and
> > lost_idle_time. The latter is a time value which is completely lost from a
> > PELT point of view and must be ignored. And finally, we can write:
> >
> >   rq_clock_pelt_estimator() = last_update_time + A + B
> >                             = last_update_time +
> >                                    sched_clock_cpu() - clock_pelt_lag
> >
> > [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
> >
> > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 06cf7620839a..11c6aeef4583 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -618,6 +618,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> >         }
> >  }
> >
> > +static void update_rq_clock_pelt_lag(struct rq *rq)
> > +{
> > +       u64_u32_store(rq->clock_pelt_lag,
> > +                     rq->clock - rq->clock_task + rq->clock_pelt);
> 
> This has several shortfalls:
> - have a look at cfs_rq_clock_pelt() and rq_clock_pelt(). What you
> name clock_pelt in your commit message and is used to update PELT and
> saved in se->avg.last_update_time is : rq->clock_pelt -
> rq->lost_idle_time - cfs_rq->throttled_clock_task_time

That's why, the PELT "lag" is added onto se->avg.last_update_time. (see the last
paragraph of the commit message) The estimator is just a time delta, that is
added on top of the entity's last_update_time. I don't see any problem with the
lost_idle_time here.

> - you are doing this whatever the state of the cpu : idle or not. But
> the clock cycles are not accounted for in the same way in both cases.

If the CPU is idle and clock_pelt == clock_task, the component A of the
estimator would be 0 and we would only account for how outdated the rq's
clock is, i.e. component B.

> - (B) doesn't seem to be accurate as you skip irq and steal time
> accounting and you don't apply any scale invariance if the cpu is not
> idle

The missing irq and paravirt time is the reason why it is called "estimator".
But maybe there's a chance of improving this part with a lockless version of
rq->prev_irq_time and rq->prev_steal_time_rq?
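
Something like this maybe (untested sketch; rq_irq_steal_lag() is a made-up
helper and, on 32-bit, both prev_* fields would need the same u64_u32 copy
treatment as clock_pelt_lag):

  static u64 rq_irq_steal_lag(struct rq *rq)
  {
          u64 lag = 0;

  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
          /* irq time not yet folded into clock_task by update_rq_clock_task() */
          lag += irq_time_read(cpu_of(rq)) - READ_ONCE(rq->prev_irq_time);
  #endif
  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
          /* same for the not-yet-accounted steal time */
          lag += paravirt_steal_clock(cpu_of(rq)) -
                 READ_ONCE(rq->prev_steal_time_rq);
  #endif
          return lag;
  }

That delta could then be subtracted from the estimated (B) part.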

> - IIUC your explanation in the commit message above, the (A) period
> seems to be a problem only when idle but you apply it unconditionally.

If the CPU is idle (and clock_pelt == clock_task), only the B part would be
worth something:

  A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
                      A                            B
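
To put numbers on it: say the CPU went idle with clock == clock_task ==
clock_pelt == 100, and sched_clock_cpu() now reads 130. Then:

  A + B = [100 - 100] + [130 - 100] = 0 + 30

i.e. only the 30 units the rq clock is behind by (B), which is what an
update_rq_clock() right before the migration would have accounted for.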

> If cpu is idle you can assume that clock_pelt should be equal to
> clock_task but you can't if cpu is not idle otherwise your sync will
> be inaccurate and defeat the primary goal of this patch. If your
> problem with clock_pelt is that the pending idle time is not accounted
> for when entering idle but only at the next update (update blocked
> load or wakeup of a thread). This patch below should fix this and
> remove your A.

That would help slightly the current situation, but this part is already
covered by the estimator.

> 
> 
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index e06071bf3472..855877be4dd8 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -114,6 +114,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
>  {
>         u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT)
> - LOAD_AVG_MAX;
>         u32 util_sum = rq->cfs.avg.util_sum;
> +       u64 now = rq_clock_task(rq);
>         util_sum += rq->avg_rt.util_sum;
>         util_sum += rq->avg_dl.util_sum;
> 
> @@ -127,7 +128,10 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
>          * rq's clock_task.
>          */
>         if (util_sum >= divider)
> -               rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> +               rq->lost_idle_time += now - rq->clock_pelt;
> +
> +       /* The rq is idle, we can sync to clock_task */
> +       rq->clock_pelt  = now;
>  }
> 
>  static inline u64 rq_clock_pelt(struct rq *rq)
> 
> ---
> 
> 
> > +}
> > +
> >  /*
> >   * RQ-clock updating methods:
> >   */
> > @@ -674,6 +680,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >                 update_irq_load_avg(rq, irq_delta + steal);
> >  #endif
> >         update_rq_clock_pelt(rq, delta);
> > +       update_rq_clock_pelt_lag(rq);
> >  }
> >
> >  void update_rq_clock(struct rq *rq)
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 99ea9540ece4..046d5397eb8a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6852,6 +6852,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >
> >  static void detach_entity_cfs_rq(struct sched_entity *se);
> >
> > +static u64 rq_clock_pelt_estimator(struct rq *rq, u64 last_update_time)
> > +{
> > +       u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
> > +                      u64_u32_load(rq->clock_pelt_lag);
> 
> Have you evaluated the impact of calling sched_clock_cpu(cpu_of(rq))
> for a remote cpu ? especially with a huge number of migration and
> concurrent access from several cpus

I have not, but I will have a look.

> 
> > +
> > +       return last_update_time + pelt_lag;
> > +}
> > +
> >  /*
> >   * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
> >   * cfs_rq_of(p) references at time of call are still valid and identify the
> > @@ -6859,6 +6867,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
> >   */
> >  static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> >  {
> > +       struct sched_entity *se = &p->se;
> > +       struct rq *rq = task_rq(p);
> > +
> >         /*
> >          * As blocked tasks retain absolute vruntime the migration needs to
> >          * deal with this by subtracting the old and adding the new
> > @@ -6866,7 +6877,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> >          * the task on the new runqueue.
> >          */
> >         if (READ_ONCE(p->__state) == TASK_WAKING) {
> > -               struct sched_entity *se = &p->se;
> >                 struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >
> >                 se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> > @@ -6877,26 +6887,32 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> >                  * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
> >                  * rq->lock and can modify state directly.
> >                  */
> > -               lockdep_assert_rq_held(task_rq(p));
> > -               detach_entity_cfs_rq(&p->se);
> > +               lockdep_assert_rq_held(rq);
> > +               detach_entity_cfs_rq(se);
> >
> >         } else {
> > +               u64 now;
> > +
> > +               remove_entity_load_avg(se);
> > +
> >                 /*
> > -                * We are supposed to update the task to "current" time, then
> > -                * its up to date and ready to go to new CPU/cfs_rq. But we
> > -                * have difficulty in getting what current time is, so simply
> > -                * throw away the out-of-date time. This will result in the
> > -                * wakee task is less decayed, but giving the wakee more load
> > -                * sounds not bad.
> > +                * Here, the task's PELT values have been updated according to
> > +                * the current rq's clock. But if that clock hasn't been
> > +                * updated in a while, a substantial idle time will be missed,
> > +                * leading to an inflation after wake-up on the new rq.
> > +                *
> > +                * Estimate the PELT clock lag, and update sched_avg to ensure
> > +                * PELT continuity after migration.
> >                  */
> > -               remove_entity_load_avg(&p->se);
> > +               now = rq_clock_pelt_estimator(rq, se->avg.last_update_time);
> > +               __update_load_avg_blocked_se(now, se);
> >         }
> >
> >         /* Tell new CPU we are migrated */
> > -       p->se.avg.last_update_time = 0;
> > +       se->avg.last_update_time = 0;
> >
> >         /* We have migrated, no longer consider this task hot */
> > -       p->se.exec_start = 0;
> > +       se->exec_start = 0;
> >
> >         update_scan_period(p, new_cpu);
> >  }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index f1a445efdc63..fdf2a9e54c0e 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1027,8 +1027,13 @@ struct rq {
> >         /* Ensure that all clocks are in the same cache line */
> >         u64                     clock_task ____cacheline_aligned;
> >         u64                     clock_pelt;
> > +       u64                     clock_pelt_lag;
> >         unsigned long           lost_idle_time;
> >
> > +#ifndef CONFIG_64BIT
> > +       u64                     clock_pelt_lag_copy;
> > +#endif
> > +
> >         atomic_t                nr_iowait;
> >
> >  #ifdef CONFIG_SCHED_DEBUG
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-18 10:56     ` Vincent Donnefort
@ 2022-01-19  9:54       ` Vincent Guittot
  2022-01-19 11:59         ` Vincent Donnefort
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2022-01-19  9:54 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, Chris.Redpath, qperret,
	Lukasz.Luba

On Tue, 18 Jan 2022 at 11:56, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> On Mon, Jan 17, 2022 at 06:31:25PM +0100, Vincent Guittot wrote:
> > On Wed, 12 Jan 2022 at 17:14, Vincent Donnefort
> > <vincent.donnefort@arm.com> wrote:
> > >
> > > Before being migrated to a new CPU, a task sees its PELT values
> > > synchronized with rq last_update_time. Once done, that same task will also
> > > have its sched_avg last_update_time reset. This means the time between
> > > the migration and the last clock update (B) will not be accounted for in
> > > util_avg and a discontinuity will appear. This issue is amplified by the
> > > PELT clock scaling. If the clock hasn't been updated while the CPU is
> > > idle, clock_pelt will not be aligned with clock_task and that time (A)
> > > will be also lost.
> > >
> > >    ---------|----- A -----|-----------|------- B -----|>
> > >         clock_pelt   clock_task     clock            now
> > >
> > > This is especially problematic for asymmetric CPU capacity systems which
> > > need stable util_avg signals for task placement and energy estimation.
> > >
> > > Ideally, this problem would be solved by updating the runqueue clocks
> > > before the migration. But that would require taking the runqueue lock
> > > which is quite expensive [1]. Instead estimate the missing time and update
> > > the task util_avg with that value:
> > >
> > >   A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
> > >
> > > Neither clock_task, clock_pelt nor clock can be accessed without the
> > > runqueue lock. The new runqueue clock_pelt_lag is therefore created and
> > > encodes those three values.
> > >
> > >   clock_pelt_lag = clock - clock_task + clock_pelt
> > >
> > > And we can then write the missing time as follows:
> > >
> > >   A + B = sched_clock_cpu() - clock_pelt_lag
> > >
> > > The B. part of the missing time is however an estimation that doesn't take
> > > into account IRQ and Paravirt time.
> > >
> > > Now we have an estimation for A + B, we can create an estimator for the
> > > PELT value at the time of the migration. We need for this purpose to
> > > inject last_update_time which is a combination of both clock_pelt and
> > > lost_idle_time. The latter is a time value which is completely lost from a
> > > PELT point of view and must be ignored. And finally, we can write:
> > >
> > >   rq_clock_pelt_estimator() = last_update_time + A + B
> > >                             = last_update_time +
> > >                                    sched_clock_cpu() - clock_pelt_lag
> > >
> > > [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
> > >
> > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 06cf7620839a..11c6aeef4583 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -618,6 +618,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> > >         }
> > >  }
> > >
> > > +static void update_rq_clock_pelt_lag(struct rq *rq)
> > > +{
> > > +       u64_u32_store(rq->clock_pelt_lag,
> > > +                     rq->clock - rq->clock_task + rq->clock_pelt);
> >
> > This has several shortfalls:
> > - have a look at cfs_rq_clock_pelt() and rq_clock_pelt(). What you
> > name clock_pelt in your commit message and is used to update PELT and
> > saved in se->avg.last_update_time is : rq->clock_pelt -
> > rq->lost_idle_time - cfs_rq->throttled_clock_task_time
>
> That's why, the PELT "lag" is added onto se->avg.last_update_time. (see the last
> paragraph of the commit message) The estimator is just a time delta, that is
> added on top of the entity's last_update_time. I don't see any problem with the
> lost_idle_time here.

lost_idle_time is updated before entering idle and after your
clock_pelt_lag has been updated. This means that the delta that you
are computing can be wrong

I haven't looked in detail, but a similar problem probably happens for
throttled_clock_task_time.

>
> > - you are doing this whatever the state of the cpu : idle or not. But
> > the clock cycles are not accounted for in the same way in both cases.
>
> If the CPU is idle and clock_pelt == clock_task, the component A of the
> estimator would be 0 and we would only account for how outdated the rq's
> clock is, i.e. component B.

And if cpu is not idle, you can't apply the diff between clk_pelt and clock_task

>
> > - (B) doesn't seem to be accurate as you skip irq and steal time
> > accounting and you don't apply any scale invariance if the cpu is not
> > idle
>
> The missing irq and paravirt time is the reason why it is called "estimator".
> But maybe there's a chance of improving this part with a lockless version of
> rq->prev_irq_time and rq->prev_steal_time_rq?
>
> > - IIUC your explanation in the commit message above, the (A) period
> > seems to be a problem only when idle but you apply it unconditionally.
>
> If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> worth something:
>
>   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
>                       A                            B
>
> > If cpu is idle you can assume that clock_pelt should be equal to
> > clock_task but you can't if cpu is not idle otherwise your sync will
> > be inaccurate and defeat the primary goal of this patch. If your
> > problem with clock_pelt is that the pending idle time is not accounted
> > for when entering idle but only at the next update (update blocked
> > load or wakeup of a thread). This patch below should fix this and
> > remove your A.
>
> That would help slightly the current situation, but this part is already
> covered by the estimator.

But the estimator, as you name it, is wrong because the A part can't be
applied unconditionally.

>
> >
> >
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index e06071bf3472..855877be4dd8 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -114,6 +114,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> >  {
> >         u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT)
> > - LOAD_AVG_MAX;
> >         u32 util_sum = rq->cfs.avg.util_sum;
> > +       u64 now = rq_clock_task(rq);
> >         util_sum += rq->avg_rt.util_sum;
> >         util_sum += rq->avg_dl.util_sum;
> >
> > @@ -127,7 +128,10 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> >          * rq's clock_task.
> >          */
> >         if (util_sum >= divider)
> > -               rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> > +               rq->lost_idle_time += now - rq->clock_pelt;
> > +
> > +       /* The rq is idle, we can sync to clock_task */
> > +       rq->clock_pelt  = now;
> >  }
> >
> >  static inline u64 rq_clock_pelt(struct rq *rq)
> >
> > ---
> >
> >
> > > +}
> > > +
> > >  /*
> > >   * RQ-clock updating methods:
> > >   */
> > > @@ -674,6 +680,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > >                 update_irq_load_avg(rq, irq_delta + steal);
> > >  #endif
> > >         update_rq_clock_pelt(rq, delta);
> > > +       update_rq_clock_pelt_lag(rq);
> > >  }
> > >
> > >  void update_rq_clock(struct rq *rq)
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 99ea9540ece4..046d5397eb8a 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6852,6 +6852,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > >
> > >  static void detach_entity_cfs_rq(struct sched_entity *se);
> > >
> > > +static u64 rq_clock_pelt_estimator(struct rq *rq, u64 last_update_time)
> > > +{
> > > +       u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
> > > +                      u64_u32_load(rq->clock_pelt_lag);
> >
> > Have you evaluated the impact of calling sched_clock_cpu(cpu_of(rq))
> > for a remote cpu ? especially with a huge number of migration and
> > concurrent access from several cpus
>
> I have not, but I will have a look.
>
> >
> > > +
> > > +       return last_update_time + pelt_lag;
> > > +}
> > > +
> > >  /*
> > >   * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
> > >   * cfs_rq_of(p) references at time of call are still valid and identify the
> > > @@ -6859,6 +6867,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
> > >   */
> > >  static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> > >  {
> > > +       struct sched_entity *se = &p->se;
> > > +       struct rq *rq = task_rq(p);
> > > +
> > >         /*
> > >          * As blocked tasks retain absolute vruntime the migration needs to
> > >          * deal with this by subtracting the old and adding the new
> > > @@ -6866,7 +6877,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> > >          * the task on the new runqueue.
> > >          */
> > >         if (READ_ONCE(p->__state) == TASK_WAKING) {
> > > -               struct sched_entity *se = &p->se;
> > >                 struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > >
> > >                 se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> > > @@ -6877,26 +6887,32 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> > >                  * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
> > >                  * rq->lock and can modify state directly.
> > >                  */
> > > -               lockdep_assert_rq_held(task_rq(p));
> > > -               detach_entity_cfs_rq(&p->se);
> > > +               lockdep_assert_rq_held(rq);
> > > +               detach_entity_cfs_rq(se);
> > >
> > >         } else {
> > > +               u64 now;
> > > +
> > > +               remove_entity_load_avg(se);
> > > +
> > >                 /*
> > > -                * We are supposed to update the task to "current" time, then
> > > -                * its up to date and ready to go to new CPU/cfs_rq. But we
> > > -                * have difficulty in getting what current time is, so simply
> > > -                * throw away the out-of-date time. This will result in the
> > > -                * wakee task is less decayed, but giving the wakee more load
> > > -                * sounds not bad.
> > > +                * Here, the task's PELT values have been updated according to
> > > +                * the current rq's clock. But if that clock hasn't been
> > > +                * updated in a while, a substantial idle time will be missed,
> > > +                * leading to an inflation after wake-up on the new rq.
> > > +                *
> > > +                * Estimate the PELT clock lag, and update sched_avg to ensure
> > > +                * PELT continuity after migration.
> > >                  */
> > > -               remove_entity_load_avg(&p->se);
> > > +               now = rq_clock_pelt_estimator(rq, se->avg.last_update_time);
> > > +               __update_load_avg_blocked_se(now, se);
> > >         }
> > >
> > >         /* Tell new CPU we are migrated */
> > > -       p->se.avg.last_update_time = 0;
> > > +       se->avg.last_update_time = 0;
> > >
> > >         /* We have migrated, no longer consider this task hot */
> > > -       p->se.exec_start = 0;
> > > +       se->exec_start = 0;
> > >
> > >         update_scan_period(p, new_cpu);
> > >  }
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index f1a445efdc63..fdf2a9e54c0e 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -1027,8 +1027,13 @@ struct rq {
> > >         /* Ensure that all clocks are in the same cache line */
> > >         u64                     clock_task ____cacheline_aligned;
> > >         u64                     clock_pelt;
> > > +       u64                     clock_pelt_lag;
> > >         unsigned long           lost_idle_time;
> > >
> > > +#ifndef CONFIG_64BIT
> > > +       u64                     clock_pelt_lag_copy;
> > > +#endif
> > > +
> > >         atomic_t                nr_iowait;
> > >
> > >  #ifdef CONFIG_SCHED_DEBUG
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-19  9:54       ` Vincent Guittot
@ 2022-01-19 11:59         ` Vincent Donnefort
  2022-01-19 13:22           ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-19 11:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, Chris.Redpath, qperret,
	Lukasz.Luba

[...]

> > >
> > > This has several shortfalls:
> > > - have a look at cfs_rq_clock_pelt() and rq_clock_pelt(). What you
> > > name clock_pelt in your commit message and is used to update PELT and
> > > saved in se->avg.last_update_time is : rq->clock_pelt -
> > > rq->lost_idle_time - cfs_rq->throttled_clock_task_time
> >
> > That's why, the PELT "lag" is added onto se->avg.last_update_time. (see the last
> > paragraph of the commit message) The estimator is just a time delta, that is
> > added on top of the entity's last_update_time. I don't see any problem with the
> > lost_idle_time here.
> 
> lost_idle_time is updated before entering idle and after your
> clock_pelt_lag has been updated. This means that the delta that you
> are computing can be wrong
> 
> I haven't looked in detail, but a similar problem probably happens for
> throttled_clock_task_time.
> 
> >
> > > - you are doing this whatever the state of the cpu : idle or not. But
> > > the clock cycles are not accounted for in the same way in both cases.
> >
> > If the CPU is idle and clock_pelt == clock_task, the component A of the
> > estimator would be 0 and we would only account for how outdated the rq's
> > clock is, i.e. component B.
> 
> And if cpu is not idle, you can't apply the diff between clk_pelt and clock_task
> 
> >
> > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > accounting and you don't apply any scale invariance if the cpu is not
> > > idle
> >
> > The missing irq and paravirt time is the reason why it is called "estimator".
> > But maybe there's a chance of improving this part with a lockless version of
> > rq->prev_irq_time and rq->prev_steal_time_rq?
> >
> > > - IIUC your explanation in the commit message above, the (A) period
> > > seems to be a problem only when idle but you apply it unconditionally.
> >
> > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > worth something:
> >
> >   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> >                       A                            B
> >
> > > If cpu is idle you can assume that clock_pelt should be equal to
> > > clock_task but you can't if cpu is not idle otherwise your sync will
> > > be inaccurate and defeat the primary goal of this patch. If your
> > > problem with clock_pelt is that the pending idle time is not accounted
> > > for when entering idle but only at the next update (update blocked
> > > load or wakeup of a thread). This patch below should fix this and
> > > remove your A.
> >
> > That would help slightly the current situation, but this part is already
> > covered by the estimator.
> 
> But the estimator, as you name it, is wrong because the A part can't be
> applied unconditionally.

Hum, it is used only in the !active migration. So we know the task was sleeping
before that migration. As a consequence, the time we need to account is "sleeping"
time from the task point of view, which is clock_pelt == clock_task (for
__update_load_avg_blocked_se()). Otherwise, we would only decay with the
"wallclock" idle time instead of the "scaled" one wouldn't we?


     +-------------+-------------- 
     |   Task A    |    Task B    .....
              ^    ^             ^
              |    |          migrate A
	      |    |             |
              |    |             |
              |    |             |
	      |    |<----------->| 
              |  Wallclock Task A idle time
              |<---------------->|
	    "Scaled" Task A idle time


[...]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-19 11:59         ` Vincent Donnefort
@ 2022-01-19 13:22           ` Vincent Guittot
  2022-01-20 21:12             ` Vincent Donnefort
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2022-01-19 13:22 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, Chris.Redpath, qperret,
	Lukasz.Luba

On Wed, 19 Jan 2022 at 12:59, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> [...]
>
> > > >
> > > > This has several shortfalls:
> > > > - have a look at cfs_rq_clock_pelt() and rq_clock_pelt(). What you
> > > > name clock_pelt in your commit message and is used to update PELT and
> > > > saved in se->avg.last_update_time is : rq->clock_pelt -
> > > > rq->lost_idle_time - cfs_rq->throttled_clock_task_time
> > >
> > > That's why, the PELT "lag" is added onto se->avg.last_update_time. (see the last
> > > paragraph of the commit message) The estimator is just a time delta, that is
> > > added on top of the entity's last_update_time. I don't see any problem with the
> > > lost_idle_time here.
> >
> > lost_idle_time is updated before entering idle and after your
> > clock_pelt_lag has been updated. This means that the delta that you
> > are computing can be wrong
> >
> > I haven't looked in detail, but a similar problem probably happens for
> > throttled_clock_task_time.
> >
> > >
> > > > - you are doing this whatever the state of the cpu : idle or not. But
> > > > the clock cycles are not accounted for in the same way in both cases.
> > >
> > > If the CPU is idle and clock_pelt == clock_task, the component A of the
> > > estimator would be 0 and we would only account for how outdated the rq's
> > > clock is, i.e. component B.
> >
> > And if cpu is not idle, you can't apply the diff between clk_pelt and clock_task
> >
> > >
> > > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > > accounting and you don't apply any scale invariance if the cpu is not
> > > > idle
> > >
> > > The missing irq and paravirt time is the reason why it is called "estimator".
> > > But maybe there's a chance of improving this part with a lockless version of
> > > rq->prev_irq_time and rq->prev_steal_time_rq?
> > >
> > > > - IIUC your explanation in the commit message above, the (A) period
> > > > seems to be a problem only when idle but you apply it unconditionally.
> > >
> > > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > > worth something:
> > >
> > >   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> > >                       A                            B
> > >
> > > > If cpu is idle you can assume that clock_pelt should be equal to
> > > > clock_task but you can't if cpu is not idle otherwise your sync will
> > > > be inaccurate and defeat the primary goal of this patch. If your
> > > > problem with clock_pelt is that the pending idle time is not accounted
> > > > for when entering idle but only at the next update (update blocked
> > > > load or wakeup of a thread). This patch below should fix this and
> > > > remove your A.
> > >
> > > That would help slightly the current situation, but this part is already
> > > covered by the estimator.
> >
> > But the estimator, as you name it, is wrong because the A part can't be
> > applied unconditionally.
>
> Hum, it is used only in the !active migration. So we know the task was sleeping
> before that migration. As a consequence, the time we need to account is "sleeping"
> time from the task point of view, which is clock_pelt == clock_task (for
> __update_load_avg_blocked_se()). Otherwise, we would only decay with the
> "wallclock" idle time instead of the "scaled" one wouldn't we?

clock_pelt == clock_task only when cpu is idle and after updating
lost_idle_time but you have no idea of the state of the cpu when
migrating the task

>
>
>      +-------------+--------------
>      |   Task A    |    Task B    .....
>               ^    ^             ^
>               |    |          migrate A
>               |    |             |
>               |    |             |
>               |    |             |
>               |    |<----------->|
>               |  Wallclock Task A idle time
>               |<---------------->|
>             "Scaled" Task A idle time
>
>
> [...]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-19 13:22           ` Vincent Guittot
@ 2022-01-20 21:12             ` Vincent Donnefort
  2022-01-21 15:27               ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2022-01-20 21:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, Chris.Redpath, qperret,
	Lukasz.Luba

[...]

> > > And if cpu is not idle, you can't apply the diff between clk_pelt and clock_task
> > >
> > > >
> > > > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > > > accounting and you don't apply any scale invariance if the cpu is not
> > > > > idle
> > > >
> > > > The missing irq and paravirt time is the reason why it is called "estimator".
> > > > But maybe there's a chance of improving this part with a lockless version of
> > > > rq->prev_irq_time and rq->prev_steal_time_rq?
> > > >
> > > > > - IIUC your explanation in the commit message above, the (A) period
> > > > > seems to be a problem only when idle but you apply it unconditionally.
> > > >
> > > > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > > > worth something:
> > > >
> > > >   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> > > >                       A                            B
> > > >
> > > > > If cpu is idle you can assume that clock_pelt should be equal to
> > > > > clock_task but you can't if cpu is not idle otherwise your sync will
> > > > > be inaccurate and defeat the primary goal of this patch. If your
> > > > > problem with clock_pelt is that the pending idle time is not accounted
> > > > > for when entering idle but only at the next update (update blocked
> > > > > load or wakeup of a thread). This patch below should fix this and
> > > > > remove your A.
> > > >
> > > > That would help slightly the current situation, but this part is already
> > > > covered by the estimator.
> > >
> > > But the estimator, as you name it, is wrong because the A part can't be
> > > applied unconditionally.
> >
> > Hum, it is used only in the !active migration. So we know the task was sleeping
> > before that migration. As a consequence, the time we need to account is "sleeping"
> > time from the task point of view, which is clock_pelt == clock_task (for
> > __update_load_avg_blocked_se()). Otherwise, we would only decay with the
> > "wallclock" idle time instead of the "scaled" one wouldn't we?
> 
> clock_pelt == clock_task only when cpu is idle and after updating
> lost_idle_time but you have no idea of the state of the cpu when
> migrating the task

I was just applying the time scaling at the task level. Why should it depend on
the CPU state?

The situation would be as follows:

                    <--X--> <--Y-->
           +-------+-------+-------+
CPUX    ___|   B   |   A   |   B   |___
                                  ^
                               migrate A
                    
In such a scenario, CPUX's PELT clock is indeed scaled. The Task A running
time (X) has already been accounted, so what's left is to get the idle time (Y)
contribution right. We would usually rely on the CPU being idle for the
catch-up, and that time would be Y + (X - scaled(X)). Without the catch-up, we
would only account scaled(Y) for the sleeping time Y at the migration. Applied
to the same graph as update_rq_clock_pelt()'s:

 clock_task    | 1| 2| 3| 4| 5| 6| 7| 8|
 clock_pelt    | 1   | 2   | 3   | 4   |  (CPU's running, clock_pelt is scaled)
 expected      | 1   | 2   | 5| 6| 7| 8|
               <---- X ---><--- Y ---->
 Task A -------************----------
                                   ^ 
                               migrate A

The contribution for Task A's idle time at the migration (as we know we won't
have another chance to catch up with clock_task later) should be 6, not 2,
regardless of the CPU state.
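
Plugging the same numbers into the estimator (assuming the rq clock is up to
date at the migration, so B == 0, and that remove_entity_load_avg() synced
last_update_time with clock_pelt, i.e. 4):

  lag = (clock_task - clock_pelt) + (sched_clock_cpu() - clock)
      = (8 - 4) + 0
      = 4

  rq_clock_pelt_estimator() = last_update_time + lag = 4 + 4 = 8

Task A is then decayed over 8 - 2 = 6, instead of the 4 - 2 = 2 (scaled(Y))
we would get by syncing with clock_pelt alone.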

_But_ indeed, there would be a risk of hitting the lost_idle_time threshold and
decaying too much... (which is absolutely not handled in the current version). So
now, if we don't want to bother too much, we could simplify the problem and
say (which is true with NOHZ_IDLE) that if the CPU is running, the clock can't
be that old anyway. So we should only care about the idle case, which is
mitigated by your proposed snippet and would allow me to get rid of the [A]
part (clock_task - clock_pelt).
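
Concretely, that would look something like this (rough sketch only, keeping
the v2 names):

  /* update_rq_clock_task() would then only track how stale rq->clock is */
  static void update_rq_clock_pelt_lag(struct rq *rq)
  {
          u64_u32_store(rq->clock_pelt_lag, rq->clock);
  }

  static u64 rq_clock_pelt_estimator(struct rq *rq, u64 last_update_time)
  {
          /* only the (B) part: how long since the rq clock was last updated */
          u64 lag = sched_clock_cpu(cpu_of(rq)) -
                    u64_u32_load(rq->clock_pelt_lag);

          return last_update_time + lag;
  }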

As for the sched_clock_cpu() usage, I haven't measured anything yet, but I
notice it's already used in the wakeup path, in ttwu_queue_wakelist().


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
  2022-01-20 21:12             ` Vincent Donnefort
@ 2022-01-21 15:27               ` Vincent Guittot
  0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2022-01-21 15:27 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, Chris.Redpath, qperret,
	Lukasz.Luba

On Thu, 20 Jan 2022 at 22:12, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> [...]
>
> > > > And if cpu is not idle, you can't apply the diff between clk_pelt and clock_task
> > > >
> > > > >
> > > > > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > > > > accounting and you don't apply any scale invariance if the cpu is not
> > > > > > idle
> > > > >
> > > > > The missing irq and paravirt time is the reason why it is called "estimator".
> > > > > But maybe there's a chance of improving this part with a lockless version of
> > > > > rq->prev_irq_time and rq->prev_steal_time_rq?
> > > > >
> > > > > > - IIUC your explanation in the commit message above, the (A) period
> > > > > > seems to be a problem only when idle but you apply it unconditionally.
> > > > >
> > > > > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > > > > worth something:
> > > > >
> > > > >   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> > > > >                       A                            B
> > > > >
> > > > > > If cpu is idle you can assume that clock_pelt should be equal to
> > > > > > clock_task but you can't if cpu is not idle otherwise your sync will
> > > > > > be inaccurate and defeat the primary goal of this patch. If your
> > > > > > problem with clock_pelt is that the pending idle time is not accounted
> > > > > > for when entering idle but only at the next update (update blocked
> > > > > > load or wakeup of a thread). This patch below should fix this and
> > > > > > remove your A.
> > > > >
> > > > > That would help slightly the current situation, but this part is already
> > > > > covered by the estimator.
> > > >
> > > > But the estimator, as you name it, is wrong because the A part can't be
> > > > applied unconditionally.
> > >
> > > Hum, it is used only in the !active migration. So we know the task was sleeping
> > > before that migration. As a consequence, the time we need to account is "sleeping"
> > > time from the task point of view, which is clock_pelt == clock_task (for
> > > __update_load_avg_blocked_se()). Otherwise, we would only decay with the
> > > "wallclock" idle time instead of the "scaled" one wouldn't we?
> >
> > clock_pelt == clock_task only when cpu is idle and after updating
> > lost_idle_time but you have no idea of the state of the cpu when
> > migrating the task
>
> I was just applying the time scaling at the task level. Why should it depend on
> the CPU state?
>
> The situation would be as follows:
>
>                     <--X--> <--Y-->
>            +-------+-------+-------+
> CPUX    ___|   B   |   A   |   B   |___
>                                   ^
>                                migrate A
>
> In such a scenario, CPUX's PELT clock is indeed scaled. The Task A running
> time (X) has already been accounted, so what's left is to get the idle time (Y)
> contribution right. We would usually rely on the CPU being idle for the
> catch-up, and that time would be Y + (X - scaled(X)). Without the catch-up, we
> would only account scaled(Y) for the sleeping time Y at the migration. Applied
> to the same graph as update_rq_clock_pelt()'s:
>
>  clock_task    | 1| 2| 3| 4| 5| 6| 7| 8|
>  clock_pelt    | 1   | 2   | 3   | 4   |  (CPU's running, clock_pelt is scaled)
>  expected      | 1   | 2   | 5| 6| 7| 8|
>                <---- X ---><--- Y ---->
>  Task A -------************----------
>                                    ^
>                                migrate A
>
> The contribution for Task A's idle time at the migration (as we know we won't
> have another chance to catch up with clock_task later) should be 6, not 2,
> regardless of the CPU state.

If task A wakes up on the same CPU, we sync with the scaled clock_pelt,
so why use something different here?

>
> _But_ indeed, there would be a risk of hitting the lost_idle_time threshold and
> decaying too much... (which is absolutely not handled in the current version). So
> now, if we don't want to bother too much, we could simplify the problem and
> say (which is true with NOHZ_IDLE) that if the CPU is running, the clock can't
> be that old anyway. So we should only care about the idle case, which is
> mitigated by your proposed snippet and would allow me to get rid of the [A]
> part (clock_task - clock_pelt).
>
> As for the sched_clock_cpu() usage, I haven't measured anything yet, but I
> notice it's already used in the wakeup path, in ttwu_queue_wakelist().
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-01-21 15:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
2022-01-17 16:11   ` Tao Zhou
2022-01-17 19:42     ` Vincent Donnefort
2022-01-17 23:44       ` Tao Zhou
2022-01-12 16:12 ` [PATCH v2 2/7] sched/fair: Decay task PELT values during migration Vincent Donnefort
2022-01-17 17:31   ` Vincent Guittot
2022-01-18 10:56     ` Vincent Donnefort
2022-01-19  9:54       ` Vincent Guittot
2022-01-19 11:59         ` Vincent Donnefort
2022-01-19 13:22           ` Vincent Guittot
2022-01-20 21:12             ` Vincent Donnefort
2022-01-21 15:27               ` Vincent Guittot
2022-01-12 16:12 ` [PATCH v2 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 4/7] sched/fair: Rename select_idle_mask to select_rq_mask Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
2022-01-12 19:44   ` kernel test robot
2022-01-12 19:44     ` kernel test robot
2022-01-17 13:17   ` Dietmar Eggemann
2022-01-18  9:46     ` Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 7/7] sched/fair: Remove the energy margin " Vincent Donnefort
